Commit b0ae13a

add data cleaning tutorial

1 parent de76901 · commit b0ae13a

File tree

14 files changed: +440 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions

@@ -120,6 +120,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Plot Weather Temperature in Python](https://www.thepythoncode.com/article/interactive-weather-plot-with-matplotlib-and-requests). ([code](general/interactive-weather-plot/))
 - [How to Generate SVG Country Maps in Python](https://www.thepythoncode.com/article/generate-svg-country-maps-python). ([code](general/generate-svg-country-map))
 - [How to Query the Ethereum Blockchain with Python](https://www.thepythoncode.com/article/query-ethereum-blockchain-with-python). ([code](general/query-ethereum))
+- [Data Cleaning with Pandas in Python](https://www.thepythoncode.com/article/data-cleaning-using-pandas-in-python). ([code](general/data-cleaning-pandas))
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
# [Data Cleaning with Pandas in Python](https://www.thepythoncode.com/article/data-cleaning-using-pandas-in-python)
Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

print(data_frames.head(10))
Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

print(data_frames.info())
Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

# Data Type Conversion
# Remove '$' from donation strings
data_frames['donation'] = data_frames['donation'].str.strip('$')

# Convert donation strings into numerical data type
data_frames['donation'] = data_frames['donation'].astype('float64')

print(data_frames.head(10))
print(data_frames.info())
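
Note: astype('float64') will raise if any donation value cannot be parsed after stripping the '$'. A minimal alternative sketch, not part of the commit and assuming the same 'donation' column, that coerces unparseable values to NaN with pd.to_numeric:

import pandas as pd

data_frames = pd.read_csv(r'simulated_data.csv')

# Hypothetical variant: strip '$' and coerce bad values to NaN instead of raising
data_frames['donation'] = pd.to_numeric(
    data_frames['donation'].str.strip('$'), errors='coerce'
)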
Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

# Data Type Conversion
# Remove '$' from donation strings
data_frames['donation'] = data_frames['donation'].str.strip('$')

# Convert donation strings into numerical data type
data_frames['donation'] = data_frames['donation'].astype('float64')


# Handle Data Inconsistencies
# Capitalize strings
data_frames['street_address'] = data_frames['street_address'].str.split()

def capitalize_words(arr):
    for index, word in enumerate(arr):
        if index == 0:
            pass
        else:
            arr[index] = word.capitalize()

# capitalize_words changes each list in place, so apply() is used only for its side effect
data_frames['street_address'].apply(lambda x: capitalize_words(x))
data_frames['street_address'] = data_frames['street_address'].str.join(' ')

print(data_frames['street_address'])
Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

# Data Type Conversion
# Remove '$' from donation strings
data_frames['donation'] = data_frames['donation'].str.strip('$')

# Convert donation strings into numerical data type
data_frames['donation'] = data_frames['donation'].astype('float64')


# Handle Data Inconsistencies
# Normalize strings
data_frames['street_address'] = data_frames['street_address'].str.split()

def normalize_words(arr):
    for index, word in enumerate(arr):
        if index == 0:
            pass
        else:
            arr[index] = normalize(word)

def normalize(word):
    if word.lower() == 'st':
        word = 'street'
    elif word.lower() == 'rd':
        word = 'road'

    return word.capitalize()


# normalize_words changes each list in place, so apply() is used only for its side effect
data_frames['street_address'].apply(lambda x: normalize_words(x))
data_frames['street_address'] = data_frames['street_address'].str.join(' ')

print(data_frames.head(10))
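
Note: the same abbreviation handling can be sketched with a lookup dictionary and a helper that returns a new string, so apply() works on the full address directly. This is a hypothetical variant, not part of the commit; the abbreviation map and the assumption that every address starts with a house number are illustrative.

import pandas as pd

data_frames = pd.read_csv(r'simulated_data.csv')

# Assumed abbreviation map, for illustration only
ABBREVIATIONS = {'st': 'Street', 'rd': 'Road'}

def normalize_address(address):
    # assumes a non-empty address whose first word is the house number
    words = address.split()
    fixed = [words[0]] + [ABBREVIATIONS.get(w.lower(), w.capitalize()) for w in words[1:]]
    return ' '.join(fixed)

data_frames['street_address'] = data_frames['street_address'].apply(normalize_address)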
Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

# Data Type Conversion
# Remove '$' from donation strings
data_frames['donation'] = data_frames['donation'].str.strip('$')

# Convert donation strings into numerical data type
data_frames['donation'] = data_frames['donation'].astype('float64')


# Handle Data Inconsistencies
# Normalize strings
data_frames['street_address'] = data_frames['street_address'].str.split()

def normalize_words(arr):
    for index, word in enumerate(arr):
        if index == 0:
            pass
        else:
            arr[index] = normalize(word)

def normalize(word):
    if word.lower() == 'st':
        word = 'street'
    elif word.lower() == 'rd':
        word = 'road'

    return word.capitalize()


data_frames['street_address'].apply(lambda x: normalize_words(x))
data_frames['street_address'] = data_frames['street_address'].str.join(' ')


# Remove Out-of-Range Data
# create boolean Series for out of range donations
out_of_range = data_frames['donation'] < 0

# keep only donations that are NOT out of range
# (index alignment leaves NaN where a donation was out of range)
data_frames['donation'] = data_frames['donation'][~out_of_range]

print(data_frames.head(10))
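
Note: because the filtered Series is assigned back by index alignment, negative donations become NaN rather than their rows being dropped. A one-line sketch with the same effect, assuming the same 'donation' column, using Series.mask:

# Hypothetical equivalent: NaN out negative donations in place
data_frames['donation'] = data_frames['donation'].mask(data_frames['donation'] < 0)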
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

# Data Type Conversion
# Remove '$' from donation strings
data_frames['donation'] = data_frames['donation'].str.strip('$')

# Convert donation strings into numerical data type
data_frames['donation'] = data_frames['donation'].astype('float64')


# Handle Data Inconsistencies
# Normalize strings
data_frames['street_address'] = data_frames['street_address'].str.split()

def normalize_words(arr):
    for index, word in enumerate(arr):
        if index == 0:
            pass
        else:
            arr[index] = normalize(word)

def normalize(word):
    if word.lower() == 'st':
        word = 'street'
    elif word.lower() == 'rd':
        word = 'road'

    return word.capitalize()


data_frames['street_address'].apply(lambda x: normalize_words(x))
data_frames['street_address'] = data_frames['street_address'].str.join(' ')


# Remove Out-of-Range Data
# create boolean Series for out of range donations
out_of_range = data_frames['donation'] < 0

# keep only donations that are NOT out of range
# (index alignment leaves NaN where a donation was out of range)
data_frames['donation'] = data_frames['donation'][~out_of_range]


# Remove duplicates
columns_to_check = ['first_name', 'last_name', 'street_address', 'city', 'state']
data_frames_no_dupes = data_frames.drop_duplicates(subset=columns_to_check, keep='first')

print(data_frames_no_dupes.info())
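
Note: to see which rows drop_duplicates would discard before committing to the change, pandas' duplicated() can be used with the same subset. A small sketch under the same assumptions, not part of the commit:

# Hypothetical check: show the rows flagged as duplicates (first occurrence kept)
columns_to_check = ['first_name', 'last_name', 'street_address', 'city', 'state']
duplicate_rows = data_frames[data_frames.duplicated(subset=columns_to_check, keep='first')]
print(duplicate_rows)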
Lines changed: 60 additions & 0 deletions

@@ -0,0 +1,60 @@
import pandas as pd

# Config settings
pd.set_option('max_columns', None)
pd.set_option('max_rows', 12)

# Import CSV data
data_frames = pd.read_csv(r'simulated_data.csv')

# Data Type Conversion
# Remove '$' from donation strings
data_frames['donation'] = data_frames['donation'].str.strip('$')

# Convert donation strings into numerical data type
data_frames['donation'] = data_frames['donation'].astype('float64')


# Handle Data Inconsistencies
# Normalize strings
data_frames['street_address'] = data_frames['street_address'].str.split()

def normalize_words(arr):
    for index, word in enumerate(arr):
        if index == 0:
            pass
        else:
            arr[index] = normalize(word)

def normalize(word):
    if word.lower() == 'st':
        word = 'street'
    elif word.lower() == 'rd':
        word = 'road'

    return word.capitalize()


data_frames['street_address'].apply(lambda x: normalize_words(x))
data_frames['street_address'] = data_frames['street_address'].str.join(' ')


# Remove Out-of-Range Data
# create boolean Series for out of range donations
out_of_range = data_frames['donation'] < 0

# keep only donations that are NOT out of range
# (index alignment leaves NaN where a donation was out of range)
data_frames['donation'] = data_frames['donation'][~out_of_range]


# Remove duplicates
columns_to_check = ['first_name', 'last_name', 'street_address', 'city', 'state']
data_frames_no_dupes = data_frames.drop_duplicates(subset=columns_to_check, keep='first')


# Drop Missing Data
columns_to_check = ['state', 'donation']
data_frames_no_missing = data_frames_no_dupes.dropna(subset=columns_to_check)


print(data_frames_no_missing.head(20))
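
Note: if the cleaned result is meant to be reused, a final step (not in the commit) could write it back out. A minimal sketch with an assumed output filename:

# Hypothetical: persist the cleaned data (output filename assumed)
data_frames_no_missing.to_csv('simulated_data_clean.csv', index=False)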
