oriser/pythoncode-tutorials (Public)

forked from x4nth055/pythoncode-tutorials

Commit b0ae13a

add data cleaning tutorial

1 parent de76901 commit b0ae13a

14 files changed, +440 -0 lines changed

`‎README.md‎`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -120,6 +120,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy`
`120`	`120`	`- [How to Plot Weather Temperature in Python](https://www.thepythoncode.com/article/interactive-weather-plot-with-matplotlib-and-requests). ([code](general/interactive-weather-plot/))`
`121`	`121`	`- [How to Generate SVG Country Maps in Python](https://www.thepythoncode.com/article/generate-svg-country-maps-python). ([code](general/generate-svg-country-map))`
`122`	`122`	`- [How to Query the Ethereum Blockchain with Python](https://www.thepythoncode.com/article/query-ethereum-blockchain-with-python). ([code](general/query-ethereum))`
	`123`	`+- [Data Cleaning with Pandas in Python](https://www.thepythoncode.com/article/data-cleaning-using-pandas-in-python). ([code](general/data-cleaning-pandas))`
`123`	`124`
`124`	`125`
`125`	`126`

`‎general/data-cleaning-pandas/README.md‎`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1 @@`
	`1`	`+# [Data Cleaning with Pandas in Python](https://www.thepythoncode.com/article/data-cleaning-using-pandas-in-python)`

`‎general/data-cleaning-pandas/data_cleaning.py‎`

Lines changed: 10 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,10 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+print(data_frames.head(10))`

`‎general/data-cleaning-pandas/data_cleaning2.py‎`

Lines changed: 10 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,10 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+print(data_frames.info())`

`‎general/data-cleaning-pandas/data_cleaning3.py‎`

Lines changed: 18 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,18 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+# Data Type Conversion`
	`11`	`+# Remove '$' from donation strings`
	`12`	`+data_frames['donation'] = data_frames['donation'].str.strip('$')`
	`13`	`+`
	`14`	`+# Convert donation strings into numerical data type`
	`15`	`+data_frames['donation'] = data_frames['donation'].astype('float64')`
	`16`	`+`
	`17`	`+print(data_frames.head(10))`
	`18`	`+print(data_frames.info())`
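
Note on `data_cleaning3.py`: the conversion step can be tried without `simulated_data.csv`. The sketch below is only an illustration under stated assumptions: the sample values are invented (the CSV contents are not part of this diff), and it assumes every donation is a string with a `$` prefix and no thousands separators. Note that `str.strip('$')` removes the character from both ends of the string.

```python
import pandas as pd

# Hypothetical sample rows; the real simulated_data.csv is not shown in this diff.
sample = pd.DataFrame({"donation": ["$25.00", "$100.50", "$0.75"]})

# Remove the '$' characters, then cast the remaining text to a float column.
sample["donation"] = sample["donation"].str.strip("$").astype("float64")

print(sample["donation"].dtype)
print(sample["donation"].tolist())
```

If the real column can contain values that do not parse as numbers, `astype` raises an error; `pd.to_numeric(..., errors="coerce")` is one alternative that turns such values into NaN instead.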

`‎general/data-cleaning-pandas/data_cleaning4.py‎`

Lines changed: 32 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,32 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+# Data Type Conversion`
	`11`	`+# Remove '$' from donation strings`
	`12`	`+data_frames['donation'] = data_frames['donation'].str.strip('$')`
	`13`	`+`
	`14`	`+# Convert donation strings into numerical data type`
	`15`	`+data_frames['donation'] = data_frames['donation'].astype('float64')`
	`16`	`+`
	`17`	`+`
	`18`	`+# Handle Data Inconsistencies`
	`19`	`+# Capitalize strings`
	`20`	`+data_frames['street_address'] = data_frames['street_address'].str.split()`
	`21`	`+`
	`22`	`+def capitalize_words(arr):`
	`23`	`+    for index, word in enumerate(arr):`
	`24`	`+        if index == 0:`
	`25`	`+            pass`
	`26`	`+        else:`
	`27`	`+            arr[index] = word.capitalize()`
	`28`	`+`
	`29`	`+data_frames['street_address'].apply(lambda x: capitalize_words(x))`
	`30`	`+data_frames['street_address'] = data_frames['street_address'].str.join(' ')`
	`31`	`+`
	`32`	`+print(data_frames['street_address'])`

`‎general/data-cleaning-pandas/data_cleaning5.py‎`

Lines changed: 41 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,41 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+# Data Type Conversion`
	`11`	`+# Remove '$' from donation strings`
	`12`	`+data_frames['donation'] = data_frames['donation'].str.strip('$')`
	`13`	`+`
	`14`	`+# Convert donation strings into numerical data type`
	`15`	`+data_frames['donation'] = data_frames['donation'].astype('float64')`
	`16`	`+`
	`17`	`+`
	`18`	`+# Handle Data Inconsistencies`
	`19`	`+# Normalize strings`
	`20`	`+data_frames['street_address'] = data_frames['street_address'].str.split()`
	`21`	`+`
	`22`	`+def normalize_words(arr):`
	`23`	`+    for index, word in enumerate(arr):`
	`24`	`+        if index == 0:`
	`25`	`+            pass`
	`26`	`+        else:`
	`27`	`+            arr[index] = normalize(word)`
	`28`	`+`
	`29`	`+def normalize(word):`
	`30`	`+    if word.lower() == 'st':`
	`31`	`+        word = 'street'`
	`32`	`+    elif word.lower() == 'rd':`
	`33`	`+        word = 'road'`
	`34`	`+`
	`35`	`+    return word.capitalize()`
	`36`	`+`
	`37`	`+`
	`38`	`+data_frames['street_address'].apply(lambda x: normalize_words(x))`
	`39`	`+data_frames['street_address'] = data_frames['street_address'].str.join(' ')`
	`40`	`+`
	`41`	`+print(data_frames.head(10))`
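
Note on `data_cleaning5.py`: the script splits each address into a list, mutates the list in place inside `apply`, and joins it back. A possible variant that returns a new string per address is sketched below. The names `ABBREVIATIONS` and `normalize_address` and the sample addresses are hypothetical, introduced only for illustration, and the sketch assumes every address is a non-empty string.

```python
import pandas as pd

# Hypothetical lookup table; the commit hard-codes only 'st' and 'rd'.
ABBREVIATIONS = {"st": "street", "rd": "road"}


def normalize_address(address: str) -> str:
    """Expand known abbreviations and capitalize every token after the first."""
    first, *rest = address.split()
    rest = [ABBREVIATIONS.get(word.lower(), word).capitalize() for word in rest]
    return " ".join([first, *rest])


# Invented sample addresses, for illustration only.
addresses = pd.Series(["123 main st", "45 OAK rd", "9 Elm Street"])
normalized = addresses.apply(normalize_address)

print(normalized.tolist())
```

Like the committed code, this leaves the first token (normally the house number) untouched and capitalizes the rest.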

`‎general/data-cleaning-pandas/data_cleaning6.py‎`

Lines changed: 49 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,49 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+# Data Type Conversion`
	`11`	`+# Remove '$' from donation strings`
	`12`	`+data_frames['donation'] = data_frames['donation'].str.strip('$')`
	`13`	`+`
	`14`	`+# Convert donation strings into numerical data type`
	`15`	`+data_frames['donation'] = data_frames['donation'].astype('float64')`
	`16`	`+`
	`17`	`+`
	`18`	`+# Handle Data Inconsistencies`
	`19`	`+# Normalize strings`
	`20`	`+data_frames['street_address'] = data_frames['street_address'].str.split()`
	`21`	`+`
	`22`	`+def normalize_words(arr):`
	`23`	`+    for index, word in enumerate(arr):`
	`24`	`+        if index == 0:`
	`25`	`+            pass`
	`26`	`+        else:`
	`27`	`+            arr[index] = normalize(word)`
	`28`	`+`
	`29`	`+def normalize(word):`
	`30`	`+    if word.lower() == 'st':`
	`31`	`+        word = 'street'`
	`32`	`+    elif word.lower() == 'rd':`
	`33`	`+        word = 'road'`
	`34`	`+`
	`35`	`+    return word.capitalize()`
	`36`	`+`
	`37`	`+`
	`38`	`+data_frames['street_address'].apply(lambda x: normalize_words(x))`
	`39`	`+data_frames['street_address'] = data_frames['street_address'].str.join(' ')`
	`40`	`+`
	`41`	`+`
	`42`	`+# Remove Out-of-Range Data`
	`43`	`+# create boolean Series for out of range donations`
	`44`	`+out_of_range = data_frames['donation'] < 0`
	`45`	`+`
	`46`	`+# keep only the donations that are NOT out of range (the others become NaN)`
	`47`	`+data_frames['donation'] = data_frames['donation'][~out_of_range]`
	`48`	`+`
	`49`	`+print(data_frames.head(10))`
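
Note on `data_cleaning6.py`: assigning a filtered Series back to the column does not remove rows. pandas realigns the Series on the index, so the out-of-range entries become NaN and the row count stays the same. The sketch below, using invented values, contrasts that with filtering the whole frame.

```python
import pandas as pd

# Invented donations; one negative value stands in for an out-of-range entry.
df = pd.DataFrame({"donation": [10.0, -5.0, 30.0]})
out_of_range = df["donation"] < 0

# Same pattern as the commit: the filtered Series is realigned on the index,
# so the out-of-range entry becomes NaN and the row itself stays.
masked = df.copy()
masked["donation"] = masked["donation"][~out_of_range]

# Alternative: filter the whole frame, which actually drops the row.
dropped = df[~out_of_range]

print(len(masked), int(masked["donation"].isna().sum()))
print(len(dropped))
```

Which behaviour is wanted depends on whether the other columns of an out-of-range row are still worth keeping.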

`‎general/data-cleaning-pandas/data_cleaning7.py‎`

Lines changed: 54 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,54 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+# Data Type Conversion`
	`11`	`+# Remove '$' from donation strings`
	`12`	`+data_frames['donation'] = data_frames['donation'].str.strip('$')`
	`13`	`+`
	`14`	`+# Convert donation strings into numerical data type`
	`15`	`+data_frames['donation'] = data_frames['donation'].astype('float64')`
	`16`	`+`
	`17`	`+`
	`18`	`+# Handle Data Inconsistencies`
	`19`	`+# Normalize strings`
	`20`	`+data_frames['street_address'] = data_frames['street_address'].str.split()`
	`21`	`+`
	`22`	`+def normalize_words(arr):`
	`23`	`+    for index, word in enumerate(arr):`
	`24`	`+        if index == 0:`
	`25`	`+            pass`
	`26`	`+        else:`
	`27`	`+            arr[index] = normalize(word)`
	`28`	`+`
	`29`	`+def normalize(word):`
	`30`	`+    if word.lower() == 'st':`
	`31`	`+        word = 'street'`
	`32`	`+    elif word.lower() == 'rd':`
	`33`	`+        word = 'road'`
	`34`	`+`
	`35`	`+    return word.capitalize()`
	`36`	`+`
	`37`	`+`
	`38`	`+data_frames['street_address'].apply(lambda x: normalize_words(x))`
	`39`	`+data_frames['street_address'] = data_frames['street_address'].str.join(' ')`
	`40`	`+`
	`41`	`+`
	`42`	`+# Remove Out-of-Range Data`
	`43`	`+# create boolean Series for out of range donations`
	`44`	`+out_of_range = data_frames['donation'] < 0`
	`45`	`+`
	`46`	`+# keep only the donations that are NOT out of range (the others become NaN)`
	`47`	`+data_frames['donation'] = data_frames['donation'][~out_of_range]`
	`48`	`+`
	`49`	`+`
	`50`	`+# Remove duplicates`
	`51`	`+columns_to_check = ['first_name', 'last_name', 'street_address', 'city', 'state']`
	`52`	`+data_frames_no_dupes = data_frames.drop_duplicates(subset=columns_to_check, keep='first')`
	`53`	`+`
	`54`	`+print(data_frames_no_dupes.info())`
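
Note on `data_cleaning7.py`: with `subset=...`, `drop_duplicates` compares only the listed columns, and `keep='first'` retains the earliest matching row. A minimal sketch on invented rows (a shorter subset than the commit's, for brevity):

```python
import pandas as pd

# Invented rows; the first two match on every column in the subset.
df = pd.DataFrame({
    "first_name": ["Ann", "Ann", "Bob"],
    "last_name": ["Lee", "Lee", "Ray"],
    "donation": [10.0, 20.0, 30.0],
})

# Rows count as duplicates when the subset columns match; 'donation' is ignored.
deduped = df.drop_duplicates(subset=["first_name", "last_name"], keep="first")

print(deduped["donation"].tolist())
```

Because columns outside the subset are ignored, the second donation for the same person is discarded here, which may or may not be the intent for real donation data.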

`‎general/data-cleaning-pandas/data_cleaning8.py‎`

Lines changed: 60 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,60 @@`
	`1`	`+import pandas as pd`
	`2`	`+`
	`3`	`+# Config settings`
	`4`	`+pd.set_option('display.max_columns', None)`
	`5`	`+pd.set_option('display.max_rows', 12)`
	`6`	`+`
	`7`	`+# Import CSV data`
	`8`	`+data_frames = pd.read_csv(r'simulated_data.csv')`
	`9`	`+`
	`10`	`+# Data Type Conversion`
	`11`	`+# Remove '$' from donation strings`
	`12`	`+data_frames['donation'] = data_frames['donation'].str.strip('$')`
	`13`	`+`
	`14`	`+# Convert donation strings into numerical data type`
	`15`	`+data_frames['donation'] = data_frames['donation'].astype('float64')`
	`16`	`+`
	`17`	`+`
	`18`	`+# Handle Data Inconsistencies`
	`19`	`+# Normalize strings`
	`20`	`+data_frames['street_address'] = data_frames['street_address'].str.split()`
	`21`	`+`
	`22`	`+def normalize_words(arr):`
	`23`	`+    for index, word in enumerate(arr):`
	`24`	`+        if index == 0:`
	`25`	`+            pass`
	`26`	`+        else:`
	`27`	`+            arr[index] = normalize(word)`
	`28`	`+`
	`29`	`+def normalize(word):`
	`30`	`+    if word.lower() == 'st':`
	`31`	`+        word = 'street'`
	`32`	`+    elif word.lower() == 'rd':`
	`33`	`+        word = 'road'`
	`34`	`+`
	`35`	`+    return word.capitalize()`
	`36`	`+`
	`37`	`+`
	`38`	`+data_frames['street_address'].apply(lambda x: normalize_words(x))`
	`39`	`+data_frames['street_address'] = data_frames['street_address'].str.join(' ')`
	`40`	`+`
	`41`	`+`
	`42`	`+# Remove Out-of-Range Data`
	`43`	`+# create boolean Series for out of range donations`
	`44`	`+out_of_range = data_frames['donation'] < 0`
	`45`	`+`
	`46`	`+# keep only the donations that are NOT out of range (the others become NaN)`
	`47`	`+data_frames['donation'] = data_frames['donation'][~out_of_range]`
	`48`	`+`
	`49`	`+`
	`50`	`+# Remove duplicates`
	`51`	`+columns_to_check = ['first_name', 'last_name', 'street_address', 'city', 'state']`
	`52`	`+data_frames_no_dupes = data_frames.drop_duplicates(subset=columns_to_check, keep='first')`
	`53`	`+`
	`54`	`+`
	`55`	`+# Drop Missing Data`
	`56`	`+columns_to_check = ['state', 'donation']`
	`57`	`+data_frames_no_missing = data_frames_no_dupes.dropna(subset=columns_to_check)`
	`58`	`+`
	`59`	`+`
	`60`	`+print(data_frames_no_missing.head(20))`
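
Note on `data_cleaning8.py`: `dropna(subset=...)` removes a row only when one of the listed columns is missing; gaps in other columns are tolerated. The rows below are invented to show that, and the `city` column is included only to demonstrate a missing value outside the subset.

```python
import pandas as pd

# Invented rows: row 0 lacks a city, row 1 lacks a state, row 2 lacks a donation.
df = pd.DataFrame({
    "state": ["NY", None, "CA"],
    "city": [None, "Austin", "Fresno"],
    "donation": [10.0, 20.0, float("nan")],
})

# Only 'state' and 'donation' are checked, so the missing city in row 0 is kept.
cleaned = df.dropna(subset=["state", "donation"])

print(cleaned.index.tolist())
```

In the committed script this step is also what finally removes the rows whose negative donations were set to NaN earlier.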

0 commit comments
