Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from ⓚ𝖆𝖌𝖌𝖑𝖊
ajaymache/data-analysis-using-python
Data analysis, sometimes referred to as exploratory data analysis (EDA), is one of the core components of data science. It is also the part on which data scientists, data engineers and data analysts spend the majority of their time, which makes it extremely important in the field. This repository demonstrates some common exploratory data analysis methods and techniques using Python. For illustration, the used car database dataset has been taken from Kaggle, since it is well suited to performing EDA and taking a first step into data science. Good luck with your EDA on the used car database dataset.
- The dataset is taken from Kaggle and contains details of used cars in Germany that are on sale on eBay.
- The dataset is not clean, so a lot of data cleaning is carried out. For example, prices that were too high are replaced by the median, and outliers are removed accordingly.
- Vehicles whose registration year was greater than 2016 or less than 1890 were also removed from the dataset, as this data is inconsistent and would yield incorrect results.
- The cleaned dataset is stored in a CleanData folder, which contains the entire cleaned dataset, named cleaned_autos.csv, and another folder named DataForAnalysis containing a file structure with subsets of the cleaned dataset based on vehicle brand and vehicle type.
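The two cleaning rules above can be sketched with pandas as follows. This is a minimal illustration, not the repository's Datapreparation code: the function name `clean_autos` and the `MAX_PRICE` cutoff are assumptions, since the text does not say which prices count as too high.

```python
import pandas as pd

# Hypothetical cutoff: the README does not state the threshold for "too high".
MAX_PRICE = 150_000


def clean_autos(df: pd.DataFrame, max_price: float = MAX_PRICE) -> pd.DataFrame:
    """Return a cleaned copy of the raw listings."""
    # Keep only registration years from 1890 to 2016 (both inclusive).
    cleaned = df[df["yearOfRegistration"].between(1890, 2016)].copy()

    # Work in floats so that a fractional median can be stored safely.
    cleaned["price"] = cleaned["price"].astype(float)

    # Replace prices above the cutoff with the median of the remaining prices.
    too_high = cleaned["price"] > max_price
    median_price = cleaned.loc[~too_high, "price"].median()
    cleaned.loc[too_high, "price"] = median_price
    return cleaned


raw = pd.DataFrame(
    {
        "price": [480, 18300, 9800, 99999999, 1500],
        "yearOfRegistration": [1993, 2011, 2004, 2005, 9999],
    }
)
print(clean_autos(raw))
```

The year filter runs first so that rows with impossible registration years do not influence the median used for the price replacement.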
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2016-03-24 11:52:17 | Golf_3_1.6 | privat | Angebot | 480 | test | nan | 1993 | manuell | 0 | golf | 150000 | 0 | benzin | volkswagen | nan | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
2016-03-24 10:58:45 | A5_Sportback_2.7_Tdi | privat | Angebot | 18300 | test | coupe | 2011 | manuell | 190 | nan | 125000 | 5 | diesel | audi | ja | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
2016-03-14 12:52:21 | Jeep_Grand_Cherokee_"Overland" | privat | Angebot | 9800 | test | suv | 2004 | automatik | 163 | grand | 125000 | 8 | diesel | jeep | nan | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
The main folder contains 9 folders.
- Folders Analysis1 to Analysis5 each contain the iPython Notebook and Python script, along with the Plots, for that analysis.
- A folder for shell scripts, which automate creating the file structure and splitting the data as mentioned above.
- The Datapreparation folder contains the Datapreparation iPython script for cleaning the data.
- The CleanData folder contains the clean dataset and subsets of the data as per the file structure.
- The RawData folder contains the raw dataset.
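The repository splits the cleaned data with shell scripts. As a rough Python equivalent, the split by brand and vehicle type could look like the sketch below. The function name `split_by_brand_and_type` and the `<brand>/<vehicleType>.csv` layout are assumptions made for illustration; the actual file structure may differ.

```python
from pathlib import Path

import pandas as pd


def split_by_brand_and_type(df: pd.DataFrame, out_dir: Path) -> list[Path]:
    """Write one CSV per (brand, vehicleType) pair under out_dir/<brand>/."""
    written = []
    # groupby drops rows whose brand or vehicleType is missing (dropna=True).
    for (brand, vehicle_type), subset in df.groupby(["brand", "vehicleType"]):
        target = out_dir / str(brand) / f"{vehicle_type}.csv"
        target.parent.mkdir(parents=True, exist_ok=True)
        subset.to_csv(target, index=False)
        written.append(target)
    return written
```

Calling `split_by_brand_and_type(cleaned, Path("DataForAnalysis"))` on a cleaned DataFrame would then create one folder per brand, each holding one file per vehicle type.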
Analysis 1 Analysis1.py Analysis1.ipynb Plots
- This analysis gives the distribution of vehicle prices based on vehicle type.
- The output before cleaning the data is shown below to highlight the importance of cleaning this dataset.
- Histogram and KDE before performing data cleaning.
- It is clearly visible that the dataset has many outliers and inconsistent data, as the year of registration cannot be more than 2016 or less than 1890.
Boxplot of vehicle prices by vehicle type after cleaning the dataset. The boxplot shows how prices vary with vehicle type; the low, 25th percentile, 50th percentile (median), 75th percentile and high values can be estimated from it.
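The values read off the boxplot can also be computed directly. The sketch below assumes a cleaned DataFrame with `vehicleType` and `price` columns; the function name and the output column labels are illustrative. Note that it reports the true minimum and maximum, which can differ from the whisker ends a plotting library draws.

```python
import pandas as pd


def price_summary_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Minimum, quartiles and maximum of price for each vehicle type."""
    summary = (
        df.groupby("vehicleType")["price"]
        .quantile([0.0, 0.25, 0.5, 0.75, 1.0])
        .unstack()  # one column per requested quantile
    )
    summary.columns = ["low", "q25", "median", "q75", "high"]
    return summary


sample = pd.DataFrame(
    {
        "vehicleType": ["suv"] * 5 + ["coupe"] * 2,
        "price": [1000, 2000, 3000, 4000, 5000, 100, 300],
    }
)
print(price_summary_by_type(sample))
```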
Analysis 2 Analysis2.py Analysis2.ipynb Plots
- This analysis gives the number of cars available for sale in the entire dataset for each brand.
Barplot of the average price of the vehicles for sale based on the type of vehicle as well as the gearbox.
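The numbers behind both plots can be obtained with two short aggregations. This is a sketch under the assumption that the cleaned data uses the `brand`, `vehicleType`, `gearbox` and `price` columns shown in the sample table; the function names are hypothetical.

```python
import pandas as pd


def listings_per_brand(df: pd.DataFrame) -> pd.Series:
    """Number of listings per brand, most common brand first."""
    return df["brand"].value_counts()


def mean_price_by_type_and_gearbox(df: pd.DataFrame) -> pd.DataFrame:
    """Average price with vehicle types as rows and gearbox types as columns."""
    return (
        df.groupby(["vehicleType", "gearbox"])["price"]
        .mean()
        .unstack("gearbox")
    )


sample = pd.DataFrame(
    {
        "brand": ["volkswagen", "volkswagen", "audi", "jeep"],
        "vehicleType": ["suv", "coupe", "coupe", "suv"],
        "gearbox": ["automatik", "manuell", "manuell", "automatik"],
        "price": [10200, 2000, 18300, 9800],
    }
)
print(listings_per_brand(sample))
print(mean_price_by_type_and_gearbox(sample))
```

If matplotlib is installed, calling `.plot.bar()` on either result draws the corresponding bar plot.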
Analysis 3 Analysis3.py Analysis3.ipynb Plots
- This analysis gives the average price of the vehicles based on fuel type and vehicle type.
Barplot of the average power of the vehicles based on fuel type and vehicle type.
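A pivot table is one compact way to produce both averages from the same helper. The sketch assumes the `fuelType`, `vehicleType`, `price` and `powerPS` columns from the sample table; `mean_by_fuel_and_type` is a hypothetical name.

```python
import pandas as pd


def mean_by_fuel_and_type(df: pd.DataFrame, value: str) -> pd.DataFrame:
    """Average of `value` with fuel types as rows and vehicle types as columns."""
    return pd.pivot_table(
        df,
        values=value,
        index="fuelType",
        columns="vehicleType",
        aggfunc="mean",
    )


sample = pd.DataFrame(
    {
        "fuelType": ["diesel", "diesel", "benzin", "diesel"],
        "vehicleType": ["suv", "suv", "coupe", "coupe"],
        "price": [9800, 10200, 2000, 18300],
        "powerPS": [163, 177, 90, 190],
    }
)
print(mean_by_fuel_and_type(sample, "price"))
print(mean_by_fuel_and_type(sample, "powerPS"))
```

Combinations that do not occur in the data, such as benzin and suv in this sample, appear as NaN in the result.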
Analysis 4 Analysis4.py Analysis4.ipynb Plots
- This analysis gives the average price of vehicles by brand and type, for the brands and types that are likely to be found in the dataset.
Analysis 5 Analysis5.py Analysis5.ipynb Plots
- This analysis gives the distribution of the total number of days a particular vehicle was online for sale before it was purchased.
- This is a dynamic analysis and can be applied to any brand of vehicle by specifying the brand of choice as an argument to the Python script.
- To run this file, type in your terminal: Analysis5.py 'brand'
- where 'brand' is the brand you would like to analyse, taken from the 'brand' column in the dataset.
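The brand-driven analysis can be sketched as below. This is not the contents of Analysis5.py: it assumes that the number of days online is the difference between the `lastSeen` and `dateCreated` columns, which the text does not confirm, and the function names are hypothetical.

```python
import argparse

import pandas as pd


def days_online(df: pd.DataFrame, brand: str) -> pd.Series:
    """Whole days between dateCreated and lastSeen for one brand's listings."""
    subset = df[df["brand"] == brand]
    created = pd.to_datetime(subset["dateCreated"])
    last_seen = pd.to_datetime(subset["lastSeen"])
    # Assumption: lastSeen approximates the moment the vehicle was sold.
    return (last_seen - created).dt.days


def parse_brand(argv=None) -> str:
    """Read the brand from the command line, e.g. `Analysis5.py audi`."""
    parser = argparse.ArgumentParser(description="Days online for one brand")
    parser.add_argument("brand", help="a value from the 'brand' column")
    return parser.parse_args(argv).brand


sample = pd.DataFrame(
    {
        "brand": ["audi", "jeep"],
        "dateCreated": ["2016-03-24 00:00:00", "2016-03-14 00:00:00"],
        "lastSeen": ["2016-04-07 01:46:50", "2016-04-05 12:47:46"],
    }
)
print(days_online(sample, parse_brand(["audi"])))
```

Grouping the result by `vehicleType` would then show which types within the chosen brand spend the fewest days online.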
Analysis 1
- There are many outliers with a registration year greater than 2016 or less than 1890, which are removed to make the dataset ready for analysis.
- Vehicles with a registration year from 1990 to 2016 make up most of those for sale, with the year 2000 being the highest at 24313 vehicles.
Analysis 2
- Vehicles of type SUV and Cabrio are the most expensive, at greater than $5000, compared to Coupe, Bus etc., which are moderately expensive in the range of $2650 to $5000, whereas the least expensive are Andere and Others, priced at less than $1800 on average.
- Vehicles of the brands Volkswagen, Opel and BMW are the most numerous for sale, in decreasing order, with Volkswagen having the most.
- As a general trend, vehicles with an automatic gearbox are more expensive than those with a manual or unspecified gearbox type.
Analysis 3
- Hybrid vehicles have the highest average price compared to other fuel types such as Diesel and Gasoline.
- SUVs with an automatic gearbox have the maximum power, and Kleinwagen have the least.
Analysis 4
- Vehicles of brand Audi and type SUV are the most expensive of the available vehicles for sale.
- Vehicles of brand Porsche and type Kleinwagen are the least expensive of the available vehicles for sale.
Analysis 5
- Based on the selected brand, it can be determined which types of vehicle in that brand tend to get sold quickly online compared to others.