Commit 7ed6137

Merge pull request #418 from SimranShaikh20/data_analysis

Data analysis added eda enhancement to streamlit app.py file

2 parents 0d70e89 + e40388f, commit 7ed6137

3 files changed, with 786 additions and 3 deletions.

File tree

- OpenSourceEda.ipynb
- opensource_analysis
  - README
  - app.py

`‎OpenSourceEda.ipynb‎`

Lines changed: 596 additions & 0 deletions

Large diffs are not rendered by default.

`‎opensource_analysis/README‎`

Lines changed: 79 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -18,3 +18,82 @@ streamlit run app.py`
`18`	`18`
`19`	`19`	`## Access the App`
`20`	`20`	`Open the URL http://localhost:8501 in your web browser to access the Streamlit app`
	`21`	`+`
	`22`	`+`
	`23`	`+# Survey Data EDA and Machine Learning App`
	`24`	`+`
	`25`	`+This repository contains an application built using Streamlit to explore and analyze survey data from developers. The app performs Exploratory Data Analysis (EDA) and includes visualizations of key data features. Additionally, it can be extended to support machine learning tasks like prediction.`
	`26`	`+`
	`27`	`+## Table of Contents`
	`28`	`+- [Overview](#overview)`
	`29`	`+- [Installation](#installation)`
	`30`	`+- [Dataset](#dataset)`
	`31`	`+- [Features](#features)`
	`32`	`+ - [1. Data Loading](#1-data-loading)`
	`33`	`+ - [2. Basic Information](#2-basic-information)`
	`34`	`+ - [3. Categorical Value Counts](#3-categorical-value-counts)`
	`35`	`+ - [4. Visualizations](#4-visualizations)`
	`36`	`+ - [5. Correlation Heatmap](#5-correlation-heatmap)`
	`37`	`+ - [6. Cumulative Distribution](#6-cumulative-distribution)`
	`38`	`+- [Usage](#usage)`
	`39`	`+- [Future Improvements](#future-improvements)`
	`40`	`+- [License](#license)`
	`41`	`+`
	`42`	`+## Overview`
	`43`	`+This project provides an interactive web-based application that allows users to explore a dataset of developer survey results. The app is built using Streamlit and includes several exploratory data analysis features, such as visualizing distributions of different variables (e.g., salary, job satisfaction, age). It also displays the relationship between various factors, such as job satisfaction and company size, and can be extended to machine learning tasks.`
	`44`	`+`
	`45`	`+## Dataset`
	`46`	`+The dataset used in this project is a sample from the 2018 Developer Survey Results. It contains various columns such as:`
	`47`	+- `Country`: Respondent's country.
	`48`	+- `Employment`: Employment status of the respondent.
	`49`	+- `ConvertedSalary`: Salary converted into USD.
	`50`	+- `DevType`: Developer types (e.g., web developer, data scientist).
	`51`	+- `LanguageWorkedWith`: Programming languages the respondent has worked with.
	`52`	+- `CompanySize`: Size of the company the respondent works for.
	`53`	+- `JobSatisfaction`: Job satisfaction rating on a scale.
	`54`	+- `CareerSatisfaction`: Career satisfaction rating.
	`55`	`+`
	`56`	`+## Features`
	`57`	`+`
	`58`	`+### 1. Data Loading`
	`59`	`+The application loads the dataset (CSV file) and fills in missing values where necessary. If the file is not found, an error message will be displayed on the app.`
	`60`	`+`
	`61`	`+### 2. Basic Information`
	`62`	`+Displays essential information about the dataset, including:`
	`63`	+- General structure of the data (`df.info()`).
	`64`	+- Descriptive statistics (`df.describe()`).
	`65`	`+`
	`66`	`+### 3. Categorical Value Counts`
	`67`	+For the categorical columns (`Country`, `Employment`, `DevType`, `LanguageWorkedWith`), the app shows the distribution of values using value counts and percentages.
	`68`	`+`
	`69`	`+### 4. Visualizations`
	`70`	`+The app provides the following visualizations to explore the data:`
	`71`	`+- Salary Distribution: A histogram with kernel density estimation (KDE) to visualize salary distribution.`
	`72`	+- Job Satisfaction Analysis: Bar charts for `JobSatisfaction` and `CareerSatisfaction`.
	`73`	`+- Programming Languages: The top 10 most-used programming languages among respondents.`
	`74`	`+- Job Satisfaction by Company Size: A box plot showing the relationship between company size and job satisfaction.`
	`75`	`+- Age Distribution: A histogram with KDE to show the age distribution of respondents.`
	`76`	`+- Country Distribution: A line plot showing the top 10 countries by the number of respondents.`
	`77`	`+- Employment Status: A pie chart showing the employment status distribution.`
	`78`	`+- Database Usage: A bar chart of the top 10 databases used by respondents.`
	`79`	`+- Job Satisfaction by Gender: A bar chart comparing job satisfaction across genders.`
	`80`	`+`
	`81`	`+### 5. Correlation Heatmap`
	`82`	`+Displays a heatmap showing the correlation between numerical variables in the dataset.`
	`83`	`+`
	`84`	`+### 6. Cumulative Distribution`
	`85`	`+Provides an Empirical Cumulative Distribution Function (ECDF) plot for the first numerical column in the dataset.`
	`86`	`+`
	`87`	`+## Usage`
	`88`	`+After launching the app:`
	`89`	`+1. The app loads the dataset and displays key information and visualizations on the home page.`
	`90`	`+2. Navigate through the sections to explore different parts of the dataset interactively.`
	`91`	`+3. The app is designed to be modular, allowing for future extensions, such as adding machine learning models for prediction tasks.`
	`92`	`+`
	`93`	`+## Future Improvements`
	`94`	+- Implement a machine learning model to predict job satisfaction or salary based on features like `Country`, `Employment`, `DevType`, etc.
	`95`	`+- Enhance the EDA with more detailed visualizations and insights.`
	`96`	`+- Allow users to upload their own dataset for customized analysis.`
	`97`	`+`
	`98`	`+## License`
	`99`	`+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.`
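
The README's "Categorical Value Counts" feature pairs raw counts with percentages. A minimal sketch of how such a table could be built with pandas is shown here; the helper name `value_counts_with_percent` and the sample values are illustrative assumptions, not code from this commit.

```python
import pandas as pd


def value_counts_with_percent(series: pd.Series) -> pd.DataFrame:
    """Return counts and percentage shares for each distinct value.

    Hypothetical helper: missing values are kept as their own category
    (dropna=False) so the percentages always add up to 100.
    """
    counts = series.value_counts(dropna=False)
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})


# Made-up sample standing in for the survey's `Country` column.
sample = pd.Series(["India", "United States", "India", "Germany"], name="Country")
summary = value_counts_with_percent(sample)
print(summary)
```

In a Streamlit app, a table like `summary` could be passed to `st.dataframe` for each categorical column in turn.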

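The ECDF named in the README's "Cumulative Distribution" section gives, for each value x, the fraction of observations that are less than or equal to x. The sketch below computes that quantity with the standard library only, to show what the plot represents; the function name `ecdf` and the sample salaries are hypothetical, and the app itself delegates the plotting to `sns.ecdfplot`.

```python
from bisect import bisect_right


def ecdf(values):
    """Build the empirical CDF of `values`.

    Returns a function F such that F(x) is the proportion of
    observations less than or equal to x.
    """
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("ecdf requires at least one observation")

    def F(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect_right(data, x) / n

    return F


# Made-up salaries standing in for a numeric survey column.
salaries = [40_000, 55_000, 55_000, 90_000]
F = ecdf(salaries)
print(F(55_000))  # three of the four observations are <= 55,000
```

Because the result is a step function over the sorted data, it needs no binning choice, which is one reason an ECDF is a useful companion to the histograms listed above.
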
`‎opensource_analysis/app.py‎`

Lines changed: 111 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -64,8 +64,8 @@`
`64`	`64`
`65`	`65`	`# Evaluate the model`
`66`	`66`	`y_pred = model.predict(X_test)`
`67`		`-classification_rep = classification_report(y_test, y_pred)`
`68`		`-roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`
	`67`	`+classification_rep = classification_report(y_test, y_pred, zero_division=1)`
	`68`	`+roc_auc = roc_auc_score(pd.get_dummies(y_test).values[:, 1], model.predict_proba(X_test)[:, 1])`
`69`	`69`
`70`	`70`	`# Get feature importance`
`71`	`71`	`importances = model.named_steps['classifier'].feature_importances_`
`@@ -94,7 +94,7 @@`
`94`	`94`
`95`	`95`	`# Plot ROC Curve`
`96`	`96`	`st.header('ROC Curve')`
`97`		`-y_test_binary = y_test.map({'No': 0, 'Yes': 1})`
	`97`	`+y_test_binary = pd.get_dummies(y_test).values[:, 1]  # Convert to binary`
`98`	`98`	`fpr, tpr, _ = roc_curve(y_test_binary, model.predict_proba(X_test)[:, 1])`
`99`	`99`	`roc_auc = auc(fpr, tpr)`
`100`	`100`	`fig, ax = plt.subplots()`
`@@ -151,5 +151,113 @@`
`151`	`151`	`except Exception as e:`
`152`	`152`	`    st.error(f"An error occurred during prediction: {e}")`
`153`	`153`
	`154`	`+# ================== EDA Enhancements ==================`
	`155`	`+st.header('Enhanced Exploratory Data Analysis (EDA)')`
	`156`	`+`
	`157`	`+# Load full dataset for EDA`
	`158`	`+eda_data = pd.read_csv(file_path)`
	`159`	`+`
	`160`	`+# Salary Analysis`
	`161`	`+st.subheader('Salary Distribution')`
	`162`	`+eda_data['ConvertedSalary'] = pd.to_numeric(eda_data['ConvertedSalary'], errors='coerce')`
	`163`	`+fig, ax = plt.subplots()`
	`164`	`+sns.histplot(eda_data['ConvertedSalary'].dropna(), kde=True, ax=ax)`
	`165`	`+ax.set_title('Distribution of Salaries')`
	`166`	`+ax.set_xlabel('Salary (USD)')`
	`167`	`+st.pyplot(fig)`
	`168`	`+`
	`169`	`+# Job Satisfaction Analysis`
	`170`	`+satisfaction_cols = ['JobSatisfaction', 'CareerSatisfaction']`
	`171`	`+for col in satisfaction_cols:`
	`172`	`+    st.subheader(f'Distribution of {col}')`
	`173`	`+    fig, ax = plt.subplots()`
	`174`	`+    eda_data[col].value_counts().plot(kind='bar', ax=ax)`
	`175`	`+    ax.set_title(f'Distribution of {col}')`
	`176`	`+    ax.set_xlabel('Satisfaction Level')`
	`177`	`+    ax.set_ylabel('Count')`
	`178`	`+    st.pyplot(fig)`
	`179`	`+`
	`180`	`+# Programming Languages Analysis`
	`181`	`+st.subheader('Top 10 Programming Languages')`
	`182`	`+languages = eda_data['LanguageWorkedWith'].str.split(';', expand=True).stack()`
	`183`	`+fig, ax = plt.subplots()`
	`184`	`+languages.value_counts().head(10).plot(kind='bar', ax=ax)`
	`185`	`+ax.set_title('Top 10 Programming Languages')`
	`186`	`+ax.set_xlabel('Language')`
	`187`	`+ax.set_ylabel('Count')`
	`188`	`+st.pyplot(fig)`
	`189`	`+`
	`190`	`+# Job Satisfaction by Company Size`
	`191`	`+st.subheader('Job Satisfaction by Company Size')`
	`192`	`+fig, ax = plt.subplots()`
	`193`	`+sns.boxplot(x='CompanySize', y='JobSatisfaction', data=eda_data, ax=ax)`
	`194`	`+ax.set_title('Job Satisfaction by Company Size')`
	`195`	`+ax.set_xlabel('Company Size')`
	`196`	`+ax.set_ylabel('Job Satisfaction')`
	`197`	`+st.pyplot(fig)`
	`198`	`+`
	`199`	`+# Age Distribution`
	`200`	`+st.subheader('Age Distribution of Respondents')`
	`201`	`+fig, ax = plt.subplots()`
	`202`	`+sns.histplot(eda_data['Age'], kde=True, ax=ax)`
	`203`	`+ax.set_title('Age Distribution of Respondents')`
	`204`	`+ax.set_xlabel('Age')`
	`205`	`+st.pyplot(fig)`
	`206`	`+`
	`207`	`+# Top 10 Countries of Respondents`
	`208`	`+st.subheader('Top 10 Countries of Respondents')`
	`209`	`+country_counts = eda_data['Country'].value_counts().head(10)`
	`210`	`+fig, ax = plt.subplots()`
	`211`	`+ax.plot(country_counts.index, country_counts.values, marker='o')`
	`212`	`+ax.set_title('Top 10 Countries of Respondents')`
	`213`	`+ax.set_xlabel('Country')`
	`214`	`+ax.set_ylabel('Number of Respondents')`
	`215`	`+st.pyplot(fig)`
	`216`	`+`
	`217`	`+# Employment Status Distribution`
	`218`	`+st.header("Employment Status Distribution")`
	`219`	`+employment_counts = eda_data['Employment'].value_counts()`
	`220`	`+fig, ax = plt.subplots()`
	`221`	`+ax.pie(employment_counts.values, labels=employment_counts.index, autopct='%1.1f%%')`
	`222`	`+ax.set_title('Employment Status Distribution')`
	`223`	`+ax.axis('equal')`
	`224`	`+st.pyplot(fig)`
	`225`	`+`
	`226`	`+# Databases Used`
	`227`	`+st.header("Top 10 Databases Used")`
	`228`	`+databases = eda_data['DatabaseWorkedWith'].str.split(';', expand=True).stack()`
	`229`	`+db_counts = databases.value_counts().head(10)`
	`230`	`+fig, ax = plt.subplots()`
	`231`	`+db_counts.plot(kind='barh', ax=ax)`
	`232`	`+ax.set_xlabel('Number of Users')`
	`233`	`+ax.set_ylabel('Database')`
	`234`	`+st.pyplot(fig)`
	`235`	`+`
	`236`	`+# Job Satisfaction by Gender`
	`237`	`+st.header("Job Satisfaction by Gender")`
	`238`	`+job_sat_gender = pd.crosstab(eda_data['JobSatisfaction'], eda_data['Gender'])`
	`239`	`+fig, ax = plt.subplots()`
	`240`	`+job_sat_gender.plot(kind='bar', ax=ax)`
	`241`	`+ax.set_title('Job Satisfaction by Gender')`
	`242`	`+ax.set_xlabel('Job Satisfaction Level')`
	`243`	`+st.pyplot(fig)`
	`244`	`+`
	`245`	`+# Correlation Heatmap`
	`246`	`+st.header("Correlation Heatmap of Numeric Variables")`
	`247`	`+numeric_columns = eda_data.select_dtypes(include=['int64', 'float64']).columns`
	`248`	`+fig, ax = plt.subplots()`
	`249`	`+sns.heatmap(eda_data[numeric_columns].corr(), annot=True, cmap='coolwarm', ax=ax)`
	`250`	`+ax.set_title('Correlation Heatmap of Numeric Variables')`
	`251`	`+st.pyplot(fig)`
	`252`	`+`
	`253`	`+# Cumulative Distribution`
	`254`	`+st.header(f"Cumulative Distribution of {numeric_columns[0]}")`
	`255`	`+fig, ax = plt.subplots()`
	`256`	`+sns.ecdfplot(data=eda_data, x=numeric_columns[0], ax=ax)`
	`257`	`+ax.set_title(f'Cumulative Distribution of {numeric_columns[0]}')`
	`258`	`+ax.set_xlabel(numeric_columns[0])`
	`259`	`+ax.set_ylabel('Cumulative Proportion')`
	`260`	`+st.pyplot(fig)`
	`261`	`+`
`154`	`262`	`except Exception as e:`
`155`	`263`	`    st.error(f"An error occurred while loading data: {e}")`
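
The ROC changes in this diff replace the explicit `{'No': 0, 'Yes': 1}` mapping with `pd.get_dummies(y_test).values[:, 1]`. That expression takes the second indicator column, so the positive class is decided by column order rather than stated, and the index raises an `IndexError` when the test split contains a single class. A possible alternative that names the positive label is sketched below; the helper `to_binary` is hypothetical, and `'Yes'` is assumed to be the positive label only because the removed line mapped it to 1.

```python
import pandas as pd


def to_binary(y: pd.Series, positive_label) -> pd.Series:
    """Encode labels as 1 for `positive_label` and 0 for everything else.

    Hypothetical helper: the positive class is named explicitly instead
    of being inferred from the order of one-hot columns.
    """
    return (y == positive_label).astype(int)


# Made-up labels standing in for y_test.
y_test = pd.Series(["No", "Yes", "Yes", "No"])
y_test_binary = to_binary(y_test, positive_label="Yes")
print(y_test_binary.tolist())
```

With scikit-learn classifiers, the columns of `predict_proba` follow the order of `model.classes_`, so the same label can be used to look up the matching probability column instead of hard-coding index 1.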

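Both `LanguageWorkedWith` and `DatabaseWorkedWith` are handled in the diff as semicolon-delimited multi-select answers, split with `str.split(';', expand=True).stack()`. The same counting step can be sketched with `Series.explode`, which avoids building a wide intermediate frame; the helper name `top_multiselect` and the sample answers are assumptions made for illustration, not part of this commit.

```python
import pandas as pd


def top_multiselect(series: pd.Series, n: int = 10, sep: str = ";") -> pd.Series:
    """Count individual options in a delimited multi-select column.

    Hypothetical helper: rows with missing answers are dropped, each
    answer is split on `sep`, and every option becomes its own row
    before counting.
    """
    items = series.dropna().str.split(sep).explode().str.strip()
    return items.value_counts().head(n)


# Made-up answers standing in for the `LanguageWorkedWith` column.
langs = pd.Series(["Python;SQL", "Python;JavaScript", None, "SQL;Python"])
top = top_multiselect(langs, n=2)
print(top)
```

The returned Series can be plotted the same way as in the diff, for example with `top.plot(kind='bar', ax=ax)`.
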
0 commit comments
