The 2019 novel coronavirus (COVID-19) has infected more than 1 million people worldwide. We want to help fight this epidemic by applying what we have learnt. The dataset consists of 11 CSV files: a structured dataset, provided by Korean researchers, based on the report materials of the KCDC (Korea Centers for Disease Control & Prevention) and local governments.
The dataset PatientInfo.csv has 3128 rows. Many columns contain missing values, so I first explore the characteristics of the data through visualization and then deal with the missing values.
Top ten cities with the most patients
city=PatientInfo['city'].value_counts()
city=city.sort_values(ascending=False)[:10]
city.sort_values(ascending=True).plot.barh(fontsize=15)
plt.title('Top ten cities with the most patients',size=15)
plt.show()
Top ten provinces with the most patients
province=PatientInfo['province'].value_counts()
province=province.sort_values(ascending=False)[:10]
province.sort_values(ascending=True).plot.barh(fontsize=15)
plt.title('Top ten provinces with the most patients',size=15)
plt.show()
Top ten infection_case of patients
infection_case=PatientInfo['infection_case'].value_counts()
infection_case=infection_case.sort_values(ascending=False)[:10]
infection_case.sort_values(ascending=True).plot.barh(fontsize=15)
plt.title('Top ten infection_case of patients',size=15)
plt.show()
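The three "top ten" charts above repeat the same count-and-sort pattern. A minimal sketch of how it could be factored into one helper follows; the function name `top_counts` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def top_counts(frame: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Return the n most frequent values of `column`, smallest first.

    value_counts() already sorts in descending order, so the final sort
    is only there to put the largest bar at the top of a barh plot.
    """
    return frame[column].value_counts().head(n).sort_values(ascending=True)

# Hypothetical toy data standing in for PatientInfo.
toy = pd.DataFrame({"city": ["A", "B", "A", "C", "A", "B"]})
top = top_counts(toy, "city", n=2)
print(top)
# Plotting would then be: top.plot.barh(fontsize=15)
```

With the real table, `top_counts(PatientInfo, 'province')` would presumably give the same series that the province chart plots.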
The relation between sex and state among patients.
The relation between the top five infection_case with most patients and state among patients.
top_five_infection_case=PatientInfo[PatientInfo['infection_case'].isin(['contact with patient','etc','overseas inflow','Guro-gu Call Center','Shincheonji Church'])]
sns.countplot(x="state",hue="infection_case",data=top_five_infection_case)
Conclusion:
After this preliminary visual analysis, we find that there are about 300 more female patients than male patients; most patients are aged 20 to 60; the city with the most patients is Gyeongsan-si; most patients are from the provinces Gyeongsangbuk-do, Gyeonggi-do and Seoul; and the most common infection cases are 'contact with patient', 'etc' and 'overseas inflow'.
Comparing the states of patients (isolated / released / deceased) with other attributes, we find that male patients may be somewhat more likely to die than female patients: there are more female patients than male patients, yet fewer deceased female patients than deceased male patients.
We also find that younger patients may be more likely to be released than older patients, and older patients may be more likely to die than younger patients.
We also find that contact with patients in one's own city and province is the main reason people get infected.
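The comparison between sexes above is based on raw counts. A within-group share makes the same comparison more direct. The sketch below uses made-up toy records (not the real PatientInfo data) and assumes the `sex` and `state` columns used in the plots.

```python
import pandas as pd

# Hypothetical toy records; the real input would be PatientInfo.
toy = pd.DataFrame({
    "sex": ["female"] * 6 + ["male"] * 4,
    "state": ["released", "released", "released", "isolated", "isolated",
              "deceased", "released", "isolated", "deceased", "deceased"],
})

# Row-normalised table: each row sums to 1, so the 'deceased' column
# is the share of deceased patients within each sex.
share = pd.crosstab(toy["sex"], toy["state"], normalize="index")
print(share.round(2))
```

Patients who are still isolated have no final outcome yet, so these shares are only a rough indication rather than a fatality rate.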
fig,ax=plt.subplots(figsize=(12,6))
Time.plot(marker='o',ms=2,lw=1,ax=ax)
fig.autofmt_xdate()
plt.legend(bbox_to_anchor= [1,1])
plt.title('COVID-19 Data of South Korea',size=15)
plt.ylabel('number')
plt.grid(axis='y')
plt.show()
The figure is not very clear because the accumulated test and negative numbers are much larger than the other numbers. Hence we will focus on the numbers of confirmed, released and deceased patients.
Time.drop(['test','negative'],axis=1,inplace=True)
fig,ax=plt.subplots(figsize=(12,6))
Time.plot(marker='o',ms=2,lw=1,ax=ax)
fig.autofmt_xdate()
plt.legend(bbox_to_anchor= [1,1])
plt.title('COVID-19 Data of South Korea',size=15)
plt.ylabel('number')
plt.grid(axis='y')
plt.show()
#processing of Dataframe TimeAge to make it easy to draw clear figures.
Age0=TimeAge[TimeAge['age']=='0s']
Age0=Age0.rename(columns={'confirmed':'confirmed_0s','deceased':'deceased_0s'})
Age1=TimeAge[TimeAge['age']=='10s']
Age1=Age1.rename(columns={'confirmed':'confirmed_10s','deceased':'deceased_10s'})
Age2=TimeAge[TimeAge['age']=='20s']
Age2=Age2.rename(columns={'confirmed':'confirmed_20s','deceased':'deceased_20s'})
Age3=TimeAge[TimeAge['age']=='30s']
Age3=Age3.rename(columns={'confirmed':'confirmed_30s','deceased':'deceased_30s'})
Age4=TimeAge[TimeAge['age']=='40s']
Age4=Age4.rename(columns={'confirmed':'confirmed_40s','deceased':'deceased_40s'})
Age5=TimeAge[TimeAge['age']=='50s']
Age5=Age5.rename(columns={'confirmed':'confirmed_50s','deceased':'deceased_50s'})
Age6=TimeAge[TimeAge['age']=='60s']
Age6=Age6.rename(columns={'confirmed':'confirmed_60s','deceased':'deceased_60s'})
Age7=TimeAge[TimeAge['age']=='70s']
Age7=Age7.rename(columns={'confirmed':'confirmed_70s','deceased':'deceased_70s'})
Age8=TimeAge[TimeAge['age']=='80s']
Age8=Age8.rename(columns={'confirmed':'confirmed_80s','deceased':'deceased_80s'})
result=pd.merge(Age0,Age1,on='date')
result=pd.merge(result,Age2,on='date')
result=pd.merge(result,Age3,on='date')
result=pd.merge(result,Age4,on='date')
result=pd.merge(result,Age5,on='date')
result=pd.merge(result,Age6,on='date')
result=pd.merge(result,Age7,on='date')
result=pd.merge(result,Age8,on='date')
result.set_index(['date'],inplace=True)
result.head(5)
fig,ax=plt.subplots(figsize=(16,10))
result.plot(marker='o',ms=2,lw=1,ax=ax)
fig.autofmt_xdate()
plt.legend(bbox_to_anchor= [1,1])
plt.title('COVID-19 Data of Different Ages',size=15)
plt.ylabel('number')
plt.grid(axis='y')
plt.show()
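The filter, rename and merge steps used to build the wide per-age table can probably be expressed more compactly with a pivot. The following is a sketch on a made-up miniature of TimeAge, assuming one row per (date, age) pair; it is an alternative formulation, not the code that produced the figure.

```python
import pandas as pd

# Hypothetical miniature of TimeAge: one row per (date, age group).
toy = pd.DataFrame({
    "date": ["2020-03-02", "2020-03-02", "2020-03-03", "2020-03-03"],
    "age": ["0s", "10s", "0s", "10s"],
    "confirmed": [1, 2, 3, 4],
    "deceased": [0, 0, 0, 1],
})

# One column per (measure, age group), indexed by date.
wide = toy.pivot(index="date", columns="age", values=["confirmed", "deceased"])
# Flatten the two-level column index into names like 'confirmed_0s'.
wide.columns = [f"{measure}_{age}" for measure, age in wide.columns]
print(wide)
```

Unlike the merge chain, the pivot keeps only the requested value columns, so no duplicated `time` or `age` columns are carried along.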
#The processing is the same as the one above.
TimeGender.drop(['time'],axis=1,inplace=True)
male=TimeGender[TimeGender['sex']=='male']
male=male.rename(columns={'confirmed':'confirmed_male','deceased':'deceased_male'})
female=TimeGender[TimeGender['sex']=='female']
female=female.rename(columns={'confirmed':'confirmed_female','deceased':'deceased_female'})
result=pd.merge(female,male,on='date')
result.set_index(['date'],inplace=True)
result.head(5)
fig,ax=plt.subplots(figsize=(16,8))
result.plot(marker='o',ms=2,lw=1,ax=ax)
fig.autofmt_xdate()
plt.legend(bbox_to_anchor= [1,1])
plt.title('COVID-19 Data of Different Genders',size=15)
plt.ylabel('number')
plt.grid(axis='y')
plt.show()
#It will be a mess if we draw the information of all provinces, so we choose to draw the figure of the top 5 provinces with the most patients.
TimeProvince.drop(['time'],axis=1,inplace=True)
p1=TimeProvince[TimeProvince['province']=='Gyeongsangbuk-do']
p1=p1.rename(columns={'confirmed':'confirmed_Gyeongsangbuk-do','released':'released_Gyeongsangbuk-do','deceased':'deceased_Gyeongsangbuk-do'})
p2=TimeProvince[TimeProvince['province']=='Gyeonggi-do']
p2=p2.rename(columns={'confirmed':'confirmed_Gyeonggi-do','released':'released_Gyeonggi-do','deceased':'deceased_Gyeonggi-do'})
p3=TimeProvince[TimeProvince['province']=='Seoul']
p3=p3.rename(columns={'confirmed':'confirmed_Seoul','released':'released_Seoul','deceased':'deceased_Seoul'})
p4=TimeProvince[TimeProvince['province']=='Chungcheongnam-do']
p4=p4.rename(columns={'confirmed':'confirmed_Chungcheongnam-do','released':'released_Chungcheongnam-do','deceased':'deceased_Chungcheongnam-do'})
p5=TimeProvince[TimeProvince['province']=='Busan']
p5=p5.rename(columns={'confirmed':'confirmed_Busan','released':'released_Busan','deceased':'deceased_Busan'})
result=pd.merge(p1,p2,on='date')
result=pd.merge(result,p3,on='date')
result=pd.merge(result,p4,on='date')
result=pd.merge(result,p5,on='date')
result.set_index(['date'],inplace=True)
result.head(5)
fig,ax=plt.subplots(figsize=(16,10))
result.plot(marker='o',ms=2,lw=1,ax=ax)
fig.autofmt_xdate()
plt.legend(bbox_to_anchor= [1,1])
plt.title('COVID-19 Data of Different Provinces',size=15)
plt.ylabel('number')
plt.grid(axis='y')
plt.show()
Conclusion:
Through these time series figures, I find that from February 22, 2020, South Korea began to increase the number of daily tests, and the number of newly diagnosed people also increased rapidly. Most newly confirmed patients are in the 20s and 50s age groups. There are more newly diagnosed female patients than male patients. A large number of newly confirmed cases came from the province of Gyeongsangbuk-do, and the number of newly confirmed cases in other provinces gradually increased a few days later.
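The plotted series are accumulated counts, so statements about newly confirmed cases rely on day-to-day differences. A minimal sketch of how daily increments could be derived, assuming the counts are running totals ordered by date; the toy numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical cumulative counts in the shape of the Time table.
toy = pd.DataFrame(
    {"confirmed": [10, 15, 40, 100], "test": [100, 200, 900, 2500]},
    index=pd.to_datetime(["2020-02-20", "2020-02-21", "2020-02-22", "2020-02-23"]),
)

# diff() subtracts the previous row, turning running totals into daily
# increments; the first row has no predecessor and is dropped.
daily = toy.diff().dropna().astype(int)
print(daily)
```

Plotting `daily` instead of the running totals would show directly on which days testing and new confirmations jumped.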
2.2 Prediction of the Recovery Time of the Patients
2.2.1 Data Preprocessing
#count the missing values in each column
index_array=np.sum(PatientInfo.isnull()==True,axis=0)
print(index_array)
print(PatientInfo.shape[0])
I want to build a regression model to predict the recovery time of the patients.
First I preprocess the data. It is easy to define the ‘recovery time’ as the difference between ‘released_date’ and ‘confirmed_date’. For this task I simply drop the ‘deceased’ patients: I keep the rows whose released_date is not null, calculate the recovery time, and join the result with the table Region to get more information.
Finally I delete some useless columns and apply encoding, discretization and normalization.
data_released=PatientInfo[PatientInfo['released_date'].notnull()]
data_released=data_released[data_released['state']=='released']
data_released['recover_days']=(pd.to_datetime(data_released['released_date'])-pd.to_datetime(data_released['confirmed_date'])).dt.days
Region=pd.read_csv('Region.csv')
result=pd.merge(data_released,Region,on=['city','province'])
#count the missing values in each column of the new table
index_array=np.sum(result.isnull()==True,axis=0)
print(index_array)
We notice that columns with too many missing values should be ignored. Since birth_year is more precise than age, we first fill the missing values in age and then use age to fill the missing birth_year values with the midpoint of the age group, i.e. a patient in their 20s is treated as 25 years old, so the birth_year is 2020 - 25 = 1995.
#fill the missing values of the column age with the mode '20s'
result['age']=result['age'].fillna('20s')
#decade index of the age group: '20s' -> 2 (dropping the trailing 's' also handles '100s')
result['age_int']=result['age'].apply(lambda x: int(x[:-1])//10)
result['birth_year']=result['birth_year'].fillna(2020-5-result['age_int']*10)
index_array=np.sum(result.isnull()==True,axis=0)
print(index_array)
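The encoding and normalization steps mentioned in the preprocessing plan are not shown in this section. The following is a minimal sketch of one common way to do them, using made-up rows; the choice of one-hot encoding is an assumption, while the min-max formula matches the one applied after PCA later in this document.

```python
import pandas as pd

# Hypothetical rows; column names follow the tables used in this section.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age_int": [2, 5, 7],
    "elderly_population_ratio": [12.0, 18.0, 30.0],
})

# Encoding: turn the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

# Normalization: min-max scale every column to the range [0, 1].
# (Assumes no column is constant, otherwise the denominator is zero.)
scaled = (encoded - encoded.min()) / (encoded.max() - encoded.min())
print(scaled)
```

If the scaled features feed a model that is evaluated on a test set, the minimum and maximum would normally be computed on the training rows only.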
It is well known that how long a confirmed patient takes to recover depends mainly on their physical health and on whether they have underlying diseases. We do not have any data on the health indicators of patients, so it is very hard to build an accurate model.
I will compare some models to see how they work on this dataset.
First I divide the samples into a training set and a test set, with the test set accounting for 30% of the samples.
I find that no model is accurate, because the most important information, i.e. the patient's physical health and whether they have underlying diseases, is unknown.
Comparing the four models, I find that the Assembled (ensemble) model did worst, possibly because it overfits this dataset.
Among the other three models, LassoRegression did best: it uses LASSO regression with an L1 penalty, which has a very good variable selection effect.
RidgeRegression uses ridge regression with an L2 penalty, which shrinks the coefficients but does not set any of them exactly to zero, so it does not select variables the way the L1 penalty does; it did better than LinearRegression but worse than LassoRegression.
LinearRegression did worst of the three non-ensemble models because it uses no regularization to deal with multicollinearity.
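The code that fitted the four models is not included in this section. A minimal sketch of such a comparison with scikit-learn is given below. It runs on synthetic data, and the `alpha` values are illustrative assumptions rather than the settings used for the results discussed above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only the first two of five features matter.
rng = np.random.RandomState(33)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 70% training set, 30% test set, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=33
)

# The alpha values are illustrative, not tuned.
models = {
    "LinearRegression": LinearRegression(),
    "RidgeRegression": Ridge(alpha=1.0),
    "LassoRegression": Lasso(alpha=0.1),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=33),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (r2_score(y_test, pred), mean_squared_error(y_test, pred))
    print(f"{name}: R2={scores[name][0]:.3f}  MSE={scores[name][1]:.3f}")
```

On the real data the ranking may differ from this toy run; the sketch only shows the mechanics of computing the R_square_value and MSE_value per model.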
2.2.3 Feature Selection and Model Selection
We know that not all features in the dataset affect the recovery time of a patient: some columns may be multicollinear and some may be unrelated to the recovery time.
One way to choose features is iteration with a greedy algorithm. In other words, we choose a subset of the features, build a model on the training set, and then test the model on the test set. If the model works better than the previous one, we keep the selection; otherwise we reselect features and build models again, until the model can no longer be improved by selecting features.
A greedy algorithm can find a better model, but the result may not be a globally optimal solution because the search may get stuck in a local optimum.
The following is the new model I built after variable selection.
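The exact selection procedure is not shown, so the following is only a sketch of one greedy variant, forward selection: start from no features and repeatedly add the single feature that improves the score the most. The function name `greedy_forward_selection` and the synthetic data are hypothetical. The sketch scores candidates on a separate validation split, because choosing features against the test set tends to make the final test score optimistic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def greedy_forward_selection(X_train, y_train, X_val, y_val):
    """Add one feature at a time while the validation R^2 keeps improving."""
    remaining = list(range(X_train.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score every candidate feature that could be added next.
        trial_scores = []
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X_train[:, cols], y_train)
            trial_scores.append((model.score(X_val[:, cols], y_val), j))
        score, j = max(trial_scores)
        if score <= best_score:
            break  # no candidate helps any more (possibly a local optimum)
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

# Synthetic data: features 0 and 1 carry the signal, 2 to 4 are noise.
rng = np.random.RandomState(33)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=33
)

selected, best_score = greedy_forward_selection(X_train, y_train, X_val, y_val)
print(selected, round(best_score, 3))
```

Because each step only looks one feature ahead, the search can miss combinations that are useful only together, which is the local-optimum risk described above.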
# Divide the samples into training set and test set.
col1=['elementary_school_count','elderly_population_ratio','latitude','elderly_alone_ratio','nursing_home_count','age_int','recover_days']
temp1=result[col1]
y=temp1['recover_days']
X=temp1.drop(['recover_days'],axis=1)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=33)
After feature selection, the R_square_value of all four models increases and the MSE_value of all four models decreases, which means that all four models are improved.
The best model is the Assembled model, GradientBoostingRegressor, because its R_square_value is the largest and its MSE_value is the smallest. However, the assembled model has poor interpretability.
#The dimension of the final data is 3
pca=PCA(n_components=3)
X_pca=pca.fit_transform(data_pca)
df=pd.DataFrame(X_pca)
df= (df-df.min())/(df.max()-df.min())
df
After using PCA for data preprocessing, we did not get a better model as we had expected. The result is much worse than the result of using the greedy algorithm for feature selection.
My guess is that PCA is not a suitable preprocessing method for this dataset.
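One way to probe this guess is to check how much of the total variance the three retained components keep. The sketch below uses synthetic data in place of `data_pca`, and it illustrates the diagnostic only, not the original result.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for data_pca: 6 features, 2 of them nearly redundant.
rng = np.random.RandomState(0)
base = rng.normal(size=(200, 4))
data = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1] + 0.01 * rng.normal(size=200),
])

pca = PCA(n_components=3)
pca.fit(data)

# Fraction of the total variance kept by each of the 3 components.
print(pca.explained_variance_ratio_)
print("kept in total:", pca.explained_variance_ratio_.sum())
```

Even a high retained share would not guarantee a good regression, since PCA picks directions by variance alone and never looks at recover_days.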
# generate the linkage matrix
Z=linkage(a,'complete','minkowski')
# calculate full dendrogram
plt.figure(figsize=(25,10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z,no_labels=True)
plt.show()
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # show only the last p merged clusters
    show_leaf_counts=False,  # otherwise numbers in brackets are counts
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()
#We use the MDS method to reduce the dimension of the distance matrix, and use different colors to represent the clustering results.
seed=np.random.RandomState(seed=3)
mds=manifold.MDS(n_components=2,max_iter=3000,eps=1e-9,random_state=seed,n_jobs=1)
pos=mds.fit(a).embedding_
k=3
clusters=fcluster(Z,k,criterion='maxclust')
plt.figure(figsize=(10,8))
plt.scatter(pos[:,0],pos[:,1],c=clusters,cmap='prism')# plot points with cluster dependent colors
plt.show()
We use the MDS method to reduce the dimension of the distance matrix and use different colors to represent the clustering results. The figure above shows that the result is good.
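The construction of the feature matrix `a` is not shown. Complete-linkage clustering depends on distances, so features on very different scales are usually standardized first. The sketch below does this on synthetic data; the standardization step is an assumption, not something the original code is known to do. The Minkowski metric with its default `p=2` is the Euclidean distance, which the sketch names explicitly.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the feature matrix `a`: three well-separated groups.
rng = np.random.RandomState(3)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
a_toy = np.vstack([c + 0.3 * rng.normal(size=(10, 2)) for c in centers])

# Standardize each column so no single feature dominates the distances.
a_std = (a_toy - a_toy.mean(axis=0)) / a_toy.std(axis=0)

Z_toy = linkage(a_std, method="complete", metric="euclidean")
labels = fcluster(Z_toy, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster sizes
```

With `criterion='maxclust'`, `t` is the maximum number of flat clusters, which corresponds to `k=3` in the code above.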
The cities are divided into three different risk levels: high, middle, or low.
In the high-risk cities (cluster 3), the number of confirmed patients, the elderly_alone_ratio and the elderly_population_ratio are the largest, but the nursing_home_count and the academy_ratio are the smallest. It is reasonable to classify places with a high number of diagnosed patients, serious aging, a small number of nursing homes and a low education level as high-risk areas.
In the mid-risk cities (cluster 1), although the number of confirmed patients is large, the nursing_home_count and the academy_ratio are the largest and the elderly_population_ratio and the elderly_alone_ratio are the smallest, which means there are more medical resources for every patient.
In the low-risk cities (cluster 2), the number of confirmed patients is the smallest and the academy_ratio, elderly_population_ratio, elderly_alone_ratio and nursing_home_count all take middle values.
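Cluster descriptions like the ones above can be read off a table of per-cluster feature means. The following sketch uses made-up city rows; the numbers are hypothetical and only chosen to mirror the pattern described, with cluster labels as returned by fcluster.

```python
import pandas as pd

# Hypothetical city-level table with the cluster labels attached.
toy = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "confirmed": [50, 70, 5, 7, 90, 110],
    "elderly_population_ratio": [10.0, 12.0, 15.0, 17.0, 25.0, 27.0],
    "nursing_home_count": [900, 1100, 400, 600, 100, 300],
})

# Mean of every feature within each cluster.
profile = toy.groupby("cluster").mean()
print(profile)

# Cluster with the highest mean for each feature.
print(profile.idxmax())
```

Comparing the rows of `profile` is what supports labels such as high, middle or low risk; the labels themselves remain a judgment call.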
If I were the policy maker, I would distribute more medical resources to cities with high level of risk, and send medical teams to assist high-risk cities.
3. Conclusion:
By analyzing these data from South Korea, I found that COVID-19 is highly transmissible and that people of all ages are easily infected. The main transmission route is contact transmission, so avoiding gatherings and wearing masks are good protective measures. At the same time, we should allocate medical resources and send medical teams to high-risk areas to help them fight the epidemic and overcome difficulties.