Sepsis early warning method based on deep learning model GPT-2
Technical Field
The invention belongs to the field of medical data mining, and relates to an early warning method for sepsis based on a GPT-2 model.
Background
Sepsis is a life-threatening systemic inflammatory response syndrome caused by infection, and is one of the most common high-risk complications and leading causes of death among ICU patients. An estimated 30 million people worldwide suffer from sepsis each year, and the number of sepsis deaths exceeds 6 million, so both the cost and the risk of treating sepsis are very high. Because of its morbidity, mortality and expensive treatment, sepsis has become a public health problem of global concern. The clinical definition of sepsis has evolved from Sepsis 1.0 to Sepsis 3.0 and continues to change. Clinical research on the pathogenesis of sepsis has made some progress, but the pathogenesis is complex and involves many variable factors, and diagnostic accuracy still needs to be improved. Studies have shown that early detection of sepsis and timely antibiotic treatment are critical to reducing the risk of death in septic patients, with mortality increasing by 4%-8% for every hour that treatment is delayed. Early identification and timely treatment of patients who are likely to develop sepsis are therefore of great research value and significance for improving the survival rate of ICU patients. Most current research is conducted from the medical point of view and uses models based on statistical analysis or simple models such as logistic regression and decision trees; few researchers apply deep learning models to the medical field. Existing learning methods require manual feature selection in advance and do not fully utilize all the features routinely collected in an ICU environment, so the learned model may ignore potential features with complex nonlinear structure, and these potential features may be closely related to the development and prediction of sepsis.
Furthermore, the prediction of sepsis should be updated continuously as the patient's clinical data in the ICU is updated, and current methods cannot handle time-varying inputs. In most fields, deep learning models are better at processing large amounts of rapidly changing, high-dimensional, complex data, and more and more medical staff expect deep learning methods to be applied to the mining of medical data so as to improve the understanding of diseases and the efficiency of their diagnosis.
Disclosure of Invention
To solve the problems of difficult clinical diagnosis and low diagnostic accuracy of sepsis in ICU patients in the prior art, the invention provides a sepsis early warning method based on the deep learning model GPT-2. Following a data mining process, the method extracts the values of multiple physiological indexes over several days of an ICU stay from the electronic medical record, applies preprocessing operations such as data cleaning and sample imbalance handling, constructs a prediction model with an improved GPT-2 model, and outputs the probability that the patient develops sepsis on different days in the ICU, thereby providing early warning and reducing the patient's risk of death from sepsis. The method makes maximal use of the indexes routinely collected from the patient in the ICU and predicts the patient's risk as it changes over time. Much as an intensive care doctor repeatedly analyses and updates patient management during daily ward rounds, the method combines the patient's clinical manifestations over the preceding days and makes its prediction from a large amount of available data, so that it is timely and more accurate.
The purpose of the invention is realized by the following technical scheme:
A sepsis early warning method based on a deep learning model GPT-2, wherein characteristic variables of a patient in an ICU within a selected time span are extracted to form a high-dimensional time-varying sequence; after data preprocessing, the sequence is input into an improved GPT-2 model, which combines the patient's recent clinical manifestations to extract an effective representation closest to the current condition of the ICU patient; and the obtained representation is input into a fully-connected feedforward network layer to predict the probability that the patient will suffer from sepsis in the following period, the method specifically comprising the following steps:
step 1: extracting, from an electronic medical record or medical data set, a plurality of predictive characteristic variable sequences of a patient who has been in the ICU for a number of days, and ordering the characteristic variable sequences by time, wherein: the predictive characteristic variable sequence is a high-dimensional time-varying sequence, obtained by selecting a required time span and extracting, in units of days, the characteristic variable values of patients admitted to an ICU ward within that time span, to form a plurality of characteristic variable sequences changing over time; the characteristic variables mainly comprise vital sign variables, laboratory measurement indexes, medication records, demographic information and the like;
step 2: preprocessing the extracted patient data, wherein the preprocessing comprises variable screening, missing value filling, abnormal value processing, feature extraction, sample normalization processing and unbalanced sample processing;
step 3: after data preprocessing, inputting the data into a GPT-2 model, wherein the GPT-2 model comprises an input module, a processing module and a prediction module, wherein:
the input module mainly comprises an embedding layer and is used for converting the preprocessed clinical medical data into a time sequence that can be processed by a deep learning model;
the processing module mainly comprises an attention mechanism layer and a fully-connected feedforward neural network layer; its core function is to apply complex nonlinear transformations to the time sequence obtained by the input module, to mine potential characteristics related to sepsis, and to combine all the obtained characteristics to represent the current condition of the patient;
the prediction module mainly consists of a fully-connected feedforward neural network layer, and maps the output of the processing module to a probability value, which is the probability, predicted by the model from the patient's clinical data to date, that the patient will develop sepsis in the following period;
the method comprises the following specific steps:
(1) given input $X = (x_1, x_2, \ldots, x_t)$ and labels $Y = (y_1, y_2, \ldots, y_t)$, wherein: $t$ is the maximum time span over which ICU patient data is extracted, $x_i$ is the set of values of the characteristic vector sequence of the ICU patient on day $i$, and $y_i$ indicates whether the input ICU patient is septic on day $i$;
(2) inputting $X$ and $Y$ into the input module of the GPT-2 model, regarding the input feature vectors as different word vectors, and inputting them into an embedding layer to obtain the embedded representation of the feature vectors, $h_0 = X W_e$, wherein: $W_e$ is the embedded vector representation of each feature obtained through training;
(3) passing $h_0$ into the processing module of the GPT-2 model to obtain:
$$h_m = \mathrm{gpt\_layer}(h_{m-1}), \quad m \in [1, t],$$
wherein: $h_m$ is the representation of the patient's feature vectors over $m$ days in the ICU, and $h_{m-1}$ is the representation of the patient's feature vectors over $m-1$ days in the ICU;
(4) inputting $h_m$ obtained from the processing module into the prediction module of the GPT-2 model to predict the label $y_m$:
$$P(y_m \mid x_1, x_2, \ldots, x_m) = \mathrm{sigmoid}(h_m W_y),$$
wherein: $y_m$ denotes the predicted result on day $m$, and $W_y$ is the parameter matrix of the prediction output;
step 4: training the GPT-2 model, finding the optimal parameters through training, and continuously adjusting them so that the effect of the GPT-2 model is stable and optimal, with the following specific steps:
(1) dividing the data set into a training set, a validation set and a test set, wherein: the GPT-2 model is trained only on the training set, the validation set is used only for hyperparameter tuning, and the test set is used only for evaluating the GPT-2 model;
(2) training the GPT-2 model with a binary cross entropy loss function, which in its standard form is:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i \mid x) + (1 - y_i)\log\left(1 - p(y_i \mid x)\right)\right],$$
wherein: $N$ is the number of samples, $y_i$ is the true label, and $p(y_i \mid x)$ is the probability of the patient suffering from sepsis at the current input;
(3) evaluating the GPT-2 model using the precision P, the recall R, the F1-score value F1, and the ROC_AUC score:
$$P = \frac{T_p}{T_p + F_p}, \quad R = \frac{T_p}{T_p + F_n}, \quad F1 = \frac{2PR}{P + R},$$
wherein: $T_p$ is the number of samples correctly predicted as diseased, $F_p$ is the number of samples incorrectly predicted as diseased, and $F_n$ is the number of samples incorrectly predicted as not diseased; the ROC curve is drawn with the false positive rate (FPR) as the abscissa and the true positive rate (TPR) as the ordinate, and AUC is the area enclosed by the ROC curve and the abscissa;
(4) adjusting the number of training epochs, repeatedly training and continuously tuning and optimizing the GPT-2 model, and terminating the training when the loss value of the GPT-2 model is basically stable or the AUC value no longer rises, and no overfitting has occurred;
step 5: after the GPT-2 model is trained, combining the latest clinical data of the patient with the past clinical manifestations to predict sepsis.
Compared with the prior art, the invention has the following advantages:
1. Information in the ICU environment is overloaded. The prior art generally uses manually processed features and does not fully utilize all the features routinely collected in the ICU environment, so the learned model may ignore complex nonlinear potential features that may be closely related to the prediction of sepsis. The invention uses the clinical data of a patient over a number of days in the ICU, including the basic features collected in the ICU environment. After preprocessing by variable screening, missing value filling, abnormal value processing, feature extraction and the like, the high-dimensional clinical data is input into the GPT-2 model to learn potential features built by complex nonlinear transformations, and the overloaded information in the ICU environment is distilled, as far as possible, into the factors most relevant to the patient at any given moment, so as to predict whether the patient is at risk of sepsis.
2. The prior art generally adopts models based on statistical analysis and simple models such as logistic regression and decision trees, whereas the invention adopts the GPT-2 model, which handles high-dimensional complex data better and extracts the nonlinearly varying potential features associated with the onset of sepsis. Because the existing pre-trained GPT-2 model was trained on corpora from non-medical domains (mainly for natural language processing), it cannot be used directly to process clinical data; the invention therefore improves it accordingly and divides it into an input module, a processing module and a prediction module. In the input module, the embedded representation of the model input is trained on large-scale medical data, and the order of the elements in a characteristic vector sequence is allowed to vary, so that the embedding can handle characteristic vector sequences of clinical data; the processing module uses a simplified GPT-2 internal architecture, obtained through continuous tuning and optimization, to avoid overfitting; and the prediction module is modified into a module that performs classification prediction. The improved GPT-2 model is retrained on a medically relevant corpus, extracts the potential features in the clinical data, and completes the sepsis prediction task with higher classification precision, helping the patient to be treated in time.
3. Much as an intensive care doctor repeatedly analyses and updates the management of patients in the intensive care unit during daily ward rounds, the invention combines the patient's clinical manifestations over several days and predicts the patient's current risk from a large amount of available data, so that it is timely and more accurate.
Drawings
FIG. 1 is a flow chart of the sepsis early warning method based on the deep learning model GPT-2 of the present invention;
FIG. 2 is the overall structural framework of the GPT-2 model as improved by the present invention;
FIG. 3 shows the details of how the improved GPT-2 model processes its input and output;
FIG. 4 is a graph comparing accuracy during the model tuning process;
FIG. 5 is an example of sepsis prediction.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a sepsis early warning method based on a deep learning model GPT-2, which divides the sepsis early warning process into five stages: variable extraction, data preprocessing, model building, model training, and prediction result output. As shown in FIG. 1, the method specifically comprises the following steps:
Step 1, extracting from MIMIC III the 14-day clinical records of patients who stayed in the ICU ward for 2 days or more to constitute the initial data set, comprising 56841 samples in total (all records of each patient admitted to the ICU multiple times are considered); 119 detection indexes are extracted for each patient, mainly comprising vital sign variables (such as heart rate and diastolic pressure), laboratory measurement indexes (creatinine, blood urea nitrogen, etc.), medication records (cefepime, aspirin, etc.) and demographic information (weight, age, etc.); see Table 1 for details:
TABLE 1 sepsis monitoring index table
Step 2, after the data are extracted, preprocessing the data according to the varying degrees of missingness and abnormality they exhibit. Raw data sets generally suffer from human operating errors, equipment errors, inconsistent measurement time steps across variables and similar problems, so part of the data is distorted and of low quality; the data set cannot be used directly and requires preprocessing. The data preprocessing mainly comprises the following steps:
(1) variable screening: setting a missing rate threshold and removing the variables whose missing rate is larger than the threshold;
(2) missing value filling: missing values fall into two cases. In the first, some measured values are missing, and the missing value is replaced with the average value of the characteristic variable; in the second, the patient stayed in the ICU ward for fewer days than the time span of data extraction, and the characteristic vector sequence of the patient is filled with 0 for the period from the patient leaving the ICU ward to the end of the extraction window;
(3) abnormal value processing: abnormal values are detected with a percentile rule; a feature value that exceeds the 95th percentile of that feature across the samples is treated as abnormal and is replaced with the median of the feature;
(4) feature extraction: expanding the characteristic variables; following the way medical scoring systems use sample characteristic values, features are expanded with the maximum value, minimum value, average value and standard deviation;
(5) sample standardization: subtracting from each characteristic variable its own average value and dividing the result by its standard deviation, so that the different characteristic variables are brought onto a common scale (zero mean and unit standard deviation);
(6) sample imbalance treatment: because the proportion of diseased patients is small, the samples used for model learning are seriously imbalanced and the learned model may be seriously biased, so the SMOTE algorithm is adopted to balance the samples. SMOTE (Synthetic Minority Oversampling Technique) is an improvement on the random oversampling algorithm.
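The preprocessing steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used by the invention: the function names `preprocess`, `pad_stay` and `smote`, the default threshold of 0.5 and the neighbour count `k` are all hypothetical, and feature expansion (step 4) is left out. The `smote` function is a minimal version of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours).

```python
import numpy as np

def preprocess(data, max_missing_rate=0.5):
    """Clean a float array (n_samples, n_features) in which NaN marks a missing value."""
    # (1) Variable screening: drop features whose missing rate exceeds the threshold.
    keep = np.isnan(data).mean(axis=0) <= max_missing_rate
    data = data[:, keep].copy()

    # (2) Missing-value filling: replace each NaN with the mean of its feature.
    rows, cols = np.where(np.isnan(data))
    data[rows, cols] = np.nanmean(data, axis=0)[cols]

    # (3) Abnormal values: anything above the feature's 95th percentile
    #     is replaced with the feature's median.
    p95 = np.percentile(data, 95, axis=0)
    data = np.where(data > p95, np.median(data, axis=0), data)

    # (5) Standardisation: subtract the mean, divide by the standard deviation.
    std = data.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return (data - data.mean(axis=0)) / std, keep

def pad_stay(days, t):
    """Zero-fill (or truncate) one patient's (n_days, n_features) record to t days."""
    out = np.zeros((t, days.shape[1]))
    n = min(t, days.shape[0])
    out[:n] = days[:n]
    return out

def smote(minority, n_new, k=5, seed=None):
    """(6) Create n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    k = min(k, n - 1)  # needs at least two minority samples
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)        # random minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])
```

In practice a maintained SMOTE implementation would normally be preferred over a hand-written one; the version here is only meant to show what the oversampling step does to the minority class.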
Step 3, inputting the preprocessed data into the constructed GPT-2 model. The invention uses a modified GPT-2 model to construct a supervised prediction model, which mainly comprises an input module, a processing module and a prediction module, as follows:
(1) Input module
Given input $X = (x_1, x_2, \ldots, x_t)$ and labels $Y = (y_1, y_2, \ldots, y_t)$, wherein: $t$ is the maximum time span over which ICU patient data is extracted, $x_i$ is the set of values of the characteristic vector sequence of the ICU patient on day $i$, and $y_i$ indicates whether the input ICU patient is septic on day $i$. $X$ and $Y$ are input into the input module of the GPT-2 model; the input feature vectors are regarded as different word vectors and are input into an embedding layer to obtain the embedded representation of the feature vectors, $h_0 = X W_e$, wherein: $W_e$ is the embedded vector representation of each feature obtained through training. Medical characteristic variables differ from the words of natural language: even when their positions differ, the composed medical feature vector sequence expresses the same meaning. In the input module, the invention therefore abandons the position encoding of GPT-2 and only performs the embedding operation on the input feature vectors.
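As a rough illustration of the input module, the sketch below computes $h_0 = X W_e$ for one patient with NumPy and adds no positional term. The embedding width `d_model = 32`, the random data and the random initial `W_e` are assumptions made for the example; in the method, `W_e` is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
t, n_features, d_model = 14, 119, 32   # d_model is an assumed embedding width

# One patient's preprocessed record: t days x n_features (random stand-in data).
X = rng.normal(size=(t, n_features))

# Embedding matrix W_e: one d_model-dimensional vector per feature.
# Randomly initialised here; learned by training in practice.
W_e = rng.normal(scale=0.02, size=(n_features, d_model))

# Embedded representation. No positional encoding is added.
h0 = X @ W_e

# Reordering the features, together with their embedding rows,
# leaves the embedded representation unchanged.
perm = rng.permutation(n_features)
h0_permuted = X[:, perm] @ W_e[perm]
```

Because each feature carries its own embedding row, the result depends on which features are present and on their values, not on the order in which they are listed, which is consistent with dropping the position encoding.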
(2) Processing module
The processing module consists of one gpt_layer, which consists of two sub-layers: a masked self-attention layer and a fully-connected feedforward network layer. The overall framework of the GPT-2 model is shown in FIG. 2. A residual connection and layer normalization are added to each sub-layer, so that the output of a sub-layer can be expressed as:
$$\mathrm{sub\_layer}_i^{\mathrm{output}} = \mathrm{LayerNorm}\left(\mathrm{sub\_layer}_i^{\mathrm{input}} + \mathrm{sub\_layer}_i\left(\mathrm{sub\_layer}_i^{\mathrm{input}}\right)\right),$$
wherein: $\mathrm{sub\_layer}_i^{\mathrm{input}}$ is the output of the preceding sub-layer, i.e. $\mathrm{sub\_layer}_{i-1}^{\mathrm{output}}$, so the output of a gpt_layer with two sub-layers can be calculated by:
$$\mathrm{gpt\_layer} = \mathrm{LayerNorm}\left(\mathrm{sub\_layer}_1^{\mathrm{output}} + \mathrm{sub\_layer}_2\left(\mathrm{sub\_layer}_1^{\mathrm{output}}\right)\right),$$
where LayerNorm denotes layer normalization. After $h_0$ is obtained, it is passed into the processing module of GPT-2, which gives:
$$h_m = \mathrm{gpt\_layer}(h_{m-1}), \quad m \in [1, t],$$
wherein: $h_m$ is the representation of the patient's feature vectors over $m$ days in the ICU; the specific flow is shown in FIG. 3.
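A minimal NumPy sketch of one gpt_layer, following the two sub-layer formulas above (residual connection, then LayerNorm), may help make the structure concrete. Several details are assumptions not stated in the text: a single attention head, a ReLU activation in the feedforward sub-layer, LayerNorm without learned scale and shift, and the reading that row $m$ of the output is the representation for day $m$, computed from days $1..m$ only because of the mask. The parameter dictionary `p` and its key names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def masked_self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention in which day m can only attend to days <= m."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    t = x.shape[0]
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True where column day > row day
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise fully-connected network (ReLU is an assumption)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def gpt_layer(h, p):
    """sub_layer_1 = masked attention, sub_layer_2 = feedforward, each with residual + LayerNorm."""
    out1 = layer_norm(h + masked_self_attention(h, p["Wq"], p["Wk"], p["Wv"]))
    return layer_norm(out1 + feed_forward(out1, p["W1"], p["b1"], p["W2"], p["b2"]))

# Usage with random (untrained) parameters.
rng = np.random.default_rng(0)
t, d = 5, 8
p = {"Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
     "Wv": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, 4 * d)), "b1": np.zeros(4 * d),
     "W2": rng.normal(size=(4 * d, d)), "b2": np.zeros(d)}
h = gpt_layer(rng.normal(size=(t, d)), p)   # shape (t, d)
```

The mask is what allows a prediction for day $m$ to be made from data available up to that day: changing a later day's input leaves the earlier rows of the output untouched.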
(3) Prediction module
The prediction module consists of a fully-connected feedforward neural network layer; it takes $h_m$ as input and uses this representation to predict the label $y_m$:
$$P(y_m \mid x_1, x_2, \ldots, x_m) = \mathrm{sigmoid}(h_m W_y),$$
wherein $y_m$ denotes the predicted result on day $m$, and $W_y$ is the parameter matrix of the prediction output.
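The prediction step amounts to a linear map followed by a sigmoid. A small sketch, assuming the day representations are stacked as rows of an array `h` and that `W_y` is a vector of length `d_model` (both names and shapes are illustrative):

```python
import numpy as np

def predict_proba(h, W_y):
    """Map day representations h (t, d_model) to per-day sepsis probabilities (t,)."""
    logits = h @ W_y
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid

# Usage with random stand-in values.
rng = np.random.default_rng(0)
h = rng.normal(size=(14, 32))     # e.g. output of the processing module
W_y = rng.normal(size=32)         # learned in practice
daily_risk = predict_proba(h, W_y)
```

Each entry of `daily_risk` lies strictly between 0 and 1 and can be read as the predicted probability for the corresponding day.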
Step 4, training the prediction model, finding the optimal parameters through training, and continuously adjusting them so that the model effect is stable and optimal. The data set is divided into a training set, a validation set and a test set in a 7:1:2 ratio; the model is trained only on the training set, the validation set is used only for hyperparameter tuning, and the test set is used only for evaluating the model.
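A 7:1:2 split can be sketched with the standard library alone. The function name `split_dataset`, the shuffling and the fixed seed are assumptions for illustration; the text does not say how samples are assigned to the three sets.

```python
import random

def split_dataset(samples, parts=(7, 1, 2), seed=0):
    """Shuffle and split samples into train/validation/test in the given proportions."""
    items = list(samples)
    random.Random(seed).shuffle(items)      # reproducible shuffle
    total = sum(parts)
    n_train = len(items) * parts[0] // total
    n_val = len(items) * parts[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(100))
```

Since the data set may contain several ICU stays of the same patient, one design choice worth considering is to split on patient identifiers rather than on individual records, so that records of one patient do not end up in both the training and the test set.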
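The GPT-2 model is trained with a binary cross entropy loss function, which in its standard form is:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i \mid x) + (1 - y_i)\log\left(1 - p(y_i \mid x)\right)\right],$$
wherein $N$ is the number of samples, $y_i$ is the true label, and $p(y_i \mid x)$ is the probability of the patient developing sepsis at the current input.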
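For reference, binary cross entropy can be computed as below. Averaging over the samples and clipping the probabilities away from 0 and 1 are assumptions of this sketch, chosen to keep the logarithm finite; the function name is illustrative.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# A model that always outputs 0.5 has a loss of log(2), about 0.693.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```

Confident correct predictions drive the loss towards 0, while confident wrong predictions are penalised heavily, which is why the loss is a natural training signal for a probability output.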
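The evaluation indexes used are the precision P, the recall R, the F1-score value F1 and the ROC_AUC score:
$$P = \frac{T_p}{T_p + F_p}, \quad R = \frac{T_p}{T_p + F_n}, \quad F1 = \frac{2PR}{P + R},$$
wherein: $T_p$ is the number of samples correctly predicted as diseased, $F_p$ is the number of samples incorrectly predicted as diseased, and $F_n$ is the number of samples incorrectly predicted as not diseased; the ROC curve is drawn with the false positive rate (FPR) as the abscissa and the true positive rate (TPR) as the ordinate, and AUC is the area enclosed by the ROC curve and the abscissa.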
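These metrics can be computed directly from labels and predictions. The sketch below is illustrative: the function names are hypothetical, and the AUC is computed through its rank interpretation (the probability that a randomly chosen diseased sample receives a higher score than a randomly chosen non-diseased one, ties counting one half), which equals the area under the ROC curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for 0/1 labels and 0/1 predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))   # needs at least one sample of each class

precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples and is meant for clarity; for a data set of tens of thousands of samples a sort-based implementation from an established library would be the practical choice.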
The number of training epochs is adjusted, and the model is repeatedly trained and continuously optimized; training is terminated when the loss value of the model is basically stable or the AUC value no longer rises, and no overfitting has occurred. The optimization process is shown in FIG. 4.
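One common way to realise a stopping rule of this kind is patience-based early stopping on the validation AUC. The sketch below is an assumption about how the rule could be automated, not a description of the procedure actually used: `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation AUC, and `patience` and `min_delta` are illustrative parameters.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=5, min_delta=1e-4):
    """Stop when the validation AUC has not improved for `patience` epochs.

    run_epoch(epoch) is a hypothetical callback: it trains one epoch and
    returns the validation AUC. Returns (best_epoch, best_auc).
    """
    best_auc, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        auc = run_epoch(epoch)
        if auc > best_auc + min_delta:
            best_auc, best_epoch = auc, epoch      # AUC still rising
        elif epoch - best_epoch >= patience:
            break                                  # AUC has plateaued
    return best_epoch, best_auc

# Usage with a recorded AUC curve standing in for real training.
history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72]
best_epoch, best_auc = train_with_early_stopping(
    lambda epoch: history[epoch], max_epochs=len(history), patience=3)
```

Monitoring the validation set rather than the training set is what guards against overfitting here: a training loss that keeps falling while the validation AUC stalls is the usual sign that further epochs no longer help.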
The accuracy, ROC and F1 values of the proposed model are superior to those of existing models; the comparison of accuracy and F1 score is shown in Table 2:
TABLE 2 Comparison of GPT-2 with the best-performing existing model
Step 5, after the model is trained, sepsis prediction is performed for a new patient: the probability that the patient develops sepsis on each day and the importance of each feature are output in sequence as the days progress, helping the doctor make clinical decisions. A specific example is shown in FIG. 5.