Machine learning-based blood donor identification and recruitment method
Technical Field
The invention relates to a method for identifying and recruiting blood donors, in particular to a machine learning-based method for identifying and recruiting blood donors, and belongs to the technical field of intelligent blood donation management.
Background
Blood is the source of life and a very important and scarce medical resource. In modern medicine, blood transfusion is an irreplaceable means of saving lives and treating the injured. Whether in planned treatment or emergency intervention, it depends on blood, which cannot be synthesized, cannot be stored for long periods, and cannot be replaced by other medicines. Blood transfusion is widely used in surgery, trauma, malignant tumors, pregnancy, hemophilia, and other conditions; it is a vital component of saving lives and improving health, and each year helps millions of patients with life-threatening conditions to extend their lives and improve their quality of life. Since unpaid blood donation work began in 1998, the number of unpaid blood donors nationwide has grown continuously for 20 years, from about 300,000 in 1998 to nearly 15 million in 2018. Although the total amount of blood collected nationally rose more than fourfold, from fewer than 5 million units in 1998 to more than 25 million units in 2018, supply has not kept pace with the growth in clinical demand. The contradiction between rigid clinical demand for blood and insufficient supply remains prominent. In economically developed large cities with abundant medical resources in particular, shortages have become the norm, and local, seasonal, and structural shortages are spreading widely and persistently. Blood shortages increasingly aggravate conflicts between doctors and patients, and blood safety risks remain.
At present, unpaid blood donations in China are collected mainly in a passive mode, in which blood supply institutions wait at street-side sites for donors to come forward. Even where active recruitment by short message (SMS), telephone, or mail is used, it lacks scientific analysis of the health status and donation willingness of past donors, so recruitment efficiency is generally low. Therefore, on the basis of establishing and improving a government-led, multi-department coordinated guarantee and incentive mechanism for unpaid blood donation, it is important to carry out technical research based on blood donation big data and to establish and optimize an efficient donor recruitment technology suited to unpaid blood donation work, thereby improving the efficiency and quality of recruitment and recruiting more donors. This is an important path for addressing the current "tight balance" of blood supply and for ensuring that clinical blood needs are met safely. The present invention aims to solve this technical problem.
Disclosure of Invention
The invention aims to solve the technical problem in the prior art described above and provides a machine learning-based blood donor identification and recruitment method.
To achieve this purpose, the invention adopts the following technical solution:
A machine learning-based blood donor identification and recruitment method comprises the following steps:
(1) establishing a blood donor feature library: collecting the blood donation history of each blood donor over the years, and designing and calculating the features of each blood donor from that history;
(2) constructing an xgboost model, wherein the model input is the features of a blood donor and the model output is the expectation that the blood donor is an effective blood donor;
(3) obtaining data on effective and ineffective blood donors from past SMS recruitment records, and training the xgboost model;
(4) for eligible blood donors, ranking them by the output of the xgboost model computed from their feature data, and recommending blood donors accordingly.
Preferably, in step (1), the designed and calculated features include age, sex, blood type, most recent donation amount, total donation amount, number of donations, donation interval, donation frequency, whether the blood test was qualified, education level, residence status, occupation, and donation reaction, wherein the donation frequency is calculated by formula (1).
Preferably, in step (1), when attribute values are missing, default values are used to fill them in.
Preferably, in step (2), the basic parameters of the xgboost model are as follows: a tree model is used as the base classifier, the number of iterations, that is, the number of trees used, is set to 500, and the maximum depth of each tree is set to 12.
Preferably, in step (2), the training parameters of the xgboost model are as follows: a logistic objective function is used to classify blood donors, the model evaluation function is set to the binary classification error rate, and the learning rate is 0.3; to avoid overfitting when each tree is trained, the sampling rate of the data set is set to 90%, and the proportion of features selected is also 90%.
Preferably, in step (3), the training steps are as follows:
(3.1) using past SMS recruitment records, tracking the persons who received the short messages; a person with a blood donation record within seven days is deemed an effective blood donor, and otherwise an ineffective blood donor;
(3.2) constructing the features of the collected effective and ineffective blood donors as of the time the short message was sent, to obtain the training data of the model;
(3.3) training the xgboost model with the training data as model input and whether the person is an effective blood donor as the expected output.
Preferably, step (4) comprises the following sub-steps:
(4.1) collecting blood donors whose donation interval complies with regulations (the interval between a whole blood or apheresis red cell donation and the next whole blood donation is not less than 6 months, and the interval between an apheresis platelet, plasma, or apheresis granulocyte donation and a whole blood donation is not less than 4 weeks) and who are healthy citizens aged eighteen to fifty-five, or repeat blood donors who have had no donation reaction, meet the health examination requirements, and are not more than sixty years old;
(4.2) obtaining the feature set of each such subject from the blood donor feature library;
(4.3) inputting the features of each blood donor into the trained xgboost model to obtain the expected output of whether the person is an effective blood donor, sorting by the output, and recommending the higher-ranked persons.
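The screening rule of sub-step (4.1) can be sketched as a small predicate. This is a minimal illustration under stated assumptions, not the blood center's actual implementation: the `Donor` record, its field names, and the function `is_eligible` are hypothetical; six months is approximated as 183 days; and a "repeat donor" is assumed to mean at least two prior donations.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical record layout; the field names are assumptions for illustration.
@dataclass
class Donor:
    birth_date: date
    last_donation_date: date
    last_donation_type: str      # "whole_blood", "red_cell", "platelet", "plasma", "granulocyte"
    donation_count: int
    had_donation_reaction: bool
    passed_health_check: bool

SIX_MONTHS = timedelta(days=183)   # assumption: 6 months approximated as 183 days
FOUR_WEEKS = timedelta(weeks=4)

def age_on(birth_date: date, today: date) -> int:
    """Age in completed years on `today`."""
    before_birthday = (today.month, today.day) < (birth_date.month, birth_date.day)
    return today.year - birth_date.year - before_birthday

def is_eligible(donor: Donor, today: date) -> bool:
    """Apply the age and donation-interval rules for a whole blood donation."""
    age = age_on(donor.birth_date, today)
    if 18 <= age <= 55:
        age_ok = True
    else:
        # Repeat donors with no donation reaction who meet the health
        # examination requirements may donate up to age sixty.
        age_ok = (55 < age <= 60
                  and donor.donation_count >= 2   # assumed meaning of "repeat donor"
                  and not donor.had_donation_reaction
                  and donor.passed_health_check)
    if donor.last_donation_type in ("whole_blood", "red_cell"):
        min_gap = SIX_MONTHS
    else:
        min_gap = FOUR_WEEKS
    interval_ok = today - donor.last_donation_date >= min_gap
    return age_ok and interval_ok
```

Only donors for whom `is_eligible` returns `True` would be passed on to sub-steps (4.2) and (4.3).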
Compared with the prior art, the invention uses the blood center's blood donation data accumulated over the years to construct a blood donor feature library, obtains training data from past SMS recruitment records, and trains an xgboost model on them. After training, when new recruitment short messages need to be sent, the data of eligible blood donors are screened, the xgboost model gives a reference identification result for each, the results are ranked, and recruitment subjects are selected according to the number of people required. Because the method is implemented with an xgboost model, it effectively mines the importance of each attribute and learns the internal relations among the attributes of blood donation subjects. Experimental results show that the classification precision of the method exceeds 80%, which is clearly superior to the prior art.
Drawings
Fig. 1 is a flow chart of the conventional procedure by which a blood center selects suitable recruitment subjects;
Note: 1. Age regulation: healthy citizens aged eighteen to fifty-five, or repeat blood donors who have had no donation reaction, meet the health examination requirements, and are not more than sixty years old;
2. Donation interval regulation: the interval between a whole blood or apheresis red cell donation and the next whole blood donation is not less than 6 months, and the interval between an apheresis platelet, plasma, or apheresis granulocyte donation and a whole blood donation is not less than 4 weeks;
Fig. 2 is a schematic diagram of the xgboost model used in the present invention.
Detailed Description
The technical contents of the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
The invention provides a method for improving the precision of SMS recruitment of blood donors by means of machine learning, described in detail as follows:
The SMS donor recruitment method commonly used by blood centers is shown in Fig. 1: donors whose donation interval and age comply with regulations are selected mainly from those who donated within the last year. Subjects who have donated recently have shown a recent willingness to donate and are reasonable recruitment targets, so the time of the most recent donation is an effective feature for predicting whether a person will donate. In addition, past recruitment experience at the blood center indicates that occupation and education level are important to whether a person donates blood. Table 1 shows the blood donation record of a blood donor, which contains basic personal characteristics such as age, sex, residence status, and donation reaction; this basic information also influences an individual's donation behavior. Through research, the inventors found that a donor's willingness does not depend on a single donation but is often related to the record of multiple donations, such as the number of donations, the total amount donated, and the donation frequency. Combining experience with experimental study, a feature library of 13 items was designed: age, sex, blood type, most recent donation amount, total donation amount, number of donations, donation interval, donation frequency, whether the blood test was qualified, education level, residence status, occupation, and donation reaction, wherein the donation frequency is calculated by formula (1).
Table 1. Blood donation record of a blood donor
Blood donation records often contain missing values, such as a missing education level or occupation. To handle this, default values replace the missing values: a missing education level is replaced by "junior high school or below"; a missing occupation is replaced by "other occupation", since missing occupations are mainly those not classified in the blood donation system; and a missing residence status is replaced by "temporary residence".
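The default-value replacement can be sketched as follows. The default values are those named in the text; the dictionary keys and the function name `fill_missing` are hypothetical and chosen only for illustration.

```python
# Default values named in the text; the keys are hypothetical field names.
DEFAULTS = {
    "education": "junior high school or below",
    "occupation": "other",
    "residence_status": "temporary residence",
}

def fill_missing(record: dict) -> dict:
    """Return a copy of `record` with missing or empty fields replaced by defaults."""
    filled = dict(record)
    for field, default in DEFAULTS.items():
        if filled.get(field) in (None, ""):
            filled[field] = default
    return filled
```

For example, `fill_missing({"age": 30, "education": None})` keeps the age and fills in all three defaults.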
In the embodiment of the present invention, the xgboost model shown in Fig. 2 is used. It takes decision trees as base classifiers and uses 500 trees. The model input is the features of a blood donor. The feature data are first routed through the first tree until they fall into a leaf node, and the value of that leaf node is the output value of the first tree. The same operation is then performed with the second tree, and so on. The output values of all the trees are added, and the expected output of whether the person is an effective blood donor is obtained by applying the logistic function to the sum. To avoid overfitting, each tree is trained on a 90% sample of the data set and on 90% of the features.
In one embodiment of the present invention, the input to the xgboost model is assumed to be $x$.
According to the xgboost model, using the first tree we obtain:
$$\hat{y}_1 = f_1(x) = w^{(1)}_{q_1(x)}$$
wherein $q_1(x)$ is the leaf node of the first tree into which the model input $x$ falls and $w^{(1)}_{q_1(x)}$ is the weight of that leaf node. The second tree is applied in the same way as the first, and so on by analogy, to obtain:
$$\hat{y}_2 = f_2(x) = w^{(2)}_{q_2(x)}$$
…
$$\hat{y}_K = f_K(x) = w^{(K)}_{q_K(x)}$$
where $K$ is the number of trees. The weights produced by the trees form the set $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_K\}$.
Adding them gives the final predicted target value:
$$\hat{y} = \sum_{k=1}^{K} f_k(x)$$
The final goal of recruitment is to obtain the expectation $p$ that the blood donor is an effective blood donation subject.
Therefore, the output must be transformed by the logistic function, which is expressed as:
$$p = \frac{1}{1 + e^{-\hat{y}}}$$
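The computation described above (each tree contributes the weight of the leaf the input falls into, the weights are summed, and the logistic function maps the sum to an expectation between 0 and 1) can be illustrated with a toy example. The nested-dictionary tree layout, the two feature names, and the leaf weights below are invented for illustration; they are not xgboost's internal representation and not values from the trained model.

```python
import math

# A toy tree is a nested dict: internal nodes test one feature against a
# threshold, leaves carry a weight. This layout is an assumption for
# illustration only.
def tree_output(tree: dict, x: dict) -> float:
    """Route x from the root to a leaf and return that leaf's weight."""
    node = tree
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

def predict_probability(trees: list, x: dict) -> float:
    """Sum the leaf weights over all trees, then apply the logistic function."""
    raw_score = sum(tree_output(tree, x) for tree in trees)
    return 1.0 / (1.0 + math.exp(-raw_score))

# Two hypothetical trees with made-up split features and weights.
trees = [
    {"feature": "donation_count", "threshold": 3,
     "left": {"leaf": -0.4}, "right": {"leaf": 0.6}},
    {"feature": "days_since_last_donation", "threshold": 365,
     "left": {"leaf": 0.3}, "right": {"leaf": -0.5}},
]
x = {"donation_count": 5, "days_since_last_donation": 200}
p = predict_probability(trees, x)   # raw score is 0.6 + 0.3 = 0.9
```

In the embodiment the same summation runs over 500 trees rather than two.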
Before the xgboost model is used for recruitment, it must be learned iteratively, that is, each tree must be generated; generating a tree mainly means selecting the best split features and leaf weights. The model determines the split features and leaf node weights of each tree according to the loss function. The calculation depends only on the first and second derivatives of the loss function, from which the currently optimal split feature and weights can be obtained directly, so computation is fast. The training process can be implemented with the third-party Python library xgboost, which provides a default loss function and also allows a custom loss function. In this embodiment, the default loss function performs better. It should be noted that the hyper-parameters used in the experiments, such as the number of decision trees and the learning rate, are the values that gave the best results on the validation set.
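The settings described for the xgboost model (tree base classifier, 500 trees, maximum depth 12, logistic objective, error-rate evaluation, learning rate 0.3, 90% data sampling and 90% feature sampling) could be written for the Python xgboost library roughly as below. The parameter names are those of the xgboost package; mapping the "selected feature proportion" to `colsample_bytree` is an assumption, and the usage in the trailing comment refers to a hypothetical feature matrix `X` and label vector `y`.

```python
# Parameter names follow the xgboost Python package; values come from the text.
params = {
    "booster": "gbtree",             # tree model as the base classifier
    "objective": "binary:logistic",  # logistic objective for two-class output
    "eval_metric": "error",          # binary classification error rate
    "eta": 0.3,                      # learning rate
    "max_depth": 12,                 # maximum depth of each tree
    "subsample": 0.9,                # fraction of training rows sampled per tree
    "colsample_bytree": 0.9,         # assumed mapping of the 90% feature proportion
}
NUM_BOOST_ROUND = 500                # number of boosting iterations (trees)

# Assumed usage, with hypothetical X (features) and y (0/1 labels):
#   import xgboost as xgb
#   dtrain = xgb.DMatrix(X, label=y)
#   booster = xgb.train(params, dtrain, num_boost_round=NUM_BOOST_ROUND)
```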
In the invention, model training is the process of determining how the tree models in the above formulas are split and what the leaf node weights are. The specific training steps are as follows:
(1) Using past SMS recruitment records, the persons who received short messages are tracked from the date the message was sent. A person who has a blood donation record within seven days, that is, who took part in blood donation, is deemed an effective blood donor; otherwise the person is an ineffective blood donor.
(2) A blood donor feature library is established for the collected effective and ineffective blood donors as of the time the short message was sent, and the features of the corresponding effective and ineffective blood donors are selected to obtain the data set of the model.
(3) The training data from step (2) are used as model input, with a predefined loss function and with whether the person is an effective blood donor as the expected output, to train the xgboost model and determine the optimal split features and leaf node weights of the tree models.
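The seven-day labeling rule of step (1) can be sketched as a short function. The function name `label_recipient` is hypothetical, and treating the window as inclusive of both the sending date and the seventh day is an assumption, since the text does not state the boundaries.

```python
from datetime import date, timedelta

RESPONSE_WINDOW = timedelta(days=7)

def label_recipient(sms_date: date, donation_dates: list) -> int:
    """Return 1 if any donation falls within seven days after the SMS date, else 0."""
    for donation_date in donation_dates:
        # Assumption: the window includes the sending date and day seven.
        if timedelta(0) <= donation_date - sms_date <= RESPONSE_WINDOW:
            return 1
    return 0
```

A recipient labeled 1 is an effective blood donor and a recipient labeled 0 is an ineffective one; these labels serve as the expected output in step (3).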
In the embodiment of the invention, according to the long-term SMS recruitment records, a total of 95,476 SMS recruitments were carried out from 2016 to 2019, yielding 56,026 effective blood donor records and 39,450 ineffective blood donor records. To balance the two classes, the same number of records as there are ineffective records is selected from the effective records to construct the data set. The data set is divided into a training set, a validation set, and a test set; the hyper-parameters and parameters of the model are learned, and the optimal tree model parameters and split features are determined.
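The class balancing and data set division can be sketched as follows. The text gives neither the split proportions nor the sampling procedure, so the 80/10/10 ratios, the random undersampling, and the function name `balance_and_split` are all assumptions for illustration.

```python
import random

def balance_and_split(effective, ineffective, ratios=(0.8, 0.1, 0.1), seed=0):
    """Undersample the larger class to a 1:1 ratio, then split into
    training, validation, and test sets of (record, label) pairs."""
    rng = random.Random(seed)
    n = min(len(effective), len(ineffective))
    samples = ([(row, 1) for row in rng.sample(effective, n)]
               + [(row, 0) for row in rng.sample(ineffective, n)])
    rng.shuffle(samples)
    n_train = int(ratios[0] * len(samples))
    n_val = int(ratios[1] * len(samples))
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test
```

With the counts reported above, this would keep 39,450 records of each class, for 78,900 records in total before splitting.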
After training, the model can report the importance of each feature to the classification; Table 2 gives the four most important features and their importance values. Once trained, the model can be used to identify effective blood donors.
Table 2. The four most important model features and their importance values
The specific steps of using the trained xgboost model to recommend blood donation subjects for recruitment are as follows:
(1) Blood donors whose age and donation interval comply with the relevant blood donation regulations are collected.
(2) The features of each subject are obtained from the blood donor feature library.
(3) The features of each blood donor are input into the trained xgboost model to obtain the expected output of whether the person is an effective blood donor; the donors are sorted by the output, and the higher-ranked persons are recommended.
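The ranking and recommendation in step (3) can be sketched as below. Here `score_fn` is a hypothetical stand-in for the trained model's predicted expectation, and `recommend` and the `(donor_id, features)` pair layout are illustrative names, not part of the described system.

```python
def recommend(candidates, score_fn, top_n):
    """Score each eligible donor and return the ids of the top_n highest scores.

    candidates: list of (donor_id, features) pairs
    score_fn:   callable mapping features to the model's expected output
    """
    scored = [(score_fn(features), donor_id) for donor_id, features in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # highest expectation first
    return [donor_id for _, donor_id in scored[:top_n]]
```

The value of `top_n` corresponds to the number of people required for a given recruitment round.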
Compared with the prior art, the invention uses the blood center's blood donation data accumulated over the years to construct the blood donor feature library, obtains training data from past SMS recruitment records, and trains the xgboost model on them. After training, when new recruitment short messages need to be sent, the data of eligible blood donors are screened, the xgboost model gives a reference identification result for each, the results are ranked, and recruitment subjects are selected according to the number of people required. Because the method is implemented with an xgboost model, it effectively mines the importance of each attribute and learns the internal relations among the attributes of blood donation subjects. Experimental results show that the classification precision of the method exceeds 80%, which is clearly superior to the prior art. The improved accuracy of donor recruitment saves manpower and material resources and raises the efficiency and quality of recruitment.
The machine learning-based blood donor identification and recruitment method provided by the invention has been explained in detail above. For those skilled in the art, any obvious modification made to it without departing from the spirit of the invention constitutes an infringement of the patent right of the invention and shall bear the corresponding legal liability.