CN111444686A - Medical data labeling method, device, storage medium and computer equipment - Google Patents

Medical data labeling method, device, storage medium and computer equipment

Info

Publication number
CN111444686A
Authority
CN
China
Prior art keywords
data
medical
labeling
model
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010181144.7A
Other languages
Chinese (zh)
Other versions
CN111444686B (en)
Inventor
李然
沈宏
李蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhongke Medical Technology Industrial Technology Research Institute Co Ltd
Original Assignee
Shanghai United Imaging Intelligent Healthcare Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai United Imaging Intelligent Healthcare Co Ltd
Priority to CN202010181144.7A
Publication of CN111444686A
Application granted
Publication of CN111444686B
Legal status: Active
Anticipated expiration


Abstract

The application relates to a medical data labeling method, a device, a storage medium and computer equipment. When medical data is labeled, the medical data is first encoded to obtain word vectors corresponding to the data with the lowest granularity. Adjacent word vectors corresponding to adjacent data with medical relevance are then combined, so that word vector combinations at different granularity task levels are obtained. The word vectors and the word vector combinations are labeled through a labeling model, so that the resulting medical attribute category labeling result contains labeling results for data of different granularities. The labeling result is therefore more comprehensive, which facilitates data mining and analysis of electronic medical records.

Description

Medical data labeling method, device, storage medium and computer equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a medical data labeling method, device, storage medium, and computer device.
Background
With the development of electronic technology, electronic medical records are increasingly used in hospitals. Unlike traditional paper medical records, electronic medical records are convenient to store and manage. Electronic medical records generally store important information related to clinical treatment, such as a patient's disease symptoms and diagnosis process, and are closely related to the patient's health. Data mining and analysis of electronic medical records has therefore received wide attention in recent years. Named entity recognition (that is, recognizing the medical attribute categories of data in an electronic medical record, such as disease sites, disease symptoms and treatment means) is an important basic task of natural language processing, and research on it is of great significance.
In the prior art, a network model is generally used for named entity recognition. A user trains an initial model with sample data to obtain a network model for named entity recognition, and the network model is then used to recognize and label new electronic medical records. However, in the prior art the model is trained only on the named entity recognition task. The resulting model can extract information only at the entity granularity of the samples and cannot effectively extract information at other granularities, such as character granularity, sentence granularity or text granularity. The labeling result is therefore incomplete, which hinders data mining and analysis of electronic medical records.
Disclosure of Invention
In view of the above, there is a need to provide a medical data annotation method, apparatus, storage medium and computer device that are helpful to improve the comprehensiveness of the annotation result.
A medical data annotation method, comprising:
acquiring medical data to be labeled and a pre-trained labeling model;
encoding the medical data to obtain a word vector corresponding to the data with the lowest granularity in the medical data, and combining adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities;
and performing data attribute category labeling on each word vector and each word vector combination through the labeling model to obtain a medical attribute category labeling result of the medical data.
A medical data annotation apparatus, comprising:
the acquisition module is used for acquiring medical data to be labeled and a pre-trained labeling model;
the encoding module is used for encoding the medical data to obtain a word vector corresponding to the data with the lowest granularity in the medical data, and performing combination processing on adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities;
and the labeling module is used for performing data attribute class labeling on each word vector and each word vector combination through the labeling model to obtain a medical attribute class labeling result of the medical data.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
In the above medical data labeling method, device, storage medium and computer equipment, medical data to be labeled and a pre-trained labeling model are acquired; the medical data is encoded to obtain word vectors corresponding to the data with the lowest granularity in the medical data, and adjacent word vectors corresponding to adjacent data with medical relevance are combined to obtain word vector combinations with different granularities; and data attribute category labeling is performed on each word vector and each word vector combination through the labeling model to obtain a medical attribute category labeling result of the medical data. Because adjacent word vectors corresponding to adjacent data with medical relevance are combined after the lowest-granularity word vectors are obtained, word vector combinations at different granularity task levels are available. Labeling both the word vectors and the word vector combinations through the labeling model yields a medical attribute category labeling result that covers data of different granularities, so the labeling result is more comprehensive and facilitates data mining and analysis of electronic medical records.
Drawings
FIG. 1 is a flow chart illustrating a method for annotating medical data in one embodiment;
FIG. 2 is a schematic flow chart illustrating a training process for a labeling model in one embodiment;
FIG. 3 is a flowchart illustrating model training performed with second data to obtain a preliminary training model according to an embodiment;
FIG. 4 is a flow chart illustrating the process of creating a labeling lexicon in one embodiment;
FIG. 5 is a flowchart illustrating, in one embodiment, adding corresponding medical attribute category labels to the data in the second data that matches the key medical data according to the labeling lexicon, to obtain a medical attribute category labeling result of the second data;
FIG. 6 is a flowchart illustrating, in one embodiment, model training of an initial model according to the second data and the medical attribute category labeling result of the second data, to obtain a preliminary training model;
FIG. 7 is a diagram of an example of training an annotation model in one embodiment;
FIG. 8 is a schematic diagram of the structure of a medical data labeling apparatus according to an embodiment;
FIG. 9 is a schematic structural diagram of a medical data labeling apparatus according to another embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a medical data annotation method is provided. The method is described as applied to a processor capable of performing medical data annotation, and mainly includes the following steps:
Step S100, acquiring medical data to be labeled and a pre-trained labeling model.
The medical data to be labeled may be, for example, an electronic medical record, which stores information such as the disease symptoms and the medical diagnosis process of the subject of the diagnosis. The labeling processing may be labeling of the medical attribute categories of the data in the electronic medical record, for example medical named entity recognition and labeling, where the medical named entities include body parts, treatment modes, examination means, abnormal symptoms, disease types and the like. The pre-trained labeling model is a model obtained by training an initial model to label medical attribute categories.
Step S200, encoding the medical data to obtain a word vector corresponding to the data with the lowest granularity in the medical data, and combining adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities.
After obtaining the medical data to be labeled, the processor may encode the medical data, that is, convert the medical data from text form into word vectors in another encoded form, for example numerical values. Adjacent word vectors that are correlated can be combined into corresponding word vector combinations. Word vector combinations of different granularities may differ in length and can be used to represent words, sentences, paragraphs and the like. Different word vectors and word vector combinations of different granularities may represent different categories of data or different objects.
For example, a piece of medical data reads as follows: the abdomen is soft, with no pressure pain and no rebound pain; the liver is not palpable under the right rib; the spleen is not palpable under the left rib; bowel sounds are normal; and there is no edema of the two lower limbs. In this medical data, "liver" is represented by a single word vector, whereas "left rib of spleen" and "bowel sound" are represented by word vector combinations. "Liver" and "left rib of spleen" belong to the same data category, namely body part, but denote different objects. "Left rib of spleen" and "bowel sound" belong to different data categories: the former denotes a body part and the latter an examination method. In addition, the whole piece of medical data may itself be regarded as a word vector combination, which may be used to represent a specific disease symptom.
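The encoding and combination step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the source does not specify the encoder or the combination rule, so a hash-based toy embedding stands in for a learned one, element-wise averaging stands in for the combination processing, and English tokens stand in for the lowest-granularity units. The names `unit_vector`, `combine` and `encode` are hypothetical.

```python
import hashlib

DIM = 4  # illustrative vector size (assumption)

def unit_vector(unit: str) -> list[float]:
    """Toy deterministic embedding for one lowest-granularity unit."""
    digest = hashlib.sha256(unit.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:DIM]]

def combine(vectors: list[list[float]]) -> list[float]:
    """Combine adjacent vectors by element-wise mean (one possible choice)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def encode(units, spans):
    """Return one vector per unit plus one combined vector per (start, end) span."""
    word_vectors = [unit_vector(unit) for unit in units]
    combinations = {
        (start, end): combine(word_vectors[start:end]) for start, end in spans
    }
    return word_vectors, combinations

units = ["liver", "left", "rib", "of", "spleen", "bowel", "sound"]
# Hypothetical spans of adjacent, medically related units:
# (1, 5) -> "left rib of spleen", (5, 7) -> "bowel sound", (0, 7) -> whole text.
spans = [(1, 5), (5, 7), (0, len(units))]
word_vectors, combinations = encode(units, spans)
```

Each span yields a vector of the same size as a single word vector, so one labeling model could in principle accept inputs of any granularity.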
Step S300, performing data attribute category labeling on each word vector and each word vector combination through the labeling model to obtain a medical attribute category labeling result of the medical data.
After the processor encodes the medical data to obtain the word vectors and the word vector combinations, the processor performs data attribute class labeling on the word vectors and the word vector combinations through a pre-trained labeling model, so as to obtain labeling results corresponding to the word vectors and the word vector combinations, and further obtain medical attribute class labeling results of the medical data. For example, for the data "heart," its corresponding label may be a body part; for data "chemotherapy," the corresponding label may be the treatment modality; for a data "magnetic resonance scan", its corresponding label may be an examination means; for data "mass," its corresponding label may be an abnormal symptom; for the data "lung cancer", the corresponding label can be the disease category.
In addition, word vectors and word vector combinations of different lengths may be used to represent data of different granularities at different task levels. The word vector is the lowest-level representation in the text representation: a single character can be represented by one word vector, on which the simplest classification and recognition tasks can be performed. A word vector combination may be a higher-level representation of text with specific properties, such as a medical entity, for example "liver" or "head", in which case the word vector combination represents a body part among the medical entities; a single character may also be encoded as an entity. A word vector combination may also be a longer sentence or paragraph, or even a complete medical diagnosis report.
For example, the following is an example of an electronic medical record of gastric ulcer: more than 1 month ago, without obvious cause, the patient developed paroxysmal distending pain in the upper abdomen, moderate in degree and tolerable, with hunger pain, night pain, no radiation pain, abdominal distension, belch, acid regurgitation, heartburn, no nausea, vomiting, hematemesis, no black stool, diarrhea, no lumbago, hematuria, no dizziness, headache, no chilly sensation, fever, no palpitation, chest distress, no chest pain, cough and the like. The patient took stomach medicine orally outside the hospital (details unknown) without obvious relief of the stomachache, and has now visited our hospital; after gastroscopy, the patient was admitted with "gastric ulcer". For this electronic medical record, encoding the whole passage yields a higher-level, coarser-grained word vector combination that represents the more abstract meaning of the text, so that this passage describing gastric ulcer symptoms can serve as a sample of the category "gastric ulcer" in an electronic medical record recognition task.
This embodiment provides a medical data labeling method. When medical data is labeled, after word vectors corresponding to the data with the lowest granularity are obtained by encoding the medical data, adjacent word vectors corresponding to adjacent data with medical relevance are further combined, so that word vector combinations at different granularity task levels are obtained. The word vectors and the word vector combinations are then labeled through the labeling model, so that the resulting medical attribute category labeling result contains labeling results for data of different granularities. The labeling result is thus more comprehensive, which facilitates data mining and analysis of electronic medical records.
In one embodiment, the training steps of the annotation model used by the processor are explained. As shown in FIG. 2, the training process of the annotation model includes steps S120 to S160.
Step S120, selecting a preset number of second data from the first data, and performing model training with the second data to obtain a preliminary training model;
step S140, performing data processing on the remaining first data through the preliminary training model, and selecting third data meeting preset requirements from the remaining first data based on a data processing result;
step S160, performing model optimization processing on the preliminary training model through the third data to obtain the labeling model.
The first data is unlabeled data, and the number of first data is larger than the number of second data. The preset number can be determined according to actual conditions, provided that model training can be performed with the preset number of second data and that the resulting preliminary training model has a certain data labeling capability.
After the preliminary training model is obtained, other first data can be labeled through the preliminary training model, and a corresponding labeling result is obtained. The accuracy of the labeling result obtained by the preliminary training model can reflect the 'quality' of the training effect of the preliminary training model to a certain extent, when the accuracy meets the corresponding requirement, the preliminary training model can be considered to be capable of accurately labeling data, and the model training effect is good; when the accuracy rate does not meet the corresponding requirement, the preliminary training model can be considered to be incapable of accurately carrying out data annotation, and the model training effect is poor. Therefore, in order to ensure the training effect of the preliminary training model, the embodiment further includes a processing step of optimizing the preliminary training model, and the optimized model is used as the labeling model, so that the accuracy of the output result of the labeling model can be improved.
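The flow of steps S120 to S160 can be sketched as a small driver function. This is a minimal sketch under stated assumptions, not the training procedure of the application: the callables `label_fn`, `train_fn`, `predict_fn` and `select_fn` are hypothetical placeholders, random selection of the second data is an assumption, and the toy "model" is a dictionary that memorises the records it has seen.

```python
import random

def train_annotation_model(first_data, preset_number, label_fn, train_fn,
                           predict_fn, select_fn, seed=0):
    """Run S120 (preliminary training), S140 (screening) and S160 (optimization)."""
    indices = list(range(len(first_data)))
    random.Random(seed).shuffle(indices)
    second = [first_data[i] for i in indices[:preset_number]]
    remaining = [first_data[i] for i in indices[preset_number:]]

    # S120: label the second data and train the preliminary model.
    model = train_fn(None, second, [label_fn(x) for x in second])
    # S140: process the remaining first data and keep records meeting the requirement.
    third = [x for x in remaining if select_fn(predict_fn(model, x))]
    # S160: optimize the preliminary model with the third data.
    model = train_fn(model, third, [label_fn(x) for x in third])
    return model, second, third

# Toy stand-ins (all hypothetical) so the sketch runs end to end.
def toy_label(text):
    return "symptom" if "pain" in text else "other"

def toy_train(model, texts, labels):
    updated = dict(model or {})
    updated.update(zip(texts, labels))
    return updated

def toy_predict(model, text):
    # Confident only about memorised records, undecided otherwise.
    return {model[text]: 1.0} if text in model else {"symptom": 0.5, "other": 0.5}

def toy_select(probabilities):
    return max(probabilities.values()) < 0.9

first_data = ["chest pain", "normal bowel sound", "abdominal pain", "no edema"]
model, second, third = train_annotation_model(
    first_data, 2, toy_label, toy_train, toy_predict, toy_select)
```

Passing the steps in as callables keeps the control flow separate from the choice of model, labeling source and screening rule, each of which the embodiments describe independently.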
In one embodiment, the second data selected from the first data does not carry corresponding labels. As shown in FIG. 3, in this embodiment, performing model training with the second data to obtain a preliminary training model includes steps S122 to S126.
Step S122, acquiring a preset labeling lexicon, wherein the labeling lexicon comprises key medical data to be labeled and medical attribute category labels corresponding to the key medical data;
step S124, adding corresponding medical attribute category labels to the data in the second data that matches the key medical data according to the labeling lexicon, to obtain a medical attribute category labeling result of the second data;
step S126, performing model training on the initial model with the second data and the medical attribute category labeling result of the second data to obtain the preliminary training model.
After selecting the second data, the processor may label the selected second data based on the correspondence between the key medical data and the medical attribute category labels in the preset labeling lexicon, to obtain the medical attribute category labeling result of the second data. The initial model is then trained according to the second data and the corresponding medical attribute category labeling result. Any existing model training method may be used, which is not limited herein.
During model training, only a preset number of second data is selected and labeled based on the labeling lexicon. Reducing the amount of data used for model training effectively reduces the labeling workload in the training stage and improves model training efficiency.
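Lexicon-based labeling as in step S124 can be sketched as dictionary matching over the text. This is a minimal sketch, assuming a longest-match-first scan; the source does not specify the matching strategy, and the function name `label_with_lexicon` and the sample sentence are hypothetical. The lexicon entries reuse the term and category examples given in the description.

```python
def label_with_lexicon(text, lexicon):
    """Return (start, end, term, category) for each lexicon term found in text."""
    results = []
    # Try longer terms first so "lung cancer" wins over "lung" at the same position.
    terms = sorted(lexicon, key=len, reverse=True)
    position = 0
    while position < len(text):
        for term in terms:
            if text.startswith(term, position):
                results.append(
                    (position, position + len(term), term, lexicon[term]))
                position += len(term)
                break
        else:
            position += 1  # no term starts here; move on
    return results

lexicon = {
    "heart": "body part",
    "lung": "body part",
    "chemotherapy": "treatment modality",
    "magnetic resonance scan": "examination means",
    "mass": "abnormal symptom",
    "lung cancer": "disease category",
}
text = "magnetic resonance scan shows a mass; lung cancer suspected"
annotations = label_with_lexicon(text, lexicon)
```

Plain substring matching ignores word boundaries, so a term such as "mass" would also match inside a longer word; a practical matcher would likely need a boundary check, which is one reason the checking step described for the labeling result is useful.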
In an embodiment, as shown in fig. 4, before performing model training by using the second data to obtain a preliminary training model, the method further includes:
step S121a, acquiring a labeling task corresponding to the initial model;
step S121b, determining medical attribute categories of the medical data to be labeled based on the labeling task, and determining key medical data corresponding to each medical attribute category;
step S121c, establishing the labeling lexicon according to the medical attribute categories corresponding to the labeling task and the key medical data.
Specifically, the labeling task may be defined by the type of data to be labeled, for example labeling disease types or labeling diagnosis schemes; the labeling task may also be defined for different body parts, for example labeling data related to the lung or labeling data related to the heart, which is not limited herein. After the labeling task is determined, the corresponding medical attribute categories and the corresponding key medical data are determined, and the corresponding labeling lexicon is then established. Establishing the labeling lexicon makes category labeling of the data convenient and accurate.
Optionally, after the key medical data is determined, it may be expanded based on the existing key medical data. For example, the existing medical symptom "right upper abdominal pain" may be expanded to different pain locations, such as "left upper abdominal pain", "left lower abdominal pain" and "right lower abdominal pain". Such data expansion enlarges the labeling lexicon.
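The expansion of pain locations can be sketched as a Cartesian product over location words. This is a minimal sketch: the function name `expand_pain_terms`, the template of side, level and suffix, and the category "abnormal symptom" assigned to the new entries are assumptions for illustration.

```python
from itertools import product

def expand_pain_terms(sides, levels, suffix="abdominal pain"):
    """Generate one term per combination of side and level."""
    return [f"{side} {level} {suffix}" for side, level in product(sides, levels)]

lexicon = {"right upper abdominal pain": "abnormal symptom"}
for term in expand_pain_terms(["left", "right"], ["upper", "lower"]):
    # Keep an existing entry if present; otherwise add the expanded term.
    lexicon.setdefault(term, "abnormal symptom")
```

Automatically generated terms may include combinations that are not clinically meaningful, so reviewing them before use seems advisable.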
In this embodiment, the medical attribute categories of the medical data to be labeled and the key medical data corresponding to each medical attribute category are determined based on the labeling task, and the labeling lexicon is established accordingly. The labeling lexicon can be used to label the selected second data, so that labeled data for training the initial model is obtained quickly. During model training, the labeled second data obtained from the labeling lexicon is used as training data for the initial model, so that the trained model acquires data labeling capability.
In one embodiment, the data processing result includes the medical attribute categories corresponding to the remaining first data and the corresponding confidence levels. For example, for a given piece of data, the data processing result includes the probabilities that the data belongs to categories A, B and C, which are a%, b% and c% respectively.
In this embodiment, in step S140, selecting third data meeting the preset requirement from the remaining first data based on the data processing result includes: and determining the uncertainty of the remaining first data based on the medical attribute category and the confidence coefficient, and screening the data with the uncertainty reaching a preset index from the remaining first data to serve as third data.
Specifically, the processor first determines the uncertainty of the data processing result and screens out the third data whose uncertainty reaches a preset index; the third data is used to optimize the preliminary training model. The uncertainty refers to how uncertain the preliminary training model is about the medical attribute category of the data. The lower the uncertainty, the more accurate the preliminary training model's processing result for the medical data, and the smaller the contribution of that medical data to improving model performance; such medical data can be regarded as low-value data for model optimization. The higher the uncertainty, the less accurate the model's processing result for the medical data, and the greater the contribution of that medical data to improving model performance; such medical data can be regarded as high-value data for model optimization.
For example, assume that the medical attribute categories that the preliminary training model can label include four categories A, B, C and D (or another number of categories). For data a, the labeling result of the preliminary training model is that the probabilities of data a belonging to categories A, B, C, D and the background are 95%, 2%, 1%, 1% and 1%, respectively. Because the probability (95%) of data a belonging to category A is much greater than the probabilities (2%, 1%, 1% and 1%) of it belonging to the other categories, it can be determined that the labeling result of the preliminary training model for data a is relatively accurate and has low uncertainty. Data a therefore contributes little to improving the performance of the preliminary training model and is low-value data for model optimization.
For data b, the labeling result of the preliminary training model is that the probabilities of data b belonging to categories A, B, C, D and the background are 44%, 46%, 5%, 3% and 2%, respectively. Because the probability (44%) of data b belonging to category A and the probability (46%) of it belonging to category B are close, the preliminary training model can hardly determine whether data b belongs to category A or category B. In other words, the preliminary training model labels data b inaccurately and with high uncertainty. Data b therefore contributes more to improving the performance of the preliminary training model and is high-value data for model optimization.
Optionally, when screening the third data whose uncertainty reaches the preset index, the preset index may be defined on the difference between the highest probability and the second-highest probability in the model labeling result. For example, for data c, the labeling result of the preliminary training model is that the probabilities of data c belonging to categories A, B, C, D and the background are P_A, P_B, P_C, P_D and P_bg, respectively, where P_A > P_B > P_C > P_D > P_bg, and the preset index is set to m. When P_A - P_B < m, data c can be regarded as third data whose uncertainty reaches the preset index.
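The margin criterion can be sketched with the probabilities from the two examples above. This is a minimal sketch: the function names `margin` and `select_third_data` are hypothetical, the threshold m = 0.10 is an illustrative assumption, and the 1% background share for data a is assumed so that its probabilities sum to 1.

```python
def margin(probabilities):
    """Difference between the highest and the second-highest probability."""
    top, runner_up = sorted(probabilities.values(), reverse=True)[:2]
    return top - runner_up

def select_third_data(results, m):
    """Keep the records whose margin is below the preset index m."""
    return [name for name, probabilities in results.items()
            if margin(probabilities) < m]

results = {
    "a": {"A": 0.95, "B": 0.02, "C": 0.01, "D": 0.01, "background": 0.01},
    "b": {"A": 0.44, "B": 0.46, "C": 0.05, "D": 0.03, "background": 0.02},
}
third_data = select_third_data(results, m=0.10)
```

With these numbers the margin is about 0.93 for data a and about 0.02 for data b, so only data b is selected. A smaller m selects fewer, more ambiguous records.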
It is to be understood that, when screening the third data, other conditions may also be used as the preset index, which is not limited herein.
The third data with higher model optimization value is determined based on the uncertainty of the data annotation result, so that the model can be optimized, and meanwhile, the model optimization efficiency can be improved.
In one embodiment, as shown in FIG. 5, step S124 of adding corresponding medical attribute category labels to the data in the second data that matches the key medical data according to the labeling lexicon, to obtain the medical attribute category labeling result of the second data, includes steps S124a to S124d.
Step S124a, adding corresponding medical attribute category labels to the data matched with the key medical data in the second data by using a labeling method to obtain a labeling result of the second data;
step S124b, the annotation result is pushed to a third party checking platform;
step S124c, receiving the checking information of the annotation result fed back by the third party checking platform;
step S124d, obtaining a medical attribute class labeling result of the second data based on the labeling result and the checking information.
The processor determines, based on the labeling lexicon, the data in the second data that matches the key medical data, and adds the corresponding medical attribute category labels to obtain the labeling result of the second data. After the labeling result is obtained, the processor directly or indirectly sends the second data and the corresponding labeling result to the third-party checking platform, and directly or indirectly receives the checking information fed back by the third-party checking platform. The processor then obtains the final medical attribute category labeling result of the second data based on the labeling result and the checking information. Third-party checking of the labeling result further improves the accuracy of the medical attribute category labeling result and thus safeguards the accuracy of model training.
Optionally, after the checking information is obtained, the method may further include updating the labeling lexicon according to the corrected labeling result, that is, adding new entries to the labeling lexicon or updating existing entries, so that the labeling lexicon becomes richer and more accurate. The richer the entries of the labeling lexicon, the more complete and accurate the labeling results obtained from it, which improves the efficiency and accuracy of model training.
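Merging the checking information into the labeling result and feeding corrections back into the lexicon can be sketched as two dictionary operations. This is a minimal sketch: the source does not define the format of the checking information, so a mapping from term to corrected category (or `None` to reject a label) is assumed, and the names `apply_review` and `update_lexicon` are hypothetical.

```python
def apply_review(annotations, corrections):
    """Merge reviewer corrections into the automatic labeling result."""
    final = dict(annotations)
    for term, category in corrections.items():
        if category is None:
            final.pop(term, None)   # reviewer rejected this label
        else:
            final[term] = category  # reviewer corrected or added a label
    return final

def update_lexicon(lexicon, final_annotations):
    """Add new entries and overwrite existing ones with the checked labels."""
    updated = dict(lexicon)
    updated.update(final_annotations)
    return updated

lexicon = {"lung": "body part", "lung nodule": "body part"}
annotations = {"lung nodule": "body part", "chemotherapy": "treatment modality"}
corrections = {"lung nodule": "abnormal type", "gastroscopy": "examination means"}

final_annotations = apply_review(annotations, corrections)
lexicon = update_lexicon(lexicon, final_annotations)
```

Both functions return new dictionaries instead of modifying their inputs, which keeps the unchecked labeling result available for comparison.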
In one embodiment, the medical attribute category labeling result of the second data comprises medical attribute category labels for data of different granularities. As shown in FIG. 6, step S126 of performing model training on the initial model according to the second data and the medical attribute category labeling result of the second data to obtain the preliminary training model includes steps S126a to S126c.
Step S126a, encoding the second data to obtain a word vector corresponding to the data with the lowest granularity in the second data, and combining adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities;
step S126b, determining medical attribute category labels corresponding to single word vectors and word vector combinations based on the medical attribute category labels of the data with different granularities;
step S126c, model training is carried out on the initial model through each single word vector and the corresponding medical attribute category label, each word vector combination and the corresponding medical attribute category label, and a preliminary training model is obtained.
The different granularities may specifically refer to different data lengths, such as a single character, a word, a sentence, a paragraph, and the like, and the medical attribute categories corresponding to the data of different granularities may be different, for example, the single character "lung" belongs to a body part, the word "lung nodule" including "lung" belongs to an abnormal type, and the "lung cancer" belongs to a disease type, and the like.
Specifically, by encoding the second data, the data in text form may be converted into word vectors in another encoded form, for example numerical values. Adjacent word vectors that are correlated can be combined into corresponding word vector combinations, which can represent words, phrases, long sentences, paragraphs and the like with specific medical meanings at different granularity levels, and which respectively form the input data (second data) of the tasks at the corresponding granularity levels. After the second data is expressed as word vectors or word vector combinations, the medical attribute category labels corresponding to the word vectors and to the word vector combinations are determined, and model training is then performed based on them to obtain the preliminary training model. Encoding the sample medical data into text representations of different granularities improves the model's ability to recognize text at different levels and thus its recognition performance.
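How labels of different granularities can be paired with single units and with combinations (step S126b) is sketched below. This is a minimal sketch: the span-to-label mapping, the function name `build_training_pairs` and the use of English tokens as the lowest-granularity units are assumptions; the categories follow the "lung" and "lung nodule" examples in the description.

```python
def build_training_pairs(units, span_labels):
    """Pair each labeled span with its text, its granularity and its category."""
    pairs = []
    for (start, end), category in sorted(span_labels.items()):
        granularity = "single" if end - start == 1 else "combination"
        pairs.append((" ".join(units[start:end]), granularity, category))
    return pairs

units = ["lung", "nodule"]
# Hypothetical labels: the span (0, 1) is the single unit "lung",
# and the span (0, 2) is the combination "lung nodule".
span_labels = {(0, 1): "body part", (0, 2): "abnormal type"}
training_pairs = build_training_pairs(units, span_labels)
```

The same unit can thus contribute to several training samples with different categories, depending on the granularity at which it is read.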
In an embodiment, in step S160, performing model optimization processing on the preliminary training model through the third data to obtain a labeled model, including: performing medical attribute category labeling on the third data according to the labeled word bank to obtain a medical attribute category labeling result corresponding to the third data; and training and optimizing the preliminary training model through the third data and the corresponding medical attribute class marking result to obtain a marking model.
Specifically, after the third data is obtained through screening, the third data may be labeled based on the labeled lexicon, so as to obtain a corresponding labeling result. Optionally, the third data may also be labeled by a third party, so that the accuracy may be further improved. And after the medical attribute class labeling result corresponding to the third data is obtained, the initial training model can be trained and optimized, so that the output result of the labeling model obtained through optimization is higher in accuracy.
In one embodiment, FIG. 7 shows an example of training a labeling model.
First, a first quantity of unlabeled first data D is obtained. A second quantity (smaller than the first quantity) of second data d is then selected from the first data D, and medical attribute category labeling is performed on the second data d according to a preset labeling lexicon.
Next, the initial model is trained with the second data d and the corresponding labels to obtain a preliminary training model. The remaining first data (the other unlabeled data, namely D-d) is labeled with the preliminary training model to obtain labeling results for the remaining first data. The labeling results include the corresponding categories and confidence levels (namely probabilities), from which the corresponding uncertainty is obtained.
Then, third data d' whose uncertainty reaches a preset index is screened out, and the third data d' is labeled according to the labeling lexicon to obtain the labeling result corresponding to the third data d'.
Finally, the preliminary training model is trained and optimized through the third data d' and the corresponding labeling result to obtain the labeling model.
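The training flow of FIG. 7 can be sketched as a single round of uncertainty-based selection. The sketch below rests on several assumptions that are not in the text: a toy one-dimensional nearest-centroid classifier (`CentroidModel`) stands in for the labeling model, uncertainty is taken as one minus the highest class probability, the hypothetical `oracle` function stands in for labeling with the lexicon, and "optimizing" is implemented as refitting on d together with d'.

```python
import math


class CentroidModel:
    """Toy stand-in for the labeling model: 1-D nearest-centroid classifier."""

    def __init__(self):
        self.centroids = {}

    def fit(self, xs, ys):
        sums, counts = {}, {}
        for x, y in zip(xs, ys):
            sums[y] = sums.get(y, 0.0) + x
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: sums[y] / counts[y] for y in sums}
        return self

    def predict_proba(self, x):
        # Normalised exp(-distance) to each class centroid.
        scores = {y: math.exp(-abs(x - c)) for y, c in self.centroids.items()}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}


def uncertainty(probs):
    """Least-confidence uncertainty (an assumed definition)."""
    return 1.0 - max(probs.values())


def train_labeling_model(D, d, oracle, threshold):
    """One round: train on d, score D-d, select d', then refit on d and d'."""
    model = CentroidModel().fit(d, [oracle(x) for x in d])  # preliminary model
    rest = [x for x in D if x not in d]                     # D - d
    d_prime = [x for x in rest
               if uncertainty(model.predict_proba(x)) >= threshold]
    xs = d + d_prime
    model.fit(xs, [oracle(x) for x in xs])                  # optimized model
    return model, d_prime


D = [0.0, 0.1, 0.5, 0.6, 0.9, 1.0]
model, d_prime = train_labeling_model(
    D, d=[0.0, 1.0],
    oracle=lambda x: "low" if x < 0.55 else "high",
    threshold=0.4,
)
print(d_prime, model.centroids)
```

Only the samples near the decision boundary are selected as d', which is the point of the procedure: labeling effort is spent where the preliminary model is least sure.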
It should be understood that although the steps in the flowcharts of the foregoing embodiments are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in each flowchart may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, a medical data annotation device is provided, which mainly comprises the following modules:
an obtaining module 100, configured to obtain medical data to be labeled and a labeling model trained in advance;
an encoding module 200, configured to perform encoding processing on the medical data to obtain a word vector corresponding to data with the lowest granularity in the medical data, and perform combination processing on adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities;
and a labeling module 300, configured to perform data attribute category labeling on each word vector and each word vector combination through the labeling model to obtain a medical attribute category labeling result of the medical data.
This embodiment provides a medical data labeling device. When labeling medical data, after encoding the medical data to obtain the word vectors corresponding to the data with the lowest granularity, the device further combines the adjacent word vectors corresponding to adjacent data with medical relevance, thereby obtaining word vector combinations at different granularity task levels. The word vectors and the word vector combinations are then labeled through the labeling model, so that the resulting medical attribute category labeling result contains labeling results for data of different granularities. The labeling result is therefore more comprehensive, which facilitates data mining and analysis of electronic medical records.
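The three-module structure of the device can be sketched as a small composition of callables. This is a structural illustration only, under stated assumptions: the class name `MedicalDataLabelingDevice`, the stub `toy_encode` (which emits spans rather than real word vectors), and the lookup table `TOY_LABELS` (which stands in for a trained labeling model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[str, ...]


@dataclass
class MedicalDataLabelingDevice:
    acquire: Callable[[], List[str]]           # role of obtaining module 100
    encode: Callable[[List[str]], List[Span]]  # role of encoding module 200
    label: Callable[[Span], str]               # role of labeling module 300

    def run(self) -> List[Tuple[Span, str]]:
        tokens = self.acquire()
        spans = self.encode(tokens)
        return [(span, self.label(span)) for span in spans]


def toy_encode(tokens: List[str]) -> List[Span]:
    """Stub encoder: single units plus adjacent pairs as coarser spans."""
    unigrams = [(t,) for t in tokens]
    bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return unigrams + bigrams


# Hypothetical lookup table standing in for a trained labeling model.
TOY_LABELS = {("lung",): "body part", ("lung", "nodule"): "abnormal type"}

device = MedicalDataLabelingDevice(
    acquire=lambda: ["lung", "nodule"],
    encode=toy_encode,
    label=lambda span: TOY_LABELS.get(span, "other"),
)
result = device.run()
print(result)
```

Keeping the modules as separate callables mirrors the statement that each module may be implemented in software, hardware, or a combination, since any one of them can be swapped without touching the others.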
In one embodiment, as shown in FIG. 9, the medical data labeling apparatus further includes a model training module 400, configured to select a preset number of second data from the first data, and perform model training on the second data to obtain a preliminary training model; perform data processing on the remaining first data through the preliminary training model, and select third data meeting preset requirements from the remaining first data based on the data processing result; and perform model optimization processing on the preliminary training model through the third data to obtain a labeling model.
In one embodiment, the model training module 400 is further configured to obtain a preset labeling lexicon, where the labeling lexicon includes key medical data to be labeled and the medical attribute category labels corresponding to the key medical data; add, according to the labeling lexicon, the corresponding medical attribute category labels to the data in the second data that matches the key medical data, to obtain a medical attribute category labeling result of the second data; and perform model training on the initial model through the second data and the medical attribute category labeling result of the second data to obtain a preliminary training model.
In one embodiment, the model training module 400 is further configured to obtain a labeling task corresponding to the initial model; determine the medical attribute categories of the medical data to be labeled based on the labeling task, and determine the key medical data corresponding to the medical attribute categories; and establish the labeling lexicon according to the medical attribute categories corresponding to the labeling task and the key medical data.
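Lexicon-based labeling of the second data can be sketched as plain substring matching. The lexicon entries below reuse the categories mentioned in the description ("lung" as a body part, "lung nodule" as an abnormal type, "lung cancer" as a disease type); the function name `label_with_lexicon`, the sample sentence, and the choice of exact substring matching are assumptions made for illustration.

```python
# Hypothetical labeling lexicon: key medical data -> medical attribute category.
LEXICON = {
    "lung": "body part",
    "lung nodule": "abnormal type",
    "lung cancer": "disease type",
}


def label_with_lexicon(text, lexicon):
    """Return sorted (start, end, term, category) tuples for every
    occurrence of a lexicon term in the text."""
    results = []
    for term, category in lexicon.items():
        start = text.find(term)
        while start != -1:
            results.append((start, start + len(term), term, category))
            start = text.find(term, start + 1)
    return sorted(results)


text = "CT shows a lung nodule in the right lung"
labels = label_with_lexicon(text, LEXICON)
for start, end, term, category in labels:
    print(f"{start}-{end}: {term!r} -> {category}")
```

Overlapping matches are kept on purpose: the same position can carry a fine-grained label ("lung") and a coarser one ("lung nodule"), which is what the multi-granularity labeling relies on.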
In one embodiment, the model training module 400 is further configured to determine the uncertainty of the remaining first data based on the medical attribute category and the confidence level, and screen out, from the remaining first data, the data whose uncertainty reaches a preset index as the third data.
In one embodiment, the model training module 400 is further configured to add, by using a labeling method, the corresponding medical attribute category labels to the data in the second data that matches the key medical data, to obtain a labeling result of the second data; push the labeling result to a third-party checking platform; receive the checking information on the labeling result fed back by the third-party checking platform; and obtain the medical attribute category labeling result of the second data based on the labeling result and the checking information.
In one embodiment, the model training module 400 is further configured to perform encoding processing on the second data to obtain a word vector corresponding to data with the lowest granularity in the second data, and perform combination processing on adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities; determine the medical attribute category labels corresponding to the single word vectors and the word vector combinations based on the medical attribute category labels of the data with different granularities; and perform model training on the initial model through each single word vector and its corresponding medical attribute category label, and each word vector combination and its corresponding medical attribute category label, to obtain a preliminary training model.
For specific limitations of the medical data labeling apparatus, reference may be made to the above limitations of the medical data labeling method, which are not repeated here. Each module in the medical data labeling apparatus can be implemented wholly or partly by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring medical data to be labeled and a pre-trained labeling model; coding medical data to obtain word vectors corresponding to data with the lowest granularity in the medical data, and combining adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities; and carrying out data attribute category labeling on each word vector and each word vector combination through a labeling model to obtain a medical attribute category labeling result of the medical data.
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment. The computer device may specifically be a terminal (or server). As shown in fig. 10, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the medical data annotation method. The internal memory may also have stored therein a computer program that, when executed by the processor, causes the processor to perform a method of medical data annotation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in FIG. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring medical data to be labeled and a pre-trained labeling model; coding medical data to obtain word vectors corresponding to data with the lowest granularity in the medical data, and combining adjacent word vectors corresponding to adjacent data with medical relevance to obtain word vector combinations with different granularities; and carrying out data attribute category labeling on each word vector and each word vector combination through a labeling model to obtain a medical attribute category labeling result of the medical data.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

Application number: CN202010181144.7A
Title: Medical data labeling method, medical data labeling device, storage medium and computer equipment
Priority date: 2020-03-16
Filing date: 2020-03-16
Status: Active
Granted publication: CN111444686B (en)

Publications (2)

CN111444686A (en), published 2020-07-24
CN111444686B (en), published 2023-07-25

Family

ID=71648735

Country Status (1)

CN: CN111444686B (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right
    Effective date of registration: 2022-01-07
    Address after: 430206 22/F, Building C3, Future Science and Technology Building, 999 Gaoxin Avenue, Donghu New Technology Development Zone, Wuhan, Hubei Province
    Applicant after: Wuhan Zhongke Medical Technology Industrial Technology Research Institute Co.,Ltd.
    Address before: Room 3674, 3/F, 2879 Longteng Avenue, Xuhui District, Shanghai, 200232
    Applicant before: SHANGHAI UNITED IMAGING INTELLIGENT MEDICAL TECHNOLOGY Co.,Ltd.
GR01: Patent grant
