Disclosure of Invention
In view of the above problems, the invention provides a new technical solution that mines words having an association relationship in medical texts more accurately and comprehensively, so that a medical word bank constructed from those words is more accurate and comprehensive.
In view of the above, an aspect of the present invention provides a medical information processing method, including: performing word segmentation on a plurality of medical texts, and clustering the plurality of medical texts; determining the association degree of every two medical texts according to the words of every two medical texts in the medical texts of the same category; judging whether words of any two medical texts in the medical texts of the same category have an association relation or not according to the association degree of every two medical texts; and when the judgment result is yes, performing association storage on the words with the association relation.
In this technical solution, the association degree of every two medical texts is determined according to the words in every two medical texts of the same category, whether an association relationship exists between the words of any two medical texts of the same category is judged according to those association degrees, and the words having an association relationship are stored in association, for example in a medical word bank, so as to build a more complete medical word bank. For example, the words in medical text A are "cold" and "fever", the words in medical text B are "fever" (a near-synonymous term) and "cough", and the words in medical text C are "cough" and "chill". A and B contain similar words (the two terms for fever), so the association degree between A and B is 30%; B and C contain the same word, "cough", so the association degree between B and C is 50%. A and C have no identical or similar words, but because A is associated with B and B is associated with C, it can be determined that A and C are associated, that is, an association relationship exists between the words of A and C. The invention can therefore also mine words with an implicit association relationship, so that associated words in medical texts are mined more accurately and comprehensively. Furthermore, a search engine for medical information can be built on the associated words, or automatic analysis of medical text information can be realized, which makes it convenient for outpatient doctors and patients to look up diseases and symptoms.
Preferably, the plurality of medical texts may be electronic medical records from the medical system of a hospital, or may be obtained from professional medical websites by a crawler program. Because the volume of medical texts is large, they can be stored in a distributed file system.
In the above technical solution, preferably, the step of performing association storage on the words with association relationship further includes: determining the association degree of words in any two medical texts according to the association degree of any two medical texts; and storing the association degree of the words in any two medical texts.
In this technical solution, the association degree of the words in any two medical texts is determined according to the association degree of those two medical texts. Specifically, the association degree of the two medical texts can be used directly as the association degree of their words, or the association degree of the words can be calculated by a preset algorithm, so that the strength of the relationship between the words is reflected more accurately and intuitively. For example, the words in medical text A are "cold" and "fever", the words in medical text C are "cough" and "chill", and the association degree between A and C is 10%; the association degree between "cold" and "cough" is then 10%.
In any one of the above technical solutions, preferably, the step of segmenting the plurality of medical texts specifically includes: performing word segmentation on the medical texts according to a dictionary and the parts of speech of the words in the medical texts.
In this technical solution, the medical texts can be segmented according to the words in a dictionary (preferably a medical dictionary) and the parts of speech of the words. Specifically, the medical texts are segmented according to the words in the dictionary; if a word in a medical text does not exist in the dictionary, its part of speech is used to judge whether it is associated with the preceding and following words and whether they need to be combined into a new word. This effectively avoids wrongly cut and omitted words and helps ensure that segmentation is accurate and comprehensive.
In any one of the above technical solutions, preferably, the step of clustering the plurality of medical texts specifically includes: clustering the plurality of medical texts according to international disease classification and K-means algorithm.
In this technical solution, the plurality of medical texts can be clustered according to the International Classification of Diseases (ICD) and a K-means algorithm. Because medical texts clustered into the same category concern the same disease, their words are more likely to be associated; further processing is then carried out on medical texts of the same category, which helps ensure processing speed.
In any one of the above technical solutions, preferably, the step of performing association storage on the words having the association relationship specifically includes: storing the words having the association relationship according to the attributes of those words.
In this technical solution, the words are stored according to the attributes of the words having an association relationship. For example, the attributes of a word include: body parts (such as head and limbs), predicates (such as pain and strain), diseases (such as fever and heart disease), medicines (such as Gregorian tablets and glucose injection), treatment means (such as drip and anesthesia), and ignored words (such as "home" and "patient") that do not contribute to information extraction. Storing by attribute makes the storage of associated words more orderly.
Another aspect of the present invention provides a medical information processing apparatus including: the processing unit is used for segmenting a plurality of medical texts and clustering the medical texts; the first determination unit is used for determining the association degree of every two medical texts according to the words of every two medical texts in the medical texts of the same category; the judging unit is used for judging whether words of any two medical texts in the medical texts of the same category have an association relation or not according to the association degree of every two medical texts; and the storage unit is used for associating and storing the words with the association relation when the judgment result is yes.
In this technical solution, the association degree of every two medical texts is determined according to the words in every two medical texts of the same category, whether an association relationship exists between the words of any two medical texts of the same category is judged according to those association degrees, and the words having an association relationship are stored in association, for example in a medical word bank, so as to build a more complete medical word bank. For example, the words in medical text A are "cold" and "fever", the words in medical text B are "fever" (a near-synonymous term) and "cough", and the words in medical text C are "cough" and "chill". A and B contain similar words (the two terms for fever), so the association degree between A and B is 30%; B and C contain the same word, "cough", so the association degree between B and C is 50%. A and C have no identical or similar words, but because A is associated with B and B is associated with C, it can be determined that A and C are associated, that is, an association relationship exists between the words of A and C. The invention can therefore also mine words with an implicit association relationship, so that associated words in medical texts are mined more accurately and comprehensively. Furthermore, a search engine for medical information can be built on the associated words, or automatic analysis of medical text information can be realized, which makes it convenient for outpatient doctors and patients to look up diseases and symptoms.
Preferably, the plurality of medical texts may be electronic medical records from the medical system of a hospital, or may be obtained from professional medical websites by a crawler program. Because the volume of medical texts is large, they can be stored in a distributed file system.
In the above technical solution, preferably, the storage unit includes: the second determining unit is used for determining the association degree of the words in any two medical texts according to the association degree of any two medical texts; the storage unit is specifically configured to store the association degrees of the words in any two medical texts.
In this technical solution, the association degree of the words in any two medical texts is determined according to the association degree of those two medical texts. Specifically, the association degree of the two medical texts can be used directly as the association degree of their words, or the association degree of the words can be calculated by a preset algorithm, so that the strength of the relationship between the words is reflected more accurately and intuitively. For example, the words in medical text A are "cold" and "fever", the words in medical text C are "cough" and "chill", and the association degree between A and C is 10%; the association degree between "cold" and "cough" is then 10%.
In any one of the above technical solutions, preferably, the processing unit includes: a word segmentation unit, used for segmenting the medical texts according to a dictionary and the parts of speech of the words in the medical texts.
In this technical solution, the medical texts can be segmented according to the words in a dictionary (preferably a medical dictionary) and the parts of speech of the words. Specifically, the medical texts are segmented according to the words in the dictionary; if a word in a medical text does not exist in the dictionary, its part of speech is used to judge whether it is associated with the preceding and following words and whether they need to be combined into a new word. This effectively avoids wrongly cut and omitted words and helps ensure that segmentation is accurate and comprehensive.
In any one of the above technical solutions, preferably, the processing unit includes: a clustering unit, used for clustering the plurality of medical texts according to the International Classification of Diseases and the K-means algorithm.
In this technical solution, the plurality of medical texts can be clustered according to the International Classification of Diseases (ICD) and a K-means algorithm. Because medical texts clustered into the same category concern the same disease, their words are more likely to be associated; further processing is then carried out on medical texts of the same category, which helps ensure processing speed.
In any of the foregoing technical solutions, preferably, the storage unit is specifically configured to store the words having an association relationship according to the attribute of the words having an association relationship.
In this technical solution, the words are stored according to the attributes of the words having an association relationship. For example, the attributes of a word include: body parts (such as head and limbs), predicates (such as pain and strain), diseases (such as fever and heart disease), medicines (such as Gregorian tablets and glucose injection), treatment means (such as drip and anesthesia), and ignored words (such as "home" and "patient") that do not contribute to information extraction. Storing by attribute makes the storage of associated words more orderly.
Through the technical solution of the invention, words having an association relationship in medical texts can be mined more accurately and comprehensively, so that the medical word bank constructed from those words is more accurate and comprehensive.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Fig. 1 shows a flow diagram of a medical information processing method according to an embodiment of the present invention.
As shown in fig. 1, a medical information processing method according to an embodiment of the present invention includes:
step 102, performing word segmentation on a plurality of medical texts, and clustering the plurality of medical texts;
step 104, determining the association degree of every two medical texts according to the words of every two medical texts in the medical texts of the same category;
step 106, judging whether words of any two medical texts in the medical texts of the same category have an association relationship according to the association degree of every two medical texts; if so, proceeding to step 108, otherwise ending the process;
step 108, performing association storage on the words having the association relationship.
In this technical solution, the association degree of every two medical texts is determined according to the words in every two medical texts of the same category, whether an association relationship exists between the words of any two medical texts of the same category is judged according to those association degrees, and the words having an association relationship are stored in association, for example in a medical word bank, so as to build a more complete medical word bank. For example, the words in medical text A are "cold" and "fever", the words in medical text B are "fever" (a near-synonymous term) and "cough", and the words in medical text C are "cough" and "chill". A and B contain similar words (the two terms for fever), so the association degree between A and B is 30%; B and C contain the same word, "cough", so the association degree between B and C is 50%. A and C have no identical or similar words, but because A is associated with B and B is associated with C, it can be determined that A and C are associated, that is, an association relationship exists between the words of A and C. The invention can therefore also mine words with an implicit association relationship, so that associated words in medical texts are mined more accurately and comprehensively. Furthermore, a search engine for medical information can be built on the associated words, or automatic analysis of medical text information can be realized, which makes it convenient for outpatient doctors and patients to look up diseases and symptoms.
Preferably, the plurality of medical texts may be electronic medical records from the medical system of a hospital, or may be obtained from professional medical websites by a crawler program. Because the volume of medical texts is large, they can be stored in a distributed file system.
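The A, B, C example can be sketched in code as follows. This is only one possible reading, not the claimed method: it assumes texts are linked whenever their association degree reaches a hypothetical threshold (the text states none), and that texts linked through an intermediate text are treated as associated. The names `associated_groups`, `associated_word_pairs` and `THRESHOLD` are illustrative, and the second word of text C is written here as "chill".

```python
from itertools import combinations

# Association degrees taken from the example: A-B 30%, B-C 50%.
# A and C share no words, so no direct degree exists for them.
PAIR_DEGREE = {("A", "B"): 0.30, ("B", "C"): 0.50}
WORDS = {"A": ["cold", "fever"], "B": ["fever", "cough"], "C": ["cough", "chill"]}
THRESHOLD = 0.20  # hypothetical cut-off; the text does not state one


def associated_groups(pair_degree, threshold):
    """Return sets of texts linked directly or through intermediate texts."""
    graph = {}
    for (left, right), degree in pair_degree.items():
        graph.setdefault(left, set())
        graph.setdefault(right, set())
        if degree >= threshold:
            graph[left].add(right)
            graph[right].add(left)
    groups, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over association edges
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


def associated_word_pairs(words_by_text, groups):
    """Pair words from different texts that fall in the same group."""
    pairs = set()
    for group in groups:
        for left, right in combinations(sorted(group), 2):
            for w1 in words_by_text[left]:
                for w2 in words_by_text[right]:
                    if w1 != w2:
                        pairs.add(tuple(sorted((w1, w2))))
    return pairs


GROUPS = associated_groups(PAIR_DEGREE, THRESHOLD)
PAIRS = associated_word_pairs(WORDS, GROUPS)
```

With these inputs all three texts end up in one group, so a pair such as "cold" (from A) and "chill" (from C) is reported even though A and C share no words. Raising the threshold above 30% would separate A from B and C.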
In the above technical solution, preferably, step 108 further includes: determining the association degree of words in any two medical texts according to the association degree of any two medical texts; and storing the association degree of the words in any two medical texts.
In this technical solution, the association degree of the words in any two medical texts is determined according to the association degree of those two medical texts. Specifically, the association degree of the two medical texts can be used directly as the association degree of their words, or the association degree of the words can be calculated by a preset algorithm, so that the strength of the relationship between the words is reflected more accurately and intuitively. For example, the words in medical text A are "cold" and "fever", the words in medical text C are "cough" and "chill", and the association degree between A and C is 10%; the association degree between "cold" and "cough" is then 10%.
In any one of the above technical solutions, preferably, the step of segmenting the plurality of medical texts specifically includes: performing word segmentation on the medical texts according to a dictionary and the parts of speech of the words in the medical texts.
In this technical solution, the medical texts can be segmented according to the words in a dictionary (preferably a medical dictionary) and the parts of speech of the words. Specifically, the medical texts are segmented according to the words in the dictionary; if a word in a medical text does not exist in the dictionary, its part of speech is used to judge whether it is associated with the preceding and following words and whether they need to be combined into a new word. This effectively avoids wrongly cut and omitted words and helps ensure that segmentation is accurate and comprehensive. Preferably, the words obtained by segmenting the medical texts are medical words, which avoids interference from irrelevant words (such as "every day", "patient" and "home") when the association degree of the medical texts is determined.
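The simpler of the two options described above, reusing the text-level association degree for the words, might look like the sketch below. The function name `word_association_degrees` is hypothetical, and the "preset algorithm" alternative is not shown because the text does not define it. The second word of text C is written here as "chill".

```python
from itertools import product


def word_association_degrees(words_a, words_b, text_degree):
    """Give every cross-text word pair the association degree of the two texts."""
    degrees = {}
    for w1, w2 in product(words_a, words_b):
        if w1 != w2:  # a word is not paired with itself
            degrees[frozenset((w1, w2))] = text_degree
    return degrees


# Texts A and C from the example, with a text-level association degree of 10%.
DEGREES = word_association_degrees(["cold", "fever"], ["cough", "chill"], 0.10)
```

Using `frozenset` keys keeps the relation symmetric, so looking up "cold" and "cough" in either order returns the same degree.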
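As a rough illustration of dictionary-based segmentation, the sketch below uses forward maximum matching and then merges adjacent out-of-dictionary fragments into one candidate new word. Both choices are assumptions: the text names neither a matching strategy nor a concrete part-of-speech rule, so the merge step stands in for the part-of-speech judgment in a very simplified way. The dictionary contents, the tag names and the unspaced ASCII input are invented for the demonstration.

```python
def segment(text, dictionary, max_len=8):
    """Forward maximum matching; uncovered characters become 'unk' tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                tokens.append((piece, dictionary[piece]))
                i += size
                break
        else:
            tokens.append((text[i], "unk"))
            i += 1
    return tokens


def merge_unknown(tokens):
    """Merge runs of adjacent unknown fragments into one candidate new word."""
    merged = []
    for word, tag in tokens:
        if tag == "unk" and merged and merged[-1][1] == "unk":
            merged[-1] = (merged[-1][0] + word, "unk")
        else:
            merged.append((word, tag))
    return merged


# Hypothetical mini dictionary mapping each word to a part-of-speech-like tag.
DICTIONARY = {"fever": "disease", "dry": "adj", "cough": "symptom", "drycough": "symptom"}
TOKENS = merge_unknown(segment("feverdrycoughzq", DICTIONARY))
```

Longest-match-first keeps "drycough" as one word instead of splitting it into "dry" and "cough", and the trailing characters that are not in the dictionary are combined into a single candidate word rather than being dropped.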
In any one of the above technical solutions, preferably, the step of clustering the plurality of medical texts specifically includes: clustering the plurality of medical texts according to international disease classification and K-means algorithm.
In this technical solution, the plurality of medical texts can be clustered according to the International Classification of Diseases (ICD) and a K-means algorithm. Because medical texts clustered into the same category concern the same disease, their words are more likely to be associated; further processing is then carried out on medical texts of the same category, which helps ensure processing speed.
In any of the above technical solutions, preferably, step 108 specifically includes: storing the words having the association relationship according to the attributes of those words.
In this technical solution, the words are stored according to the attributes of the words having an association relationship. For example, the attributes of a word include: body parts (such as head and limbs), predicates (such as pain and strain), diseases (such as fever and heart disease), medicines (such as Gregorian tablets and glucose injection), treatment means (such as drip and anesthesia), and ignored words (such as "home" and "patient") that do not contribute to information extraction. Storing by attribute makes the storage of associated words more orderly.
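The text does not say how the ICD and K-means are combined. One plausible arrangement, shown below purely as an assumption, is to group texts by ICD code first and then run K-means inside each group. The K-means here is a plain pure-Python version with deterministic initialisation (the first k points) so that the example is reproducible; the codes, vectors and function names are illustrative.

```python
from collections import defaultdict


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_texts(texts, k):
    """texts: (text_id, icd_code, vector) triples. Group by ICD code, then k-means."""
    by_icd = defaultdict(list)
    for text_id, icd_code, vector in texts:
        by_icd[icd_code].append((text_id, vector))
    clusters = defaultdict(list)
    for icd_code, items in by_icd.items():
        vectors = [vector for _, vector in items]
        labels = kmeans(vectors, min(k, len(vectors)))
        for (text_id, _), label in zip(items, labels):
            clusters[(icd_code, label)].append(text_id)
    return dict(clusters)


# Illustrative data: word-count style vectors with ICD-10-style codes.
TEXTS = [
    ("t1", "J00", [1.0, 0.0]),
    ("t2", "J00", [0.9, 0.1]),
    ("t3", "J00", [0.0, 1.0]),
    ("t4", "I10", [1.0, 1.0]),
]
CLUSTERS = cluster_texts(TEXTS, k=2)
```

Only texts that land in the same (ICD code, cluster) bucket would then be compared pairwise, which is where the saving in processing time would come from.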
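Attribute-based storage could be modelled as in the following sketch. The attribute list follows the categories named above, but the class `MedicalWordBank`, its method names and the in-memory layout are hypothetical; the text does not describe a storage format. Skipping associations that involve ignored words is likewise an assumption.

```python
from collections import defaultdict
from enum import Enum


class Attribute(Enum):
    BODY_PART = "body part"
    PREDICATE = "predicate"
    DISEASE = "disease"
    MEDICINE = "medicine"
    TREATMENT = "treatment means"
    IGNORED = "ignored word"


class MedicalWordBank:
    """Hypothetical in-memory word bank organised by word attribute."""

    def __init__(self):
        self.attribute_of = {}
        self.words_by_attribute = defaultdict(set)
        self.associations = defaultdict(dict)

    def add_word(self, word, attribute):
        self.attribute_of[word] = attribute
        self.words_by_attribute[attribute].add(word)

    def associate(self, w1, w2, degree):
        """Store a symmetric association; pairs with ignored words are skipped."""
        if Attribute.IGNORED in (self.attribute_of[w1], self.attribute_of[w2]):
            return False
        self.associations[w1][w2] = degree
        self.associations[w2][w1] = degree
        return True

    def related(self, word, attribute=None):
        """Return associated words, optionally restricted to one attribute."""
        return {
            other: degree
            for other, degree in self.associations.get(word, {}).items()
            if attribute is None or self.attribute_of[other] is attribute
        }


bank = MedicalWordBank()
bank.add_word("head", Attribute.BODY_PART)
bank.add_word("pain", Attribute.PREDICATE)
bank.add_word("fever", Attribute.DISEASE)
bank.add_word("patient", Attribute.IGNORED)
stored = bank.associate("head", "pain", 0.4)
skipped = bank.associate("patient", "fever", 0.3)
```

Keeping the attribute alongside each word lets a lookup be narrowed, for example to only the predicates associated with a body part.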
Fig. 2 shows a schematic configuration diagram of a medical information processing apparatus according to an embodiment of the present invention.
As shown in fig. 2, a medical information processing apparatus 200 according to an embodiment of the present invention includes: the processing unit 202 is configured to perform word segmentation on a plurality of medical texts and perform clustering on the plurality of medical texts; the first determining unit 204 is configured to determine, according to words of every two medical texts in the medical texts of the same category, a degree of association between every two medical texts; the judging unit 206 is configured to judge whether words of any two medical texts in the medical texts of the same category have an association relationship according to the association degree of each two medical texts; and a storage unit 208, configured to, if the determination result is yes, associate and store the words having an association relationship.
In this technical solution, the association degree of every two medical texts is determined according to the words in every two medical texts of the same category, whether an association relationship exists between the words of any two medical texts of the same category is judged according to those association degrees, and the words having an association relationship are stored in association, for example in a medical word bank, so as to build a more complete medical word bank. For example, the words in medical text A are "cold" and "fever", the words in medical text B are "fever" (a near-synonymous term) and "cough", and the words in medical text C are "cough" and "chill". A and B contain similar words (the two terms for fever), so the association degree between A and B is 30%; B and C contain the same word, "cough", so the association degree between B and C is 50%. A and C have no identical or similar words, but because A is associated with B and B is associated with C, it can be determined that A and C are associated, that is, an association relationship exists between the words of A and C. The invention can therefore also mine words with an implicit association relationship, so that associated words in medical texts are mined more accurately and comprehensively. Furthermore, a search engine for medical information can be built on the associated words, or automatic analysis of medical text information can be realized, which makes it convenient for outpatient doctors and patients to look up diseases and symptoms.
Preferably, the plurality of medical texts may be electronic medical records from the medical system of a hospital, or may be obtained from professional medical websites by a crawler program. Because the volume of medical texts is large, they can be stored in a distributed file system.
In the above technical solution, preferably, the storage unit 208 includes: the second determining unit 2082, configured to determine association degrees of words in any two medical texts according to the association degrees of any two medical texts; the storage unit 208 is specifically configured to store the association degrees of the words in any two medical texts.
In this technical solution, the association degree of the words in any two medical texts is determined according to the association degree of those two medical texts. Specifically, the association degree of the two medical texts can be used directly as the association degree of their words, or the association degree of the words can be calculated by a preset algorithm, so that the strength of the relationship between the words is reflected more accurately and intuitively. For example, the words in medical text A are "cold" and "fever", the words in medical text C are "cough" and "chill", and the association degree between A and C is 10%; the association degree between "cold" and "cough" is then 10%.
In any of the above technical solutions, preferably, the processing unit 202 includes: the word segmentation unit 2022 is configured to segment words of the plurality of medical texts according to the dictionary and parts of speech of the words in the plurality of medical texts.
In this technical solution, the medical texts can be segmented according to the words in a dictionary (preferably a medical dictionary) and the parts of speech of the words. Specifically, the medical texts are segmented according to the words in the dictionary; if a word in a medical text does not exist in the dictionary, its part of speech is used to judge whether it is associated with the preceding and following words and whether they need to be combined into a new word. This effectively avoids wrongly cut and omitted words and helps ensure that segmentation is accurate and comprehensive. Preferably, the words obtained by segmenting the medical texts are medical words, which avoids interference from irrelevant words (such as "every day", "patient" and "home") when the association degree of the medical texts is determined.
In any of the above technical solutions, preferably, the processing unit 202 includes: a clustering unit 2024, configured to cluster the plurality of medical texts according to international disease classification and K-means algorithm.
In this technical solution, the plurality of medical texts can be clustered according to the International Classification of Diseases (ICD) and a K-means algorithm. Because medical texts clustered into the same category concern the same disease, their words are more likely to be associated; further processing is then carried out on medical texts of the same category, which helps ensure processing speed.
In any of the foregoing technical solutions, preferably, the storage unit 208 is specifically configured to store the words having an association relationship according to the attribute of the words having an association relationship.
In this technical solution, the words are stored according to the attributes of the words having an association relationship. For example, the attributes of a word include: body parts (such as head and limbs), predicates (such as pain and strain), diseases (such as fever and heart disease), medicines (such as Gregorian tablets and glucose injection), treatment means (such as drip and anesthesia), and ignored words (such as "home" and "patient") that do not contribute to information extraction. Storing by attribute makes the storage of associated words more orderly.
Fig. 3 shows a schematic diagram of a medical information processing apparatus according to an embodiment of the invention.
As shown in fig. 3, the medical information processing apparatus 300 first obtains medical texts from professional medical websites by crawler technology and obtains electronic medical records from the medical system of a hospital. Because the amount of information obtained from these two sources is large, the medical texts and the electronic medical records are stored together in a distributed file system as the plurality of medical texts. Word segmentation and clustering are performed on the plurality of medical texts, and the association degree of every two medical texts in the same category is then calculated by the Jaccard method according to the words in the two texts. For example, for two medical texts A and B, the words of A after word segmentation are "patient", "sore and itchy throat", "no phlegm", "stomach distension" and "lumbago", and the words of B after word segmentation are "dry cough", "sore and itchy throat", "no phlegm", "stomachache", "waist soreness" and "fear of cold". Calculation yields the identical word pairs "sore and itchy throat" / "sore and itchy throat" and "no phlegm" / "no phlegm", and the highly similar word pairs "stomach distension" / "stomachache" and "lumbago" / "waist soreness". A vector cosine method is then used to determine whether any two medical texts of the same category have an association relationship, which yields word associations that cannot be obtained from the Jaccard similarity calculation.
For example, given the two medical texts A and B above and another medical text C with its own segmented words, calculation shows that medical records A and C have an association relationship, so the words in A and C have an association relationship, for example an association with the word "tonsil inflammation". The words having an association relationship are then stored in a medical word bank, thereby building a medical word bank oriented to actual medical scenarios.
The technical solution of the invention has been described in detail above with reference to the accompanying drawings. By analyzing real data (i.e. medical records) in the medical system of a hospital and medical texts from professional medical websites, words having an association relationship in medical texts can be mined more accurately and comprehensively, thereby building a medical word bank oriented to actual medical scenarios.
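The two similarity measures named in this embodiment can be illustrated on the word lists of texts A and B. The sketch below assumes the simplest variants: Jaccard similarity on word sets and cosine similarity on word-count vectors, counting only exact matches. It does not model the "highly similar" pairs, since the text does not say how word similarity is scored, and the throat symptom that the text treats as identical in A and B is written with one spelling here.

```python
import math
from collections import Counter


def jaccard(words_a, words_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine(words_a, words_b):
    """Cosine similarity between word-count vectors."""
    count_a, count_b = Counter(words_a), Counter(words_b)
    dot = sum(count_a[w] * count_b[w] for w in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


WORDS_A = ["patient", "sore and itchy throat", "no phlegm", "stomach distension", "lumbago"]
WORDS_B = ["dry cough", "sore and itchy throat", "no phlegm", "stomachache",
           "waist soreness", "fear of cold"]

JACCARD_AB = jaccard(WORDS_A, WORDS_B)  # 2 shared words out of 9 distinct words
COSINE_AB = cosine(WORDS_A, WORDS_B)    # 2 / sqrt(5 * 6)
```

Texts that share no words score zero under both measures as written, so an indirect link such as the one between A and C would need richer vectors or an intermediate text; how the embodiment achieves this is not specified.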
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.