Method and system for visual report based on medical document
Technical Field
The invention belongs to the technical field of data or information processing, particularly relates to processing of medical big data, and more particularly relates to a method and a system for a visual report of a medical document.
Background
In the medical industry, medical data includes the specific diagnosis and treatment data of hospitals. This data is highly specialized and is mainly stored within the individual departments of hospitals, so it is not easy to acquire through common channels. However, since medical document data (invoices, prescriptions, etc.) is held by the patient, it is easy to collect and can also be acquired through insurance company settlement channels. As a result, such medical document data is growing geometrically. The resulting problem is that visualization systems for medical document big data are severely lacking.
When facing massive data, browsing the records one by one becomes meaningless, so a visualization system is required. For such a system, the data and data dimensions of different industries lead to vastly different final report presentations.
With the rise of the big data concept, various industries have paid close attention to collecting and storing their own data. Big data analysis already has certain known applications; for example, patent application No. 201610497249 relates to a method for establishing a disease cloud picture based on big data analysis, and patent application No. 201710150587.8 relates to an intelligent environment-friendly big data visualization method. But medical big data has its own specificity, since it involves diseases, disease categories, and patient attributes such as age and sex. How to present these different dimensions in a unified manner for the analysis of disease prevention and control is a problem to be solved.
Disclosure of Invention
In view of the above requirements, the present invention provides a method for a visual report based on medical documents.
The invention discloses a method for a visual report based on a medical document, which mainly comprises the following steps:
1) Collecting data of medical documents
2) Separating medical document data into disease data and patient data
3) Analyzing the disease category data with a clustering algorithm, and then presenting the analysis result as a disease category distribution map
4) Analyzing the patient population data with a population attribute label algorithm and an association rule mining algorithm, and then presenting the analysis result as a network relationship graph of the patient population
The method for analyzing the disease category data comprises the following steps:
The disease data is obtained from the prescription on the medical document and from the disease name in the diagnosis certificate.
The ICD-10 medical directory is used as the tree-structured directory, and a clustering algorithm is then applied to the specific diseases on the directory tree. The specific process is as follows:
A) Organizing the ICD-10 directory as relational data in three levels: DS1, DS2 and DS3.
B) Locating the specific disease record at the DS3 level by similarity search and error correction.
The search traverses the disease names on the document and calculates the edit distance between each of them and the DS3-level disease names.
The algorithm is as follows:
B1) If the length of str1 or str2 is 0, the length of the other string is returned; for example, if (str1.length == 0), the length of str2 is returned.
B2) An (n+1) × (m+1) matrix d is initialized, with the values of the first row and the first column increasing from 0. The two strings are scanned (n × m comparisons): if str1[i] == str2[j], temp is set to 0; otherwise temp is set to 1. Then d[i, j] is assigned the minimum of d[i-1, j] + 1, d[i, j-1] + 1 and d[i-1, j-1] + temp.
B3) After scanning, the last value of the matrix, d[n, m], is returned; this is the edit distance between the two strings.
B4) The distance is computed against all DS3-level diseases. If the distance is 0 or below a threshold, the comparison is a hit, and the disease on the document is considered to be that DS3 disease.
C) For each DS3 disease, the number of patients is recorded.
D) All DS3-level counts are summed up to the DS2 level, and all DS2-level counts are summed up to the DS1 level. Thus, the patient frequency can be obtained at any level.
E) Finally, the disease frequency and the number of patients can be summarized according to the tree structure.
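Steps B1) to B4) can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: the function names, the default threshold of 2, and the choice to return the closest DS3 name among several hits are assumptions made for the example.

```python
def edit_distance(str1: str, str2: str) -> int:
    """Edit distance following steps B1) to B3)."""
    n, m = len(str1), len(str2)
    # B1: if either string is empty, the distance is the other string's length.
    if n == 0:
        return m
    if m == 0:
        return n
    # B2: (n+1) x (m+1) matrix whose first row and column count up from 0.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            temp = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + temp)  # match or substitution
    # B3: the last cell holds the distance.
    return d[n][m]


def match_ds3(name, ds3_names, threshold=2):
    """B4: return the closest DS3 name within the threshold, else None.

    The threshold value and the closest-hit rule are assumptions.
    """
    best_name, best_dist = None, None
    for candidate in ds3_names:
        dist = edit_distance(name, candidate)
        if best_dist is None or dist < best_dist:
            best_name, best_dist = candidate, dist
    if best_dist is not None and best_dist <= threshold:
        return best_name
    return None
```

For instance, a misspelled name such as "asthmaa" would be mapped to "asthma", while a name that is far from every DS3 entry yields no hit.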
Through the above method, the visual report based on the disease category distribution map is finally presented. The invention adopts a rectangular tree diagram (treemap) to present the number of cases of each disease; the larger the area of a region, the more cases it represents. The main purpose of the rectangular tree diagram is to show the overall situation clearly in one diagram: the size of each region is determined by the size of the corresponding component, and the regions are also grouped.
The specific drawing method is as follows: first, the share of each third-level disease in the total is calculated from the number of cases of that disease, and the area of each third-level disease on the rectangle is then determined from that share. Once the rectangular areas of all third-level diseases are determined, the areas of the second-level and first-level diseases are determined as well.
Disease data is classified into three levels according to the ICD-10 catalog. First-level diseases are presented as regions of different colors, as illustrated in FIG. 2. Second-level and third-level diseases are both represented by subdivided regions within the first-level region. Clicking on any first-level region focuses on that level and shows its information in detail. For example, after clicking on respiratory diseases, further information on this category is presented.
The method for analyzing the data of the patient population comprises the following steps:
The data sources include the tree structure of each disease (obtained by the disease data analysis method above) and the population attribute labels of the patient data.
The population attribute label of a patient is derived from the age, sex and medical insurance card number recorded on the medical document (such as a medical record card), and different user groups are then formed according to age and sex.
Then, the disease data and the patient data are used together for association rule mining, mainly by means of the Apriori algorithm.
The Apriori algorithm is the most influential algorithm for mining the frequent itemsets of Boolean association rules. Its name is based on the fact that the algorithm uses prior knowledge of the properties of frequent itemsets. Apriori uses an iterative approach called layer-by-layer search, in which k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find the set L2 of frequent 2-itemsets, L2 is used to find L3, and so on, until no frequent k-itemset can be found. Finding each Lk requires one scan of the database.
All transactions are first scanned to produce the candidate 1-itemsets C1, and the frequent 1-itemsets are obtained by filtering out the itemsets that do not meet the support requirement. The following recursive operation is then performed:
Given the frequent k-itemsets (the frequent 1-itemsets are already known), all possible (k+1)-itemsets are generated by joining the frequent k-itemsets and are then pruned (if any k-item subset of a (k+1)-itemset does not satisfy the support condition, that (k+1)-itemset is pruned) to obtain the candidate set Ck+1. The itemsets in Ck+1 that do not satisfy the support condition are then filtered out to obtain the frequent (k+1)-itemsets. If the resulting Ck+1 is empty, the algorithm ends.
The join method is as follows: suppose the items in every itemset of Lk are arranged in the same order; if the first k-1 items of Lk[i] and Lk[j] are identical and the k-th items differ, then Lk[i] and Lk[j] are joinable. For example, {I1, I2} and {I1, I3} in L2 are joinable, and the join yields {I1, I2, I3}, whereas {I1, I2} and {I2, I3} are not joinable, since otherwise duplicate itemsets would be generated.
As a further example of pruning, in the process of generating C3 from L2, the 3-itemsets obtained by enumeration include {I1, I2, I3}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5}, but {I3, I4} and {I4, I5} do not appear in L2, so {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5} are pruned.
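The join and pruning steps described above can be sketched as follows. This is an illustrative sketch only: the function name apriori_gen and the sample L2 below are assumptions chosen for the example and are not data taken from this document.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets.

    Each itemset is a tuple whose items are in the same sorted order.
    """
    frequent = set(frequent_k)
    ordered = sorted(frequent)
    candidates = []
    for a_idx in range(len(ordered)):
        for b_idx in range(a_idx + 1, len(ordered)):
            a, b = ordered[a_idx], ordered[b_idx]
            # Join: the first k-1 items are identical, the k-th item differs.
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidate = a + (b[-1],)
                k = len(a)
                # Prune: every k-item subset must itself be frequent.
                if all(sub in frequent for sub in combinations(candidate, k)):
                    candidates.append(candidate)
    return candidates


# Hypothetical L2, used only to exercise the function.
sample_l2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(sample_l2))
```

With this assumed L2, a candidate such as (I2, I3, I4) is discarded because its subset (I3, I4) is not in L2, which matches the pruning rule stated above.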
Through the above method, the network relationship graph of the patient population is finally presented. Association rule mining is used to find the intrinsic relationship between disease categories and the attributes of susceptible populations. The specific method is as follows:
First, for each disease record, the first-level code DS1 of the disease category and the group code PG of the patient's population attributes are calculated, and a one-dimensional array [DS1, PG] is constructed;
Second, all disease records are scanned, and the one-dimensional array obtained in the first step for each record is filled into a new array to construct a high-dimensional array;
Third, association rule mining is performed on the high-dimensional array to obtain the frequency weight value FP of each distinct combination of DS1 and PG. Since high-frequency relationships are of interest, the 80 highest-frequency results are taken and filled into GEXF-format data. GEXF is an XML-based language for describing complex network relationships; it generally specifies the nodes first and then establishes the relationships (edges) between them. DS1 and PG are filled in as GEXF nodes, and the corresponding FP value is filled in as the edge.
Finally, the relationship graph is rendered from the GEXF data. Red denotes disease categories and deep blue denotes population attributes. The population attributes are grouped by age group and sex, and the disease categories are classified by the first-level categories of ICD-10. After the weight of the relationship between a population group and a disease category is calculated, the weight value FP is displayed on the connecting edge. A higher weight value indicates that the population group is more susceptible to the disease. Since the FP value is mined from the frequency relationship between the population attribute PG and the disease code DS1, a high FP value means that the combination of that population attribute and that disease occurs with high frequency in the data.
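The construction of the [DS1, PG] arrays and the selection of the highest-frequency combinations can be sketched as follows. This is a simplified illustration: it counts each (DS1, PG) pair directly, which for two-item combinations equals the support count that association rule mining would report. The function name and the sample labels are assumptions.

```python
from collections import Counter


def top_pairs(records, limit=80):
    """Count each (DS1, PG) combination and keep the most frequent ones.

    records: iterable of (ds1_code, pg_code) pairs, one per disease record.
    Returns a list of ((ds1_code, pg_code), fp) sorted by descending fp.
    """
    pairs = [(ds1, pg) for ds1, pg in records]  # the array of [DS1, PG] rows
    fp = Counter(pairs)                         # frequency weight per combination
    return fp.most_common(limit)


# Hypothetical records for illustration only.
sample_records = ([("respiratory", "M_0-17")] * 3
                  + [("digestive", "F_40-59")] * 2
                  + [("respiratory", "F_60+")])
print(top_pairs(sample_records, limit=2))
```

The default limit of 80 mirrors the number of results stated in the text; any other cutoff could be passed instead.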
Correspondingly, the invention provides a system for a visual report based on big data analysis of medical documents, which mainly comprises the following modules:
1) a data acquisition and classification module, used for acquiring medical document data and dividing it into disease data and patient data;
2) a data analysis module, comprising a disease category data analysis module and a patient population data analysis module;
3) a visual report module, which presents the analysis results as the disease category distribution map and the network relationship graph of the patient population, respectively.
Aiming at the specificity of medical big data (which involves diseases, disease categories, and patient attributes such as age and sex), the invention provides a solution that presents the different dimensions in a unified way for analysis, which facilitates disease prevention and control. It addresses the lack of visualization systems for medical document big data, which is growing geometrically, and has good application and popularization value.
Drawings
FIG. 1 is a basic flow diagram of the method and system of the present invention.
FIG. 2 is a schematic diagram of the number of cases by disease (example).
FIG. 3 is a schematic diagram of the number of cases of respiratory diseases (example).
FIG. 4 is a network relationship graph of the intrinsic relationship between disease categories and susceptible population attributes (example).
Detailed Description
The invention is further illustrated, but not limited, by the following description of specific embodiments.
First, the main process of the method of the present invention
1) Collecting data of medical documents
2) Separating medical document data into disease data and patient data
3) Analyzing the disease category data with a clustering algorithm, and then presenting the analysis result as a disease category distribution map
4) Analyzing the patient population data with a population attribute label algorithm and an association rule mining algorithm, and then presenting the analysis result as a network relationship graph of the patient population
Second, description of analytical methods
1. Method for analyzing data of disease category distribution map
The disease data is obtained from the prescription on the medical document and from the disease name in the diagnosis certificate.
The ICD-10 medical directory is used as the tree-structured directory, and a clustering algorithm is then applied to the specific diseases on the directory tree. The process is as follows:
A) Organizing the ICD-10 directory as relational data in three levels: DS1, DS2 and DS3.
B) Locating the specific disease record at the DS3 level by similarity search and error correction.
The search traverses the disease names on the document and calculates the edit distance between each of them and the DS3-level disease names.
The algorithm is as follows:
B1) If the length of str1 or str2 is 0, the length of the other string is returned; for example, if (str1.length == 0), the length of str2 is returned.
B2) An (n+1) × (m+1) matrix d is initialized, with the values of the first row and the first column increasing from 0. The two strings are scanned (n × m comparisons): if str1[i] == str2[j], temp is set to 0; otherwise temp is set to 1. Then d[i, j] is assigned the minimum of d[i-1, j] + 1, d[i, j-1] + 1 and d[i-1, j-1] + temp.
B3) After scanning, the last value of the matrix, d[n, m], is returned; this is the edit distance between the two strings.
B4) The distance is computed against all DS3-level diseases. If the distance is 0 or below a threshold, the comparison is a hit, and the disease on the document is considered to be that DS3 disease.
C) For each DS3 disease, the number of patients is recorded.
D) All DS3-level counts are summed up to the DS2 level, and all DS2-level counts are summed up to the DS1 level. Thus, the patient frequency can be obtained at any level.
E) Finally, the disease frequency and the number of patients can be summarized according to the tree structure.
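Steps C) to E) amount to rolling counts up the three-level tree. A minimal sketch is given below; the parent tables DS3_TO_DS2 and DS2_TO_DS1 and the disease names in them are hypothetical stand-ins for the relational ICD-10 directory, and the function name is an assumption.

```python
from collections import Counter

# Hypothetical parent links standing in for the relational ICD-10 directory.
DS3_TO_DS2 = {
    "pneumonia": "influenza and pneumonia",
    "influenza": "influenza and pneumonia",
    "asthma": "chronic lower respiratory diseases",
    "gastritis": "diseases of the stomach",
}
DS2_TO_DS1 = {
    "influenza and pneumonia": "respiratory diseases",
    "chronic lower respiratory diseases": "respiratory diseases",
    "diseases of the stomach": "digestive diseases",
}


def roll_up(ds3_hits):
    """Count DS3 hits, then sum them up to the DS2 and DS1 levels."""
    ds3 = Counter(ds3_hits)                   # step C: count per DS3 disease
    ds2 = Counter()
    for name, count in ds3.items():           # step D: DS3 -> DS2
        ds2[DS3_TO_DS2[name]] += count
    ds1 = Counter()
    for name, count in ds2.items():           # step D: DS2 -> DS1
        ds1[DS2_TO_DS1[name]] += count
    return ds1, ds2, ds3                      # step E: counts at every level


hits = ["pneumonia", "pneumonia", "influenza", "asthma", "gastritis"]
ds1_counts, ds2_counts, ds3_counts = roll_up(hits)
print(ds1_counts, ds2_counts, ds3_counts)
```

Because each level is derived only from the level below it, the totals stay consistent across DS1, DS2 and DS3.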
2. Method for analyzing data of network relation graph of patient population
Two sources of data are needed, namely the tree structure of each disease (obtained by the data analysis for the disease category distribution map) and the population attribute labels of the patient data.
The population attribute label of a patient is derived from the age, sex and medical insurance card number recorded on the medical document (such as a medical record card), and different user groups are then formed according to age and sex.
Then, the disease data and the patient data are used together for association rule mining.
The Apriori algorithm is mainly adopted for mining the association rules.
The Apriori algorithm is the most influential algorithm for mining the frequent itemsets of Boolean association rules. Its name is based on the fact that the algorithm uses prior knowledge of the properties of frequent itemsets. Apriori uses an iterative approach called layer-by-layer search, in which k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find the set L2 of frequent 2-itemsets, L2 is used to find L3, and so on, until no frequent k-itemset can be found. Finding each Lk requires one scan of the database.
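Forming user groups from age and sex can be sketched as a simple labelling function. The age bands and the label format below are assumptions for illustration; the document does not specify the actual grouping boundaries.

```python
# Hypothetical age bands: (lowest age, highest age, label).
AGE_BANDS = [(0, 17, "0-17"), (18, 39, "18-39"), (40, 59, "40-59")]


def population_group(age: int, sex: str) -> str:
    """Return a group code PG such as 'F_18-39' from age and sex."""
    band = "60+"  # default band for ages above the last listed range
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            band = label
            break
    return f"{sex}_{band}"


print(population_group(35, "F"))
print(population_group(72, "M"))
```

The resulting group code can then serve as the PG value paired with a disease category in the association rule mining.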
The idea of the algorithm is briefly described below. Simply stated, if an itemset I is not a frequent itemset, then no larger itemset that contains I can be a frequent itemset.
The algorithm raw data is as follows:
The basic process of the algorithm is as follows:
All transactions are first scanned to produce the candidate 1-itemsets C1, and the frequent 1-itemsets are obtained by filtering out the itemsets that do not meet the support requirement.
The following recursive operation is then performed:
Given the frequent k-itemsets (the frequent 1-itemsets are already known), all possible (k+1)-itemsets are generated by joining the frequent k-itemsets and are then pruned (if any k-item subset of a (k+1)-itemset does not satisfy the support condition, that (k+1)-itemset is pruned) to obtain the candidate set Ck+1. The itemsets in Ck+1 that do not satisfy the support condition are then filtered out to obtain the frequent (k+1)-itemsets. If the resulting Ck+1 is empty, the algorithm ends.
The join method is as follows: suppose the items in every itemset of Lk are arranged in the same order; if the first k-1 items of Lk[i] and Lk[j] are identical and the k-th items differ, then Lk[i] and Lk[j] are joinable. For example, {I1, I2} and {I1, I3} in L2 are joinable, and the join yields {I1, I2, I3}, whereas {I1, I2} and {I2, I3} are not joinable, since otherwise duplicate itemsets would be generated.
As a further example of pruning, in the process of generating C3 from L2, the 3-itemsets obtained by enumeration include {I1, I2, I3}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5}, but {I3, I4} and {I4, I5} do not appear in L2, so {I2, I3, I4}, {I2, I3, I5} and {I2, I4, I5} are pruned.
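The overall layer-by-layer search, with one pass over the transactions per level, can be sketched as follows. This is a compact illustration under stated assumptions: min_support is an absolute count, itemsets are frozensets, and candidates are formed by a simplified union-based join followed by the subset pruning rule, instead of the ordered prefix join described above.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count_and_filter(candidates):
        # One pass over the transactions for the current level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # C1 -> L1: every single item is a candidate 1-itemset.
    level = count_and_filter({frozenset([i]) for t in transactions for i in t})
    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                # Keep (k+1)-itemsets whose k-item subsets are all frequent.
                if len(union) == k + 1 and all(
                        frozenset(s) in prev for s in combinations(union, k)):
                    candidates.add(union)
        level = count_and_filter(candidates)  # C(k+1) -> L(k+1)
        frequent.update(level)
        k += 1
    return frequent
```

The loop stops as soon as a level yields no frequent itemsets, which corresponds to the termination condition stated above.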
Third, style and data structure of the visual report
1. Visual report based on disease category distribution map
The rectangular tree diagram (treemap) shows the number of cases of each disease; the larger the area of a region, the more cases it represents. The main purpose of the rectangular tree diagram is to show the overall situation clearly in one diagram: the size of each region is determined by the size of the corresponding component, and the regions are also grouped.
The specific drawing method is as follows: first, the share of each third-level disease in the total is calculated from the number of cases of that disease, and the area of each third-level disease on the rectangle is then determined from that share. Once the rectangular areas of all third-level diseases are determined, the areas of the second-level and first-level diseases are determined as well. FIG. 2 is an illustration of a disease category distribution map.
Disease data is classified into three levels according to the ICD-10 catalog. First-level diseases are presented as regions of different colors, as illustrated in FIG. 2. Second-level and third-level diseases are both represented by subdivided regions within the first-level region. Clicking on any first-level region focuses on that level and shows its information in detail. For example, after clicking on respiratory diseases, further information on this category is presented, as illustrated in FIG. 3.
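The proportional area calculation can be sketched as follows. The document does not state which layout algorithm places the rectangles, so the one-directional slice layout below is an assumption used only to show how each third-level area follows from its share of the total; the function name and sample counts are hypothetical, and the counts are assumed to be non-empty and positive.

```python
def slice_layout(counts, x, y, width, height):
    """Split a rectangle into vertical slices proportional to the counts.

    counts: {disease name: number of cases}.
    Returns {disease name: (x, y, width, height)}.
    """
    total = sum(counts.values())
    rects = {}
    offset = x
    # Largest diseases first, so the biggest regions appear on the left.
    for name, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        slice_width = width * count / total  # share of the total width
        rects[name] = (offset, y, slice_width, height)
        offset += slice_width
    return rects


# Hypothetical third-level counts drawn on a 100 x 50 canvas.
sample = {"pneumonia": 6, "asthma": 3, "influenza": 1}
print(slice_layout(sample, 0, 0, 100, 50))
```

Applying the same function first to the first-level counts and then inside each resulting rectangle would give the nested regions for the second and third levels.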
2. Network relation graph of disease population
Association rule mining is an important topic in data mining; as the name suggests, it discovers possible associations or connections between things from the underlying data. For example, by examining what customers buy in a shopping mall, it may be found that 30% of the customers buy both bed sheets and pillowcases, and that 80% of the customers who buy bed sheets also buy pillowcases. This reveals a hidden relationship, namely that buying bed sheets implies buying pillowcases: a large share of the customers who buy bed sheets buy pillowcases at the same time, so the mall can place bed sheets and pillowcases in the same shopping area for the customers' convenience.
In particular, the invention uses association rule mining to find the intrinsic relationship between disease categories and the attributes of susceptible populations. The specific method is as follows:
First, for each disease record, the first-level code DS1 of the disease category and the group code PG of the patient's population attributes are calculated, and a one-dimensional array [DS1, PG] is constructed;
Second, all disease records are scanned, and the one-dimensional array obtained in the first step for each record is filled into a new array to construct a high-dimensional array;
Third, association rule mining is performed on the high-dimensional array to obtain the frequency weight value FP of each distinct combination of DS1 and PG. Since high-frequency relationships are of interest, the 80 highest-frequency results are taken and filled into GEXF-format data. GEXF is an XML-based language for describing complex network relationships; it generally specifies the nodes first and then establishes the relationships (edges) between them. DS1 and PG are filled in as GEXF nodes, and the corresponding FP value is filled in as the edge.
Finally, the relationship graph is rendered from the GEXF data. Red denotes disease categories and deep blue denotes population attributes. The population attributes are grouped by age group and sex, and the disease categories are classified by the first-level categories of ICD-10. After the weight of the relationship between a population group and a disease category is calculated, the weight value FP is displayed on the connecting edge. A higher weight value indicates that the population group is more susceptible to the disease. Since the FP value is mined from the frequency relationship between the population attribute PG and the disease code DS1, a high FP value means that the combination of that population attribute and that disease occurs with high frequency in the data. FIG. 4 illustrates the network relationship graph of the intrinsic relationship between disease categories and susceptible population attributes.
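The two percentages in the shopping example correspond to the support and the confidence of a rule. A small sketch follows; the 40 baskets are hypothetical data constructed only so that the numbers match the example, and the function names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_antecedent


# Hypothetical baskets: 12 with both items, 3 with sheets only, 25 others.
baskets = [{"sheet", "pillowcase"}] * 12 + [{"sheet"}] * 3 + [{"towel"}] * 25
print(support(baskets, {"sheet", "pillowcase"}))
print(confidence(baskets, {"sheet"}, {"pillowcase"}))
```

Here 12 of 40 baskets contain both items (support 30%), and 12 of the 15 baskets with sheets also contain pillowcases (confidence 80%).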
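Filling the mined combinations into GEXF data can be sketched with the Python standard library as follows. This is an illustrative sketch: the function name, the use of the GEXF 1.2 draft namespace, and the undirected edge type are assumptions, and the red and deep blue colouring would need additional visualization attributes that are omitted here.

```python
import xml.etree.ElementTree as ET


def build_gexf(weighted_pairs):
    """Build a GEXF document from ((ds1, pg), fp) entries.

    Each DS1 code and each PG code becomes a node; each combination
    becomes an edge whose weight is the FP value.
    """
    gexf = ET.Element("gexf", {"xmlns": "http://www.gexf.net/1.2draft",
                               "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"mode": "static",
                                          "defaultedgetype": "undirected"})
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    node_ids = {}

    def node_id(label):
        # Create the node on first use and reuse its id afterwards.
        if label not in node_ids:
            node_ids[label] = str(len(node_ids))
            ET.SubElement(nodes, "node", {"id": node_ids[label], "label": label})
        return node_ids[label]

    for edge_index, ((ds1, pg), fp) in enumerate(weighted_pairs):
        ET.SubElement(edges, "edge", {"id": str(edge_index),
                                      "source": node_id(ds1),
                                      "target": node_id(pg),
                                      "weight": str(fp)})
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical mined results for illustration only.
print(build_gexf([(("respiratory", "M_0-17"), 3), (("respiratory", "F_60+"), 1)]))
```

The resulting string can be saved to a .gexf file and opened in a graph viewer that supports the format.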