All transactions are scanned first, producing the candidate 1-item set C1, and the frequent 1-item set is obtained by filtering out the item sets that do not meet the minimum support requirement.
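By way of non-limiting illustration, the first scan described above can be sketched in Python as follows. The function name, the `min_support` parameter (expressed as a fraction of all transactions) and the sample labels are assumptions made for this sketch and do not come from the original text.

```python
from collections import Counter

def frequent_1_itemsets(transactions, min_support):
    """Return {item: support} for every item whose support >= min_support."""
    n = len(transactions)
    # C1: for every item, count the transactions that contain it
    # (set(t) ensures an item is counted once per transaction).
    counts = Counter(item for t in transactions for item in set(t))
    # Frequent 1-item set: drop the items that miss the support requirement.
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

# Hypothetical transactions: one disease label plus one population label each.
transactions = [
    ["J", "F_60+"],
    ["J", "M_60+"],
    ["I", "F_60+"],
    ["J", "F_60+"],
]
print(frequent_1_itemsets(transactions, min_support=0.5))
```

With these four sample transactions, only the labels that appear in at least half of the transactions survive the filter.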

The following recursion operations are performed:

Style and data structure of visual report

1. Visual report based on disease category distribution map

The rectangular tree diagram shows the number of cases of each disease: the larger the area of a region, the more cases of that disease there are. The main purpose of the rectangular tree diagram is to make the overall situation clearly visible in a single diagram; the size of each rectangle is determined by the size of the component it represents, and the nesting of the rectangles provides grouping.

The specific drawing method is as follows: first, the share of each third-level disease in the total is calculated from its number of cases, and the area of each third-level disease on the rectangle is then determined from that share. Once the rectangular areas of all third-level diseases are determined, the areas of the second-level and first-level diseases are determined as well. FIG. 2 is an illustration of a disease category distribution map.
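The area calculation can be illustrated with a short Python sketch. It assumes that the third-level counts are keyed by a (first level, second level, third level) tuple; the function name, the code ranges used as level labels and the case counts are all hypothetical. The sketch computes only the area shares and leaves the layout of the rectangles to the drawing component.

```python
from collections import defaultdict

def treemap_shares(ds3_counts):
    """ds3_counts maps (ds1, ds2, ds3) -> number of cases.

    Returns three dicts holding the fraction of the whole rectangle
    occupied by each third-, second- and first-level disease.
    """
    total = sum(ds3_counts.values())
    # Third level: share of the total number of cases.
    ds3_share = {key: n / total for key, n in ds3_counts.items()}
    # Second and first levels: sum of the shares of their children.
    ds2_share = defaultdict(float)
    ds1_share = defaultdict(float)
    for (ds1, ds2, _ds3), share in ds3_share.items():
        ds2_share[(ds1, ds2)] += share
        ds1_share[ds1] += share
    return ds3_share, dict(ds2_share), dict(ds1_share)

# Hypothetical case counts, for illustration only.
counts = {
    ("J00-J99", "J00-J06", "J00"): 30,
    ("J00-J99", "J00-J06", "J02"): 10,
    ("J00-J99", "J09-J18", "J18"): 20,
    ("I00-I99", "I10-I15", "I10"): 40,
}
ds3_share, ds2_share, ds1_share = treemap_shares(counts)
for name, share in ds1_share.items():
    print(name, round(share, 3))
```

Because the upper levels are sums of their children, fixing the third-level areas fixes every other area, as stated above.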

Disease data is classified into three levels according to the ICD10 catalog. First-level diseases are presented as areas of different colors, as illustrated in FIG. 2. Second-level and third-level diseases are both represented by subdivided regions within the first-level region. Clicking on any first-level area focuses on that level to show its information in detail. For example, after the respiratory disease category is clicked, further information on this category is presented, as illustrated in FIG. 3.

2. Network relation graph of disease population

Association rule mining is an important topic in data mining. As the name suggests, it discovers possible associations or connections between things that are hidden in the data. For example, by examining what customers buy in a shopping mall, it may be found that 30% of the customers buy both bed sheets and pillowcases, and that 80% of the customers who buy bed sheets also buy pillowcases. This reveals a hidden association between bed sheets and pillowcases: a large share of the customers who buy bed sheets buy pillowcases at the same time, so the mall can place bed sheets and pillowcases in the same shopping area to make shopping more convenient.
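In association rule terminology, the 30% figure in this example is the support of the item pair and the 80% figure is the confidence of the rule. The following Python sketch reproduces both figures on a made-up set of 40 baskets; the basket contents and the function names are assumptions chosen only so that the numbers match the example.

```python
def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in baskets)

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    return count_containing(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Fraction of the baskets containing `antecedent` that also contain `consequent`."""
    both = count_containing(baskets, set(antecedent) | set(consequent))
    return both / count_containing(baskets, antecedent)

# Hypothetical data: 12 baskets with both items, 3 with a sheet only, 25 with neither.
baskets = [["sheet", "pillowcase"]] * 12 + [["sheet"]] * 3 + [["towel"]] * 25

print(support(baskets, ["sheet", "pillowcase"]))       # 12 of 40 baskets
print(confidence(baskets, ["sheet"], ["pillowcase"]))  # 12 of the 15 sheet baskets
```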

Specifically, the invention uses association rule mining to discover the intrinsic relationship between disease categories and the attributes of susceptible populations. The specific method comprises the following steps:

Finally, the relationship graph is rendered from the Gexf data. Red denotes a disease category and dark blue denotes a population attribute. The population attributes are grouped by age group and gender, and the disease categories are classified by the first-level catalog of ICD10. After the weight of the relationship between a population group and a disease category is calculated, the weight value FP is displayed on the connecting edge. A higher weight value indicates that the population group is more susceptible to the disease: since the FP value is mined from the frequency relationship between the population attribute PG and the disease code DS1, a high FP value means that the combination of that population attribute and that disease occurs frequently in the data. FIG. 4 illustrates the network of intrinsic relationships between disease categories and attributes of susceptible populations.
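A minimal Python sketch of filling such Gexf data is given below for illustration. It makes several assumptions that are not stated in the original: FP is taken to be the relative frequency of each (DS1, PG) pair, the node colors are written with the GEXF 1.2 `viz:color` element, and the function name, sample labels and number formatting are hypothetical. Only the Python standard library is used.

```python
from collections import Counter
import xml.etree.ElementTree as ET

RED, DARK_BLUE = (255, 0, 0), (0, 0, 139)

def build_gexf(pairs, top_n=80):
    """pairs: list of (DS1, PG) records. Returns a GEXF document as a string."""
    total = len(pairs)
    # Keep the top_n most frequent (DS1, PG) combinations.
    top = Counter(pairs).most_common(top_n)

    root = ET.Element("gexf", {
        "xmlns": "http://www.gexf.net/1.2draft",
        "xmlns:viz": "http://www.gexf.net/1.2draft/viz",
        "version": "1.2",
    })
    graph = ET.SubElement(root, "graph", {"defaultedgetype": "undirected"})
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")

    seen = set()

    def add_node(node_id, color):
        # Each disease category or population group becomes one node.
        if node_id in seen:
            return
        seen.add(node_id)
        node = ET.SubElement(nodes, "node", {"id": node_id, "label": node_id})
        r, g, b = color
        ET.SubElement(node, "viz:color", {"r": str(r), "g": str(g), "b": str(b)})

    for edge_id, ((ds1, pg), n) in enumerate(top):
        add_node(ds1, RED)        # disease category
        add_node(pg, DARK_BLUE)   # population attribute
        fp = f"{n / total:.3f}"   # assumed definition of the weight FP
        ET.SubElement(edges, "edge", {
            "id": str(edge_id), "source": ds1, "target": pg,
            "weight": fp, "label": fp,
        })
    return ET.tostring(root, encoding="unicode")

# Hypothetical records: (first-level disease code, population group code).
records = [("J", "F_60+")] * 3 + [("J", "M_60+")] + [("I", "F_60+")]
print(build_gexf(records))
```

The resulting string can be saved to a `.gexf` file and opened in a graph viewer that supports the format; the edge label carries the FP value so that it can be displayed on the connection.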

Claims

1. A method for visual reporting based on medical documents, characterized by comprising the following steps:

1) collecting data of medical documents;

2) dividing the data of the medical document into disease data and patient data;

3) analyzing the disease category data using a clustering algorithm, and then presenting the analysis result as a disease category distribution map;

4) analyzing the disease population data using a population attribute label algorithm and an association rule mining algorithm, and then presenting the analysis result as a network relationship graph of the disease population;

the ICD10 medical directory is used as a tree structure directory for disease category data analysis, and then a clustering algorithm is performed on a directory tree for specific diseases;

the disease population data is analyzed by performing association rule mining on the disease data and the patient data, the association rule mining being carried out with the Apriori algorithm;

the method for analyzing the disease category data specifically comprises the following steps: obtaining the disease data from the disease names in the prescriptions and diagnosis certificates of the medical documents; using the ICD10 medical directory as a tree structure directory, and then performing the clustering algorithm on the directory tree for specific diseases, the specific clustering algorithm process being:

A) organizing the ICD10 directory as relational data, wherein the ICD10 directory is divided into three levels: DS1, DS2 and DS3;

B) locating the specific DS3 disease record by a similarity search with error correction, wherein the specific search method is to traverse the diseases on the document and calculate the edit distance between each disease name on the document and the DS3-level diseases;

C) for DS3, recording the number of patients;

D) summing all DS3-level counts into the DS2 level, and summing all DS2-level data into the DS1 level;

E) finally, summarizing the number of occurrences and the number of patients of each disease according to the tree structure.

2. The method of claim 1, wherein the specific algorithm in B) is as follows:

B1) if the length of str1 or str2 is 0, returning the length of the other string;

B2) initializing a matrix d of size (n+1) × (m+1), where n and m are the lengths of str1 and str2, and letting the values of the first row and the first column increase from 0; scanning the two character strings (n × m comparisons): if str1[i] == str2[j], temp is recorded as 0; otherwise temp is recorded as 1; then, d[i, j] is assigned the minimum of d[i-1, j] + 1, d[i, j-1] + 1 and d[i-1, j-1] + temp;

B3) after scanning, returning the last value d[n][m] of the matrix, which is the edit distance between the two strings;

B4) calculating the distance against all DS3-level diseases; an entry whose distance is 0 or below a threshold is a hit, and the disease on the document is regarded as that DS3 disease.
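For illustration only, steps B1) to B4) can be sketched in Python as follows. The matrix is indexed from 1 while the strings are indexed from 0, so the character comparison uses `str1[i - 1]` and `str2[j - 1]`. The `match_ds3` helper, its default threshold and the sample disease names are assumptions made for this sketch.

```python
def edit_distance(str1, str2):
    n, m = len(str1), len(str2)
    # B1) an empty string is at distance len(other) from the other string.
    if n == 0:
        return m
    if m == 0:
        return n
    # B2) (n+1) x (m+1) matrix whose first row and column grow from 0.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            temp = 0 if str1[i - 1] == str2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + temp)  # match or substitution
    # B3) the last cell holds the edit distance.
    return d[n][m]

def match_ds3(name, ds3_names, threshold=2):
    """B4) return the closest DS3 name if it is a hit, else None."""
    best = min(ds3_names, key=lambda candidate: edit_distance(name, candidate))
    dist = edit_distance(name, best)
    return best if dist == 0 or dist < threshold else None

# Hypothetical DS3 names and a misspelled name taken from a document.
print(match_ds3("pnemonia", ["pneumonia", "asthma"]))
```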

3. The method of claim 1, wherein the data from the disease population is analyzed as follows:

the sources of data include: first, the tree structure of each disease, obtained by using the disease category data analysis method; second, the population attribute labels of the patient data, obtained from the age, the gender and the medical insurance card number of the patients in the medical documents, different user groups being formed according to the age and the gender of the patients;

association rule mining is performed on the disease data and the patient data by using the Apriori algorithm.

4. The method of claim 1, wherein the disease category distribution map uses a rectangular tree diagram to represent the number of cases of each disease, a larger region area indicating a larger number of cases.

5. The method of claim 1, wherein the disease category distribution map is drawn as follows: first, the share of each third-level disease in the total is calculated from its number of cases, and the area of each third-level disease on the rectangle is then determined from that share; the disease data is divided into three levels according to the ICD10 catalog, and first-level diseases are presented as areas of different colors; second-level and third-level diseases are both represented by subdivided regions within the first-level region; clicking on any first-level area focuses on that level to show its information in detail.

6. The method of claim 1, wherein the network relationship graph of the disease population is generated by the following method:

firstly, for each disease record, determining the disease category DS1 and the group code PG of the population attribute of the patient, and constructing a one-dimensional array [DS1, PG];

thirdly, performing association rule mining on the high-dimensional array to finally obtain the frequency weight values FP of the different combinations of DS1 and PG; from the mined high-frequency relations, taking the 80 results with the highest frequency and filling them into Gexf format data; filling DS3 and PG as nodes of Gexf, and filling the corresponding FP values as edges;

finally, rendering the relationship graph from the Gexf data; wherein the disease categories and the population attributes are represented by different colors; the population attributes are grouped according to age group and gender; the disease categories are classified according to the first-level catalog of ICD10; and after the weight of the relationship between a population group and a disease category is calculated, the weight value FP is displayed on the connecting edge.

7. A system for visual reporting based on big data analysis of medical documents according to the method of any of claims 1 to 6, characterized in that it essentially comprises the following modules:

3) a visual report module, which presents the analysis results using the disease category distribution map and the network relationship graph of the disease population, respectively.