CN118964363A - A data comprehensive analysis method, system, electronic device and storage medium

A data comprehensive analysis method, system, electronic device and storage medium

Info

Publication number
CN118964363A
Authority
CN
China
Prior art keywords
data
standard
occurrence
mining
stored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411451855.6A
Other languages
Chinese (zh)
Other versions
CN118964363B (en)
Inventor
周光迪
高琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo SDIC Technology Development Co.,Ltd.
Original Assignee
Ningbo Ziwan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Ziwan Technology Co ltd
Priority to CN202411451855.6A
Publication of CN118964363A
Application granted
Publication of CN118964363B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a comprehensive data analysis method, a system, an electronic device and a storage medium, and relates to the technical field of data processing. The method comprises the following steps: acquiring historical medical data of each data node, and preprocessing the historical medical data to obtain first standard data; recursively constructing the first standard data according to a preset construction method to obtain storage data and a storage data index; for each data node, constructing at least one data warehouse according to the stored data and the stored data index, and taking all the data warehouses as distributed data warehouses; acquiring actual data, and preprocessing the actual data to obtain second standard data; performing data mining on all stored data in the distributed data warehouse to obtain mining data; taking the mining data matched with the second standard data as matching data through a preset method; and determining the distinguishing points between the matching data and the actual data, which are used for in-depth mining of past cases to improve the teaching effect.

Description

Data comprehensive analysis method, system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and a system for comprehensive analysis of data, an electronic device, and a storage medium.
Background
In the current big data age, the medical field is also faced with management and analysis challenges for massive data. Traditional statistical analysis methods of medical data are often limited to processing from a single data source, resulting in limited comprehensiveness and accuracy of the analysis results. In addition, with the popularity of online medical and intelligent systems, the speed and diversity of data generation has grown exponentially, making data processing and storage a major challenge.
Under the traditional teacher-student teaching mode, the available teaching cases depend on the teacher's accumulated experience, so cases are often insufficient. Most existing data processing flows concentrate on data cleaning and simple statistical analysis; they offer little on how to integrate historical medical data with real-time data efficiently, and they provide insufficient deep insight and personalized guidance for a case-based teaching process. Without efficient data organization and fast retrieval mechanisms, it is difficult to support the fast response and flexible analysis that large-scale data require.
Disclosure of Invention
The invention solves the problem of how to analyze medical data so as to improve the teaching effect.
In order to solve the above problems, the present invention provides a method for comprehensively analyzing data, including:
acquiring historical medical data of each data node, and preprocessing the historical medical data to obtain first standard data;
recursively constructing the first standard data according to a preset construction method to obtain storage data and a storage data index;
For each data node, constructing at least one data warehouse according to the stored data and the stored data index, and taking all the data warehouses as distributed data warehouses;
acquiring actual data, and preprocessing the actual data to acquire second standard data;
Performing data mining on all the stored data in the distributed data warehouse to obtain mining data;
Taking the mining data matched with the second standard data as matching data through a preset method;
And determining a distinguishing point from the actual data according to the matching data.
Optionally, the acquiring the historical medical data of each data node, preprocessing the historical medical data, and obtaining the first standard data includes:
carrying out data cleaning and standardization processing on the historical medical data, and converting the historical medical data into a first data vector;
And taking the set of the first data vectors as the first standard data.
Optionally, the recursively constructing the first standard data according to a preset construction method to obtain storage data and a storage data index includes:
randomly selecting a dimension and a segmentation point from the first standard data;
dividing the first data vector into two primary groups according to the dimension and the division point to form a first node, wherein the primary groups are direct child nodes of the first node;
randomly selecting a new dimension and a new segmentation point in each primary group under the first node, and further segmenting each primary group into two secondary groups;
Selecting a new dimension and a new segmentation point, segmenting the secondary group until a termination condition is met, completing the recursion construction, taking the first standard data after the recursion construction as the storage data, and taking a structure formed by all groups and all nodes obtained by the recursion construction as the storage data index, wherein the termination condition comprises: only one of the first data vectors in each n-level group, or, reaching a preset number of divisions.
Optionally, the data mining of all the stored data in the distributed data warehouse includes:
retrieving, from the stored data index, an approximate nearest neighbor to the second standard data in the distributed data warehouse, wherein if there are a plurality of the data warehouses and a similarity of the type of the historical medical data in the plurality of the data warehouses to the type of the second standard data exceeds a first similarity threshold, then searching in parallel for an approximate nearest neighbor to the second standard data in the plurality of the data warehouses;
And taking the approximate nearest neighbor as the mining data.
Optionally, the step of using the mined data matched with the second standard data as the matching data by a preset method includes:
Constructing a co-occurrence matrix by a co-occurrence analysis method, wherein the co-occurrence matrix is used for recording the co-occurrence times of the mining data and the second standard data;
Determining a first co-occurrence frequency and a first co-occurrence similarity of the mined data and the second standard data in the co-occurrence matrix;
And taking the mining data with the first co-occurrence frequency higher than a frequency threshold and the first co-occurrence similarity higher than a second similarity threshold as the matching data.
Optionally, after the co-occurrence matrix is constructed by co-occurrence analysis, the method further includes:
determining a second co-occurrence frequency and a second co-occurrence similarity of all the stored data and the second standard data in the co-occurrence matrix;
taking the stored data, for which the second co-occurrence frequency is higher than the frequency threshold and the second co-occurrence similarity is higher than the second similarity threshold and is not present in the matching data, as optimization data;
And optimizing the structure of the co-occurrence matrix based on the second co-occurrence frequency and the second co-occurrence similarity corresponding to the optimization data.
Optionally, the determining the distinguishing point from the actual data according to the matching data includes:
Analyzing the matching data to obtain multi-dimensional data contained in the matching data, wherein the multi-dimensional data comprises user codes, doctor codes, evaluation results and target completion degrees;
obtaining factor similarity among each dimension of the multi-dimensional data through a pearson correlation coefficient matrix;
taking the multi-dimensional data with the factor similarity larger than a third similarity threshold value as target data;
Determining a factor and a characteristic value of the target data by a minimum residual error method, and reserving the factor with the characteristic value larger than 1 as a target factor;
Rotating the target factors to obtain a factor load matrix;
Determining multidimensional data lower than a first preset score in the second standard data as low-score data according to the factor load matrix;
Extracting multidimensional data higher than a second preset score from the matching data as high-score data;
And optimizing the low score data according to the high score data until the score of the low score data is higher than a third preset score, wherein the first preset score is smaller than or equal to the third preset score, and the third preset score is smaller than or equal to the second preset score.
In a second aspect, the present invention also provides a comprehensive data analysis system, including:
the first acquisition module is used for acquiring the historical medical data of each data node, preprocessing the historical medical data and acquiring first standard data;
The first construction module is used for recursively constructing the first standard data according to a preset construction method to obtain storage data and a storage data index;
A second construction module, configured to construct, for each of the data nodes, at least one data warehouse according to the stored data and the stored data index, and use all the data warehouses as distributed data warehouses;
The second acquisition module is used for acquiring actual data, preprocessing the actual data and acquiring second standard data;
The mining module is used for carrying out data mining on all the stored data in the distributed data warehouse to obtain mining data;
The matching module is used for taking the mining data matched with the second standard data as matching data through a preset method;
and the optimizing module is used for determining a distinguishing point from the actual data according to the matching data.
In a third aspect, the present invention also provides an electronic device, including a memory and a processor;
the memory is used for storing a computer program;
The processor is configured to implement the comprehensive data analysis method as described above when executing the computer program.
In a fourth aspect, the present invention also provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the comprehensive data analysis method as described above.
Compared with the prior art, the invention takes both the storage and the retrieval of data into account. The historical medical data are stored in a distributed storage system and preprocessed to obtain the first standard data, which serve as the basis for optimizing the medical data. For distributed storage of large volumes of medical data, the first standard data are recursively constructed to obtain the storage data and the storage data index, and a plurality of data warehouses are constructed according to the number of data nodes, so that the advantages of distributed storage are used effectively and retrieval efficiency is improved. Data mining is performed on the data in the data warehouses, the mined data are matched against the second standard data to obtain the matching data, and the actual data are optimized according to the matching data. Even when the medical data are distributed across different storage locations, the data index allows the matching data to be retrieved quickly. The distinguishing points between the matching data and the actual data are then identified, and past data are mined and analyzed to improve the quality of teaching for inexperienced users.
Drawings
FIG. 1 is a flow chart of the comprehensive data analysis method according to an embodiment of the present invention;
FIG. 2 is a flow chart of the comprehensive data analysis method according to the embodiment of the present invention after the refinement of step S100;
FIG. 3 is a flow chart of the comprehensive data analysis method according to the embodiment of the present invention after the refinement of step S600.
Detailed Description
In order that the above objects, features and advantages of the invention may be readily understood, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. Although certain embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments"; the term "optionally" means "alternative embodiments". Related definitions of other terms will be given in the description below. It should be noted that the concepts of "first", "second", etc. mentioned in this disclosure are only used to distinguish between different devices, modules or units, and are not intended to limit the order or interdependence of functions performed by these devices, modules or units.
It should be noted that the modifiers "a" and "an" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
An embodiment of the present invention provides a method for comprehensively analyzing data, including:
Step S100, acquiring historical medical data of each data node, and preprocessing the historical medical data to obtain first standard data.
In an embodiment, a data node is a component responsible for storing actual data; for example, when historical medical data are stored separately in a plurality of hospitals or departments, each hospital or department serves as a data node. The historical medical data are stored on a storage medium and comprise recordable past teaching data, such as audio data, video data and text data, and specifically diagnostic videos, diagnostic audio, teaching courseware, teaching lectures, teaching material contents and the like. The teaching task and the diagnosis history are used as storage indexes for the historical medical data. For example, all public data dated January 1, 2000, such as diagnostic videos, diagnostic audio, teaching courseware, teaching lectures and teaching material contents, are taken as one piece of historical medical data. The historical medical data serve as a database for teaching: case analysis retrieves the cases that match the current actual data, and the distinguishing points between the historical medical data and the current actual data are analyzed, giving teachers and students more case guidance.
Preprocessing such as data screening, de-duplication and noise reduction is performed on the historical medical data to obtain the first standard data. For example: unrecognizable teaching courseware and teaching lecture data are deleted; when audio data and video data have repeated content, the repeated audio data are deleted; noisy audio data and blurred image or video data are denoised. The resulting first standard data are the standardized data stored in the data warehouse.
Step S200, recursively constructing the first standard data according to a preset construction method, to obtain storage data and a storage data index.
In one embodiment, the historical medical data come from a wide range of sources, such as various units or other types of nodes, and some nodes may not be able to disclose their historical medical data. The data therefore need to be stored, and a data index for retrieval constructed, under each node. For historical medical data stored in a distributed manner, improving retrieval efficiency is of great significance, and constructing a suitable data index can improve it considerably.
Step S300, for each data node, constructing at least one data warehouse according to the stored data and the stored data index, and taking all the data warehouses as distributed data warehouses.
In one embodiment, some data nodes have larger data size, and the stored data needs to be stored separately to improve the retrieval efficiency, so that at least one data warehouse is constructed for each data node, and all data warehouses are used as distributed data warehouses.
Step S400, obtaining actual data, and preprocessing the actual data to obtain second standard data.
The actual data are acquired and standardized to obtain second standard data whose format is consistent with that of the stored data. The second standard data are used to help compare the differences between the actual data and the historical medical data, and the actual data are optimized according to those differences, thereby realizing a teaching optimization method based on data analysis.
Step S500, performing data mining on all the stored data in the distributed data warehouse to obtain mining data.
And analyzing and mining the stored data in the distributed data warehouse to determine the characteristics of the stored data. After the actual data is obtained, the data matched with the actual data can be quickly searched and determined according to the data index and the characteristics of the stored data.
Step S600, taking the mining data matched with the second standard data as matching data through a preset method.
Step S700, determining a distinguishing point from the actual data according to the matching data.
After the matching data that match the second standard data are determined from the distributed data warehouse, the shortcomings of the actual data are optimized according to the strengths of the matching data, forming an effective optimization method and improving teaching quality. In the traditional teacher-student teaching mode, teaching effects and teaching examples depend on the teacher's accumulated experience. With the data processing method of the embodiment of the invention, historical medical data similar to the current case can be mined from the database and taken as the matching data, and the distinguishing points and similar points between the historical medical data and the actual data are extracted for teaching, so that the cases are richer and teaching quality is effectively improved.
Optionally, as shown in fig. 2, the obtaining the historical medical data of each data node, and preprocessing the historical medical data, obtaining the first standard data includes:
Step S110, data cleaning and normalization processing are carried out on the historical medical data, and the historical medical data are converted into first data vectors.
And step S120, taking the set of the first data vectors as the first standard data.
In one embodiment, published historical medical data are first collected from sources such as information systems and online platforms, including but not limited to basic information (e.g., age, personal base). The data exist in structured forms (e.g., database tables) and unstructured forms (e.g., text, image material). Cleaning the data includes:
De-duplication: repeated records are identified and removed, ensuring that each observation is unique.
Missing value processing: missing data are handled with an appropriate strategy, such as filling numerical data with the mean or median, or filling categorical data with a value reasonably inferred from context.
Outlier detection and processing: outliers are identified by statistical analysis (e.g., box plot analysis) and may be removed, corrected, or smoothed using more complex statistical methods (e.g., winsorization).
Consistency check: data field formats are made consistent, for example by normalizing date formats, unifying the case of text data, and removing extraneous symbols.
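The cleaning steps above can be sketched as follows. This is a minimal illustration in standard-library Python, not the patented implementation: the record layout (dictionaries with `age` and `score` fields), the function names, and the nearest-rank quantile rule used for winsorization are all assumptions made for the example.

```python
from statistics import median

def deduplicate(records):
    """Drop exact duplicate records while keeping first-seen order."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def fill_missing(records, field):
    """Replace None in a numeric field with the median of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clamp values to empirical quantiles (simple nearest-rank rule, an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

# Hypothetical records; field names are illustrative only.
records = [
    {"age": 34, "score": 0.8},
    {"age": 34, "score": 0.8},    # exact duplicate
    {"age": None, "score": 0.6},  # missing age
    {"age": 50, "score": 0.9},
]
cleaned = fill_missing(deduplicate(records), "age")
clamped = winsorize([1, 2, 3, 4, 100], lower_q=0.25, upper_q=0.75)
```

Median filling is used here because it is less sensitive to outliers than the mean; the text allows either.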
After the data are cleaned, they are standardized to remove the influence of differing units and scales, so that different features are comparable and subsequent analysis is easier. Common standardization methods are:
Min-max normalization: all feature values are mapped into the [0, 1] interval.
Z-score normalization: each feature value has the feature's mean subtracted and is divided by its standard deviation, so that the feature has zero mean and unit variance.
For the historical medical data mentioned above, a suitable normalization method is selected according to the nature of each feature, and each feature is processed independently.
After normalization, each data record can be converted into a fixed-length data vector in which each element is the normalized value of one feature. For example, if the teaching data contain three features, each record is transformed into a data vector such as [0.65, 0.92, 0.78].
The data vectors of all individuals are collected into a set, namely the "first standard data". This set is the basis of subsequent data analysis: it ensures that all data are processed on a unified scale and in a unified format, which makes them easier for algorithms to process and reduces deviation caused by data inconsistency.
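As an illustration of the normalization and vectorization described above, the following sketch scales each feature column independently and then reassembles the rows into fixed-length vectors. The function names and the sample rows are hypothetical; the population standard deviation is used for the Z-score, which is one of several reasonable choices.

```python
from statistics import mean, pstdev

def min_max(column):
    """Map a feature column into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def z_score(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

def to_vectors(rows, scalers):
    """Scale each feature column with its own scaler, then rebuild row vectors."""
    columns = list(zip(*rows))
    scaled = [scale(list(col)) for scale, col in zip(scalers, columns)]
    return [list(vec) for vec in zip(*scaled)]

# Three hypothetical records with three features each.
rows = [(20, 1.0, 3), (30, 2.0, 5), (40, 3.0, 7)]
first_standard_data = to_vectors(rows, [min_max, min_max, z_score])
```

Passing one scaler per column mirrors the statement that each feature is processed independently with a method suited to it.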
Optionally, the recursively constructing the first standard data according to a preset construction method to obtain storage data and a storage data index includes:
randomly selecting a dimension and a segmentation point from the first standard data;
dividing the first data vector into two primary groups according to the dimension and the division point to form a first node, wherein the primary groups are direct child nodes of the first node;
randomly selecting a new dimension and a new segmentation point in each primary group under the first node, and further segmenting each primary group into two secondary groups;
Selecting a new dimension and a new segmentation point, segmenting the secondary group until a termination condition is met, completing the recursion construction, taking the first standard data after the recursion construction as the storage data, and taking a structure formed by all groups and all nodes obtained by the recursion construction as the storage data index, wherein the termination condition comprises: only one of the first data vectors in each n-level group, or, reaching a preset number of divisions.
Preset parameters of the recursively constructed decision tree are determined, including but not limited to the maximum recursion depth (the preset number of divisions) and whether pruning is to be performed. These parameters are set according to the characteristics of the historical medical data and the analysis targets, to avoid over-fitting or under-fitting. A feature dimension (attribute) and a particular value in that dimension are randomly selected from the first standard data set as the segmentation point. A segmentation point would typically be chosen as the value that maximizes the purity gain after the data set is split, but here it is chosen at random for simplicity of illustration. According to the selected dimension and segmentation point, the data vectors in the first standard data set are divided into two subsets, i.e. two primary groups (n=1). These two subsets are the direct child nodes of the current node (i.e., the first node). The process is repeated for each primary group under the first node: a new dimension is randomly selected among the remaining unused features, and a segmentation point is selected in this dimension to split the primary group into two secondary groups (n=2). The recursion continues in the same way, selecting a dimension and a segmentation point within each current subset, until a termination condition is reached. The termination conditions are as follows. Each group contains only one data vector: no further effective segmentation is possible, because there are no more data to distinguish. The preset number of divisions is reached: this prevents over-subdivision, keeps the depth of the tree within a reasonable range, and improves generalization capability. Once any subgroup satisfies a termination condition, it is not subdivided further and becomes a leaf node.
The leaf nodes represent the final decision region in which the data vectors contained have similar characteristic properties. Throughout the recursive construction process, each node (including internal nodes and leaf nodes) should store its corresponding partition dimension, partition point information, and pointers (or indexes) to its child nodes. In addition, the leaf node also needs to store the index or direct data of all the first data vectors it contains for subsequent query and interpretation. The decision tree can be used for tasks such as classification, regression and the like, and helps analyze patterns and rules in historical medical data.
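The recursive construction described above can be sketched as a random-split tree. The code below is a simplified illustration under several assumptions: the preset number of divisions is interpreted as a maximum depth (`max_splits`), the dimension is drawn from all features rather than only unused ones, and a group that cannot be separated by the drawn split is kept as a leaf. The dictionary node layout and the function names are hypothetical.

```python
import random

def build_index(vectors, ids=None, depth=0, max_splits=16, rng=None):
    """Recursively split vector ids by a random dimension and a random split point."""
    if rng is None:
        rng = random.Random(0)
    if ids is None:
        ids = list(range(len(vectors)))
    # Termination: a single vector in the group, or the preset split count reached.
    if len(ids) <= 1 or depth >= max_splits:
        return {"ids": ids}
    dim = rng.randrange(len(vectors[0]))
    values = [vectors[i][dim] for i in ids]
    split = rng.uniform(min(values), max(values))
    left = [i for i in ids if vectors[i][dim] < split]
    right = [i for i in ids if vectors[i][dim] >= split]
    if not left or not right:  # the drawn split separates nothing: keep as a leaf
        return {"ids": ids}
    return {
        "dim": dim,      # partition dimension stored on the internal node
        "split": split,  # partition point stored on the internal node
        "left": build_index(vectors, left, depth + 1, max_splits, rng),
        "right": build_index(vectors, right, depth + 1, max_splits, rng),
    }

def find_leaf(node, query):
    """Descend to the leaf whose region contains the query; return its vector ids."""
    while "ids" not in node:
        node = node["left"] if query[node["dim"]] < node["split"] else node["right"]
    return node["ids"]

def all_leaf_ids(node):
    """Collect the ids stored in every leaf (used here to check the partition)."""
    if "ids" in node:
        return list(node["ids"])
    return all_leaf_ids(node["left"]) + all_leaf_ids(node["right"])

vectors = [[0.65, 0.92, 0.78], [0.10, 0.40, 0.33],
           [0.70, 0.88, 0.81], [0.05, 0.35, 0.30]]
index = build_index(vectors, rng=random.Random(42))
candidates = find_leaf(index, [0.66, 0.90, 0.80])
```

Here `vectors` plays the role of the storage data and `index` the storage data index; `find_leaf` returns the candidate group whose members would then be compared with the query.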
Optionally, recursively constructing the first standard data according to a preset construction method, and obtaining the storage data and the storage data index includes:
and for each data node, carrying out the recursion construction on the first standard data according to the preset construction method to obtain the stored data index corresponding to the data node.
In one embodiment, the data nodes represent each entity in the distributed data system that stores data, which is scattered across different servers or devices of the system. This distributed storage improves the accessibility of the data and the overall fault tolerance of the system.
In a distributed storage system, there are multiple data nodes, each with its own stored data index.
Optionally, the data mining of all the stored data in the distributed data warehouse includes:
retrieving, from the stored data index, an approximate nearest neighbor to the second standard data in the distributed data warehouse, wherein if there are a plurality of the data warehouses and a similarity of the type of the historical medical data in the plurality of the data warehouses to the type of the second standard data exceeds a first similarity threshold, then searching in parallel for an approximate nearest neighbor to the second standard data in the plurality of the data warehouses;
And taking the approximate nearest neighbor as the mining data.
Because the stored data are high-dimensional (for example, image and text vectors), an index is built on each node using a distributed approximate nearest neighbor algorithm for the data that need to be retrieved. This requires the data to be distributed across different nodes, each of which is responsible for index building and query processing for its portion of the data, which improves efficiency and scalability. In an embodiment of the present invention, exceeding the first similarity threshold indicates that the type of the historical medical data and the type of the second standard data are highly similar. For example, the semantics of the historical medical data and of the second standard data are computed, and when the similarity of the types represented by those semantics exceeds the first similarity threshold, the approximate nearest neighbors of the second standard data are searched for in the plurality of data warehouses in parallel. In other embodiments, the similarity may instead be the similarity between the data types of the historical medical data and the second standard data. The first similarity threshold may be set according to data type or semantic type.
The constructed stored data index is used to quickly search the historical medical data for the approximate nearest neighbor of the current data, that is, the approximate nearest neighbor of the second standard data in the stored data. This step may be parallelized when similar stored data exist in multiple storage nodes, with each node responsible for processing part of the query request. For example, if the data warehouse of a first unit and the data warehouse of a second unit both hold stored data similar to the second standard data, the node corresponding to the first unit and the node corresponding to the second unit are searched in parallel, which speeds up the search. The approximate nearest neighbor is taken as the mining data.
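The parallel search across data warehouses can be sketched as below. This is an illustration only: each warehouse is a hypothetical dictionary with a `name`, a `type` and its stored `vectors`; the per-warehouse search is an exact linear scan standing in for a lookup through the stored data index; and `type_similarity` is an assumed callable, since the text does not fix how type similarity is computed.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def nearest_in_warehouse(warehouse, query):
    """Return (distance, warehouse name, record id) of the closest stored vector.

    A linear scan is used for brevity; a real node would consult its index.
    """
    rec_id, vec = min(warehouse["vectors"].items(),
                      key=lambda kv: math.dist(kv[1], query))
    return (math.dist(vec, query), warehouse["name"], rec_id)

def mine(warehouses, query, query_type, type_similarity, first_threshold=0.8):
    """Search, in parallel, every warehouse whose data type is similar enough."""
    eligible = [w for w in warehouses
                if type_similarity(w["type"], query_type) > first_threshold]
    if not eligible:
        return None
    with ThreadPoolExecutor(max_workers=len(eligible)) as pool:
        results = list(pool.map(lambda w: nearest_in_warehouse(w, query), eligible))
    return min(results)  # smallest distance across all searched warehouses

def same_type(a, b):
    """Hypothetical type similarity: 1.0 for identical type labels, else 0.0."""
    return 1.0 if a == b else 0.0

warehouses = [
    {"name": "unit_a", "type": "text",
     "vectors": {"a1": [0.1, 0.2], "a2": [0.9, 0.8]}},
    {"name": "unit_b", "type": "text", "vectors": {"b1": [0.5, 0.5]}},
    {"name": "unit_c", "type": "image", "vectors": {"c1": [0.52, 0.5]}},
]
mined = mine(warehouses, [0.52, 0.5], "text", same_type)
```

In this example `unit_c` holds the closest vector but is skipped because its type does not pass the first similarity threshold, so the result comes from `unit_b`.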
Optionally, as shown in FIG. 3, the step of taking the mined data that matches the second standard data as the matching data through a preset method includes:
Step S610, constructing a co-occurrence matrix through co-occurrence analysis, wherein the co-occurrence matrix records the number of times the mined data and the second standard data co-occur.
Step S620, determining a first co-occurrence frequency and a first co-occurrence similarity of the mined data and the second standard data in the co-occurrence matrix.
Step S630, taking the mined data whose first co-occurrence frequency is higher than a frequency threshold and whose first co-occurrence similarity is higher than a second similarity threshold as the matching data.
The co-occurrence matrix is constructed using co-occurrence analysis. Its rows and columns represent different dimensions in the dataset, and each element represents the co-occurrence frequency or count of the corresponding pair of items. For example, if item A and item B occur together N times in the same context, the value at position (A, B) in the matrix is N. When the volume of mined data to be processed is large, sparse matrix storage may be employed to reduce space complexity.
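As a minimal sketch of the sparse storage mentioned above, the co-occurrence counts can be kept in a dictionary keyed by item pairs, so that pairs which never co-occur take no space. The contexts and item names below are hypothetical examples, not data from the embodiment.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(contexts):
    """Count how often each unordered pair of items appears in the same context."""
    counts = Counter()
    for items in contexts:
        # sorted(set(...)) gives each unordered pair one canonical key per context.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence(counts, a, b):
    """Value at position (A, B); pairs that were never seen read as 0."""
    return counts[tuple(sorted((a, b)))]


# Hypothetical contexts, e.g. the items recorded in three historical cases.
contexts = [
    ["fever", "cough", "antibiotic"],
    ["fever", "cough"],
    ["cough", "antibiotic"],
]
counts = build_cooccurrence(contexts)
```

Only the pairs that actually co-occur are stored, which is the property that makes this layout suitable when the mined data is large and most pairs never appear together.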
The relationship between the mined data and the second standard data is then quantified on the basis of the co-occurrence matrix. The first co-occurrence frequency, that is, the frequency at which the mined data and the second standard data co-occur, is read directly from the matrix. The first co-occurrence similarity may require further computation, for example measuring the strength of association between the two with the Jaccard similarity coefficient, cosine similarity, or the Pearson correlation coefficient. This step aims at a deeper understanding of the association patterns between different data items. Two key thresholds are set: a frequency threshold and a second similarity threshold. Each item of mined data in the co-occurrence matrix is evaluated, and it is marked as matching data only if its co-occurrence frequency with the second standard data exceeds the frequency threshold and its co-occurrence similarity exceeds the second similarity threshold. This step ensures that the selected matching data not only co-occurs frequently but is also highly correlated in terms of semantics or behavioral patterns.
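The two-threshold screening can be illustrated with the sketch below. It assumes, purely for illustration, that each item of mined data and the second standard data are represented by the set of contexts in which they appear, so that the co-occurrence frequency is the size of the intersection and the similarity is the Jaccard coefficient. The record names and threshold values are hypothetical.

```python
def jaccard(contexts_a, contexts_b):
    """Jaccard similarity coefficient of two context sets."""
    union = contexts_a | contexts_b
    return len(contexts_a & contexts_b) / len(union) if union else 0.0


def select_matching(mined, query_contexts, freq_threshold, sim_threshold):
    """Keep mined records that exceed BOTH the frequency and the similarity threshold."""
    matching = []
    for record_id, contexts in mined.items():
        frequency = len(contexts & query_contexts)    # first co-occurrence frequency
        similarity = jaccard(contexts, query_contexts)  # first co-occurrence similarity
        if frequency > freq_threshold and similarity > sim_threshold:
            matching.append(record_id)
    return matching


# Hypothetical context sets: the second standard data appears in contexts 1 to 4.
query_contexts = {1, 2, 3, 4}
mined = {
    "case_a": {1, 2, 3},                         # frequency 3, Jaccard 0.75
    "case_b": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},   # frequency 4, Jaccard 0.40
    "case_c": {4},                               # frequency 1, Jaccard 0.25
}
matching = select_matching(mined, query_contexts, freq_threshold=2, sim_threshold=0.5)
```

Here `case_b` co-occurs most often but is rejected because its similarity is low, which shows why the frequency threshold alone is not sufficient.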
Optionally, after the co-occurrence matrix is constructed by co-occurrence analysis, the method further includes:
Determining a second co-occurrence frequency and a second co-occurrence similarity of all the stored data and the second standard data in the co-occurrence matrix.
Taking the stored data that is not present in the matching data and whose second co-occurrence frequency is higher than the frequency threshold and whose second co-occurrence similarity is higher than the second similarity threshold as optimization data.
Optimizing the structure of the co-occurrence matrix based on the second co-occurrence frequency and the second co-occurrence similarity corresponding to the optimization data.
The relationship between all the stored data items and the second standard data is retrieved again on the basis of the co-occurrence matrix. The second co-occurrence frequency refers to the number of times the stored data co-occurs with the second standard data. It is used to retrieve similar data that was missed by the matching based on the mined data: because the screening settings for the mined data may be too strict, some stored data may have been overlooked. This missed stored data is defined as optimization data, and it may contain important, potentially relevant information. The co-occurrence similarity is a measure, calculated from the co-occurrence matrix, of the strength of the relationship between a pair of terms. When the association between the stored data and the second standard data is strong, their co-occurrence frequency and co-occurrence similarity are correspondingly high. Optionally, the co-occurrence similarity may be calculated with existing similarity algorithms, such as pointwise mutual information or cosine similarity.
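The selection of optimization data can be sketched as below, together with pointwise mutual information (PMI) as one existing way to compute a co-occurrence similarity from raw counts. The record names, counts and thresholds are hypothetical, and the (frequency, similarity) pairs are assumed to have been read from the co-occurrence matrix beforehand.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    if count_xy == 0:
        return float("-inf")  # the pair was never observed together
    return math.log2((count_xy * total) / (count_x * count_y))


def select_optimization(stored_stats, matching_ids, freq_threshold, sim_threshold):
    """Stored records missed by the matching step that still exceed both thresholds."""
    return [
        record_id
        for record_id, (frequency, similarity) in stored_stats.items()
        if record_id not in matching_ids
        and frequency > freq_threshold
        and similarity > sim_threshold
    ]


# Hypothetical (second co-occurrence frequency, second co-occurrence similarity) pairs.
stored_stats = {
    "case_a": (5, 0.9),  # already matching data, so it is skipped
    "case_d": (4, 0.7),  # missed earlier, passes both thresholds
    "case_e": (1, 0.8),  # similar but too infrequent
}
optimization = select_optimization(stored_stats, {"case_a"}, freq_threshold=2, sim_threshold=0.5)
score = pmi(count_xy=4, count_x=8, count_y=4, total=16)
```

A positive PMI means the pair co-occurs more often than it would if the two items were independent; zero means no association beyond chance.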
The process is designed as an iterative loop: the performance of the matrix is re-evaluated after each optimization, and the optimization strategy is adjusted according to the feedback until a satisfactory analysis result is achieved.
Through the above steps, optimizing the structure of the co-occurrence matrix improves the efficiency and accuracy of subsequent data analysis and lays a foundation for insight into the complex connections behind the data.
Optionally, determining the distinguishing point between the matching data and the actual data includes:
Analyzing the matching data to obtain multi-dimensional data contained in the matching data, wherein the multi-dimensional data comprises user codes, doctor codes, evaluation results and target completion degrees;
Obtaining the factor similarity between the dimensions of the multi-dimensional data through a Pearson correlation coefficient matrix;
Taking the multi-dimensional data whose factor similarity is greater than a third similarity threshold as target data;
Determining the factors and eigenvalues of the target data by the minimum residual method, and retaining the factors whose eigenvalue is greater than 1 as target factors;
Rotating the target factors to obtain a factor load matrix;
Determining, according to the factor load matrix, the multi-dimensional data in the second standard data that scores lower than a first preset score as low-score data;
Extracting the multi-dimensional data in the matching data that scores higher than a second preset score as high-score data;
Optimizing the low-score data according to the high-score data until the score of the low-score data is higher than a third preset score, wherein the first preset score is less than or equal to the third preset score, and the third preset score is less than or equal to the second preset score.
The evaluation result includes an objective evaluation of the diagnosis process, filled in manually; the target completion degree includes a prognostic evaluation of the diagnosed person.
In one embodiment, the correlation between the dimensions of the matching data is analyzed by factor analysis, and the multi-dimensional data in the actual data is then optimized according to the correlation pattern. The matching data is analyzed first: the factor similarity of its multi-dimensional data is calculated from the Pearson correlation coefficient matrix, and the multi-dimensional data whose similarity is greater than the third similarity threshold is taken as target data. Such data is correlated to some degree with the data of other dimensions, whereas multi-dimensional data that is uncorrelated with the other dimensions contributes little to the optimization. The factors and eigenvalues of the target data are then determined by the minimum residual method, and the factors whose eigenvalue is greater than 1 are taken as target factors, so as to optimize the number of factors.
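The selection of target data from the Pearson correlation coefficient matrix can be sketched as follows. The rule used here, keeping a dimension when its strongest absolute correlation with any other dimension exceeds the third similarity threshold, is one possible reading of the step rather than a rule stated in the embodiment, and the column values are hypothetical numeric encodings.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


def select_target_dimensions(columns, threshold):
    """Keep dimensions whose strongest |correlation| with another dimension exceeds threshold."""
    names = list(columns)
    targets = []
    for name in names:
        strongest = max(abs(pearson(columns[name], columns[other]))
                        for other in names if other != name)
        if strongest > threshold:
            targets.append(name)
    return targets


# Hypothetical numeric encodings of three dimensions over four matching records.
columns = {
    "evaluation_result": [1.0, 2.0, 3.0, 4.0],
    "target_completion": [2.0, 4.0, 6.0, 8.0],
    "doctor_code": [1.0, -1.0, -1.0, 1.0],
}
targets = select_target_dimensions(columns, threshold=0.6)
```

In this example `doctor_code` is uncorrelated with the other two dimensions and is dropped, matching the observation that such dimensions contribute little to the optimization.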
To improve the interpretability of the factors, the target factors are rotated. Rotation methods include orthogonal rotation (e.g., Varimax) and oblique rotation. The purpose of the rotation is for each item of target data to have a high load on as few factors as possible, which makes the factors easier to interpret. The scores of the multi-dimensional data of the second standard data are evaluated according to the factor load matrix, and the multi-dimensional data in the matching data is graded, yielding the low-score data and the high-score data respectively. The low-score data is then optimized using the high-score data, so that the actual data is optimized according to the historical medical data. The first preset score, the second preset score and the third preset score are preset score thresholds that are compared with the absolute value of the factor load (i.e., the score) of the second standard data in the factor load matrix. The first preset score is used to screen the low-score data in the second standard data, the second preset score is used to screen the high-score data in the matching data, and the third preset score serves as the minimum score that the second standard data must reach after optimization. For example, when the absolute value of the factor load of the second standard data is smaller than the first preset score, the second standard data is low-score data, and the high-score data in the matching data is used for optimization so that data with a higher degree of match to the second standard data can be mined again.
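The role of the three preset scores can be sketched as below. For simplicity the sketch assumes one factor load per dimension, whereas a real factor load matrix has one column per retained factor. The dimension names, loads and threshold values are hypothetical.

```python
def split_scores(second_loads, matching_loads, first, second, third):
    """Screen low-score dimensions of the second standard data and high-score
    dimensions of the matching data, using |factor load| as the score."""
    if not (first <= third <= second):
        raise ValueError("expected first <= third <= second preset score")
    low = [dim for dim, load in second_loads.items() if abs(load) < first]
    high = [dim for dim, load in matching_loads.items() if abs(load) > second]
    return low, high


def still_below_target(second_loads, low, third):
    """Low-score dimensions whose score has not yet risen above the third preset score."""
    return [dim for dim in low if abs(second_loads[dim]) <= third]


# Hypothetical factor loads (one per dimension, an assumption for brevity).
second_loads = {"evaluation_result": -0.2, "target_completion": 0.75}
matching_loads = {"evaluation_result": 0.9, "target_completion": 0.5}

low, high = split_scores(second_loads, matching_loads, first=0.3, second=0.8, third=0.6)
pending = still_below_target(second_loads, low, third=0.6)
```

The optimization loop would repeat until `still_below_target` returns an empty list, which corresponds to every low-score dimension exceeding the third preset score.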
An embodiment of the present invention provides a data comprehensive analysis system, including:
A first acquisition module, configured to acquire the historical medical data of each data node and preprocess the historical medical data to obtain first standard data;
A first construction module, configured to recursively construct the first standard data according to a preset construction method to obtain stored data and a stored data index;
A second construction module, configured to construct, for each of the data nodes, at least one data warehouse according to the stored data and the stored data index, and to take all the data warehouses as a distributed data warehouse;
A second acquisition module, configured to acquire actual data and preprocess the actual data to obtain second standard data;
A mining module, configured to perform data mining on all the stored data in the distributed data warehouse to obtain mined data;
A matching module, configured to take the mined data that matches the second standard data as matching data through a preset method;
An optimizing module, configured to determine a distinguishing point between the matching data and the actual data.
Optionally, the data comprehensive analysis system further comprises an image input device, an audio input device and a display device.
An electronic device provided in another embodiment of the present invention includes a memory and a processor; the memory is used for storing a computer program; the processor is configured to implement the data comprehensive analysis method described above when executing the computer program.
A further embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data comprehensive analysis method described above.
An electronic device that can serve as a server or a client of the present invention will now be described; it is an example of a hardware device to which aspects of the present invention can be applied. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
The electronic device includes a computing unit that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) or a computer program loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device may also be stored. The computing unit, ROM and RAM are connected to each other by a bus. An input/output (I/O) interface is also connected to the bus.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. In the present application, the units described as separate units may or may not be physically separate, and the units displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present application. In addition, the functional units in the embodiments of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware or in the form of software functional units.
Although the invention is disclosed above, the scope of the invention is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications will fall within the scope of the invention.

Claims (10)

CN202411451855.6A | 2024-10-17 | 2024-10-17 | A data comprehensive analysis method, system, electronic device and storage medium | Active | CN118964363B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411451855.6A, CN118964363B (en) | 2024-10-17 | 2024-10-17 | A data comprehensive analysis method, system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411451855.6A, CN118964363B (en) | 2024-10-17 | 2024-10-17 | A data comprehensive analysis method, system, electronic device and storage medium

Publications (2)

Publication Number | Publication Date
CN118964363A (en) | 2024-11-15
CN118964363B (en) | 2025-03-21

Family

ID=93396601

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411451855.6A, Active, CN118964363B (en) | A data comprehensive analysis method, system, electronic device and storage medium | 2024-10-17 | 2024-10-17

Country Status (1)

Country | Link
CN (1) | CN118964363B (en)



Legal Events

Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
TR01 | Transfer of patent right | Effective date of registration: 20250902

Address after: Podium Building, Changchun Building, No. 157 Lingqiao Road, Haishu District, Ningbo City, Zhejiang Province 315012
Patentee after: Ningbo SDIC Technology Development Co.,Ltd.
Country or region after: China

Address before: 315000 Zhejiang Province Ningbo City High-tech Innovation Zone West Area of Ningbo New Materials Innovation Center Building 1 No. 1 4-6
Patentee before: Ningbo Ziwan Technology Co.,Ltd.
Country or region before: China
