
Technical Field
The invention relates to the technical field of databases, and in particular to a data analysis method for the use and construction of digital resources.
Background
A digital resource library is a kind of database widely used across industries. To keep its data current, the library must be updated regularly. Because each update involves a large amount of data, the update process is slow, which makes the digital resource library less convenient to use.
Summary of the Invention
The technical problem to be solved by the invention is to provide a data analysis method for the use and construction of digital resources that overcomes the deficiencies of the prior art and increases the data update speed of the digital resource library.
To solve the above technical problem, the invention adopts the following technical solution.
A data analysis method for the use and construction of digital resources, comprising the following steps:
A. Classify the data to be used by content, then clean each class of data;
B. Build an index table for the data processed in step A, and integrate it with the resource library index table by adding a foreign key;
C. Run a simulation operation on the integrated resource library, and update the resource library index table integrated in step B according to the result.
Preferably, in step A, the data cleaning includes the following steps:
A1. Extract the sensitive factors of each class of data;
A2. Cluster the sensitive factors into groups by similarity, then assign one shared priority to all sensitive factors in a group according to the number of sensitive factors in that group;
A3. Delete the data that contain no sensitive factor;
A4. Group the data that contain sensitive factors according to the highest-priority sensitive factor each item contains;
A5. Delete the duplicate data within each group;
A6. Run a simulation operation on the remaining data, then exchange the non-highest-priority sensitive factors in the remaining data and run the simulation operation again; compare the two results, and merge the data for which the deviation between the results before and after the exchange is less than a set threshold;
A7. Repeat step A6 until no data meet the merging condition, then end.
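Steps A2 to A5 above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the text does not say how group size maps to priority, so the sketch assumes that a larger group receives a higher priority. It also assumes the sensitive factors have already been clustered by similarity, and that each data item is a `(payload, factors)` pair with a hashable payload. The names `assign_priorities` and `clean_records` are hypothetical.

```python
from collections import defaultdict


def assign_priorities(factor_clusters):
    """A2: give every factor in a cluster the same priority.

    Assumption: priority equals cluster size, so larger clusters rank higher.
    Returns a mapping factor -> (priority, cluster_id).
    """
    rank = {}
    for cluster_id, cluster in enumerate(factor_clusters):
        for factor in cluster:
            rank[factor] = (len(cluster), cluster_id)
    return rank


def clean_records(records, rank):
    """A3 to A5 over records given as (payload, factors) pairs."""
    groups = defaultdict(list)
    for payload, factors in records:
        known = [f for f in factors if f in rank]
        if not known:
            continue  # A3: drop data containing no sensitive factor
        # A4: pick the highest-priority factor; ties go to the earlier cluster
        top = max(known, key=lambda f: (rank[f][0], -rank[f][1]))
        groups[rank[top][1]].append((payload, frozenset(factors)))
    # A5: remove exact duplicates within each group, keeping first-seen order
    return {cid: list(dict.fromkeys(items)) for cid, items in groups.items()}
```

Because factors that are missing from `rank` are ignored, an item that contains only unknown factors is treated in the same way as an item with no sensitive factor and is dropped in step A3.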
Preferably, in step A1, extracting the sensitive factors of each class of data includes the following steps:
A11. Mark the data content, with at least two marks per data item;
A12. Randomly replace the content at a marked position of the data, run test operations on the data before and after the replacement using a test function, and calculate the deviation between the two results;
A13. Repeat step A12, changing the marked position of the data before each execution of step A12, until the deviation exceeds a preset threshold or the number of repetitions reaches a preset count; then end the test operations and select the marked content with the largest deviation as the sensitive factor.
Preferably, in step B, building the index table for the data processed in step A includes the following steps:
B11. Build, for each data item, the sensitive factor set of the sensitive factors it contains, and build an association function between the sensitive factor set and the data;
B12. Build a two-level index table. The objects of the first-level index table are the association functions, which are grouped by similarity and stored in groups; the objects of the second-level index table are the sensitive factor sets, which are stored in a queue;
B13. When retrieving data, first search the second-level index table for sensitive factor sets that are identical and/or similar to the sensitive factor set of the target data; then search the first-level index table for the association functions related to the sensitive factors found in the second-level index table; finally, locate the target data through the association functions in the group containing the association functions found.
Preferably, in step C, updating the resource library index table integrated in step B includes the following steps:
C1. Update the sensitive factor sets according to the result of the simulation operation;
C2. Update the second-level index table according to the updated sensitive factor sets.
The beneficial effects of the above technical solution are as follows. By extracting sensitive factors and using them as the limiting parameters for data cleaning, the invention effectively reduces the amount of checking computation performed on the data during cleaning while improving cleaning accuracy. In addition, building a two-level index structure comprising association functions and sensitive factors during index construction improves data retrieval efficiency. Furthermore, each update of the index table only requires updating the second-level index table, which contains the sensitive factor sets, so the update requires less computation.
Brief Description of the Drawings
FIG. 1 is a flow chart of a specific embodiment of the invention.
Detailed Description
Referring to FIG. 1, a specific embodiment of the invention includes the following steps:
A. Classify the data to be used by content, then clean each class of data;
B. Build an index table for the data processed in step A, and integrate it with the resource library index table by adding a foreign key;
C. Run a simulation operation on the integrated resource library, and update the resource library index table integrated in step B according to the result.
In step A, the data cleaning includes the following steps:
A1. Extract the sensitive factors of each class of data;
A2. Cluster the sensitive factors into groups by similarity, then assign one shared priority to all sensitive factors in a group according to the number of sensitive factors in that group;
A3. Delete the data that contain no sensitive factor;
A4. Group the data that contain sensitive factors according to the highest-priority sensitive factor each item contains;
A5. Delete the duplicate data within each group;
A6. Run a simulation operation on the remaining data, then exchange the non-highest-priority sensitive factors in the remaining data and run the simulation operation again; compare the two results, and merge the data for which the deviation between the results before and after the exchange is less than a set threshold;
A7. Repeat step A6 until no data meet the merging condition, then end.
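The merge loop of steps A6 and A7 can be illustrated with the sketch below. The text does not specify how factors are exchanged or what the simulation operation computes, so the sketch makes several assumptions: the exchange is performed pairwise between two items of the same group (items sharing one highest-priority factor `top`), `simulate` is a caller-supplied function returning a number, and merging two items means taking the union of their non-highest-priority factors. The names `merge_once` and `merge_until_stable` are hypothetical.

```python
from itertools import combinations


def merge_once(top, rests, simulate, threshold):
    """One pass of A6 over a group sharing the highest-priority factor `top`.

    `rests` holds each item's non-highest-priority factors as a frozenset.
    Returns the (possibly shortened) list and whether a merge happened.
    """
    for i, j in combinations(range(len(rests)), 2):
        before = simulate(top, rests[i])
        # Item i after taking item j's secondary factors in the exchange
        after = simulate(top, rests[j])
        if abs(after - before) < threshold:
            merged = rests[i] | rests[j]
            remaining = [r for k, r in enumerate(rests) if k not in (i, j)]
            return remaining + [merged], True
    return rests, False


def merge_until_stable(top, rests, simulate, threshold):
    """A7: repeat A6 until no pair meets the merging condition."""
    merged = True
    while merged and len(rests) > 1:
        rests, merged = merge_once(top, rests, simulate, threshold)
    return rests
```

Each successful pass reduces the number of items by one, so the loop always terminates. Within a group the exchange is symmetric under these assumptions, so a single deviation value is enough for each pair.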
In step A1, extracting the sensitive factors of each class of data includes the following steps:
A11. Mark the data content, with at least two marks per data item;
A12. Randomly replace the content at a marked position of the data, run test operations on the data before and after the replacement using a test function, and calculate the deviation between the two results;
A13. Repeat step A12, changing the marked position of the data before each execution of step A12, until the deviation exceeds a preset threshold or the number of repetitions reaches a preset count; then end the test operations and select the marked content with the largest deviation as the sensitive factor.
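Steps A11 to A13 amount to a perturbation test, which might look like the following sketch. The assumptions are labeled here because the text leaves them open: a data item is modeled as a list of field values in which every position is a markable location, the marked position advances in order on each round, replacement values are drawn from a caller-supplied `candidates` list, and the deviation is the absolute difference between two numeric test results. The function name `extract_sensitive_factor` and all parameter names are hypothetical.

```python
import random


def extract_sensitive_factor(fields, test_fn, candidates, threshold,
                             max_rounds, rng=None):
    """Return (marked content, deviation) for the most sensitive position."""
    if len(fields) < 2:
        raise ValueError("A11 requires at least two marks per data item")
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    rng = rng or random.Random()
    baseline = test_fn(fields)  # test result before any replacement
    best_pos, best_dev = 0, -1.0
    for round_no in range(max_rounds):
        pos = round_no % len(fields)  # A13: change the marked position
        mutated = list(fields)
        mutated[pos] = rng.choice(candidates)  # A12: random replacement
        dev = abs(test_fn(mutated) - baseline)
        if dev > best_dev:
            best_pos, best_dev = pos, dev
        if dev > threshold:
            break  # A13: stop once the deviation exceeds the threshold
    return fields[best_pos], best_dev
```

Passing a seeded `random.Random` instance as `rng` makes the replacements reproducible, which is convenient when comparing runs.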
In step B, building the index table for the data processed in step A includes the following steps:
B11. Build, for each data item, the sensitive factor set of the sensitive factors it contains, and build an association function between the sensitive factor set and the data;
B12. Build a two-level index table. The objects of the first-level index table are the association functions, which are grouped by similarity and stored in groups; the objects of the second-level index table are the sensitive factor sets, which are stored in a queue;
B13. When retrieving data, first search the second-level index table for sensitive factor sets that are identical and/or similar to the sensitive factor set of the target data; then search the first-level index table for the association functions related to the sensitive factors found in the second-level index table; finally, locate the target data through the association functions in the group containing the association functions found.
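A minimal in-memory sketch of the two-level index of steps B11 to B13 is given below. It is a sketch under assumptions rather than the structure used by the invention: an association function is modeled as a plain mapping from a sensitive factor set to record identifiers, similarity is measured with the Jaccard index, and a fixed `similarity` threshold decides both grouping and matching. The class `TwoLevelIndex` and the helper `jaccard` are hypothetical names.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class TwoLevelIndex:
    def __init__(self, similarity=0.5):
        self.similarity = similarity
        self.level2 = deque()  # B12: queue of sensitive factor sets
        self.level1 = []       # B12: groups, each mapping factor set -> ids

    def add(self, record_id, factors):
        key = frozenset(factors)
        if key not in self.level2:
            self.level2.append(key)
        for group in self.level1:  # reuse an existing exact entry first
            if key in group:
                group[key].append(record_id)
                return
        for group in self.level1:  # otherwise join the first similar group
            if any(jaccard(key, other) >= self.similarity for other in group):
                group[key] = [record_id]
                return
        self.level1.append({key: [record_id]})

    def find(self, factors):
        """B13: match factor sets in level 2, then read the matching groups."""
        target = frozenset(factors)
        matches = [k for k in self.level2
                   if jaccard(target, k) >= self.similarity]
        hits = []
        for group in self.level1:
            if any(k in group for k in matches):
                for ids in group.values():
                    hits.extend(ids)
        return hits
```

`find` returns every record in each matching group, so its result is a candidate list that the caller would still need to narrow down to the target data.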
In step C, updating the resource library index table integrated in step B includes the following steps:
C1. Update the sensitive factor sets according to the result of the simulation operation;
C2. Update the second-level index table according to the updated sensitive factor sets.
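Steps C1 and C2 can be illustrated as follows. The sketch assumes that each second-level entry is a `(set_id, factor_set)` pair and that the first-level index refers to entries only by `set_id`, which is one way to make an update touch the second level alone; the text does not describe the storage layout. The function name `update_second_level` and the `updates` mapping, standing in for the outcome of the simulation operation, are hypothetical.

```python
from collections import deque


def update_second_level(level2, updates):
    """Apply updated factor sets to the second-level queue in place.

    level2:  deque of (set_id, frozenset_of_factors) entries
    updates: mapping set_id -> new iterable of factors (from step C1)
    Returns the number of entries whose factor set changed.
    """
    changed = 0
    # Rotate through the queue exactly once so the entry order is preserved
    for _ in range(len(level2)):
        set_id, factors = level2.popleft()
        new_factors = frozenset(updates.get(set_id, factors))
        if new_factors != factors:
            changed += 1
        level2.append((set_id, new_factors))
    return changed
```

Under this layout the first-level groups hold only identifiers, so they stay valid after the update without being rewritten.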
By improving the data cleaning process, the invention effectively increases the data update speed of the digital resource library.
In the description of the invention, it should be understood that orientation or position terms such as "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate the description of the invention and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore should not be construed as limiting the invention.
The basic principles, main features, and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited by the above embodiment; the embodiment and the description merely illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111496809.4A | 2021-12-08 | 2021-12-08 | A data analysis method for digital resource usage construction |
| Publication Number | Publication Date |
|---|---|
| CN114328481A | 2022-04-12 |
| CN114328481B | 2025-09-23 |