
Technical Field
The present invention relates to the technical field of data processing, and in particular to a data quality inspection rule matching method, a storage medium, and a system.
Background
A power grid system generates a large amount of business data during operation. This business data reflects the operating status of the power grid system and must be collected and stored in a business system. At present, data quality inspection rules are usually applied to check the quality of the business data in the business system. If the quality inspection result for any business data is abnormal, staff must monitor the power grid operation business corresponding to the abnormal business data.
The data quality inspection rules currently used for quality inspection are usually preset. When a data quality inspection rule is selected for checking business data, the degree of association between the field metadata that describes the business data and each data quality inspection rule is calculated first. If the degree of association meets the standard, the field metadata is matched with that data quality inspection rule for quality inspection. If the degree of association between a given piece of field metadata and every data quality inspection rule fails to meet the standard, that field metadata cannot be matched with any data quality inspection rule, and the business data it describes cannot be quality-checked.
Summary of the Invention
The technical problem to be solved by the present invention is how to mitigate the situation in which field metadata cannot be matched with any data quality inspection rule.
To solve the above technical problem, the present invention provides a data quality inspection rule matching method comprising the following steps:
A. Collect, from the business system, multiple pieces of field metadata used to describe business data, together with the name information, source information and data type information of each piece of field metadata;
B. Obtain multiple preset data quality inspection rules, together with the field name information, field source information and condition parameters contained in each data quality inspection rule;
C. Based on the name information and source information of each piece of field metadata and the field name information and field source information contained in each data quality inspection rule, determine whether the degree of association between each piece of field metadata and each data quality inspection rule meets the standard;
D. Match each piece of field metadata with the data quality inspection rule whose degree of association with it meets the standard;
E. Among the multiple pieces of field metadata, identify the candidate field metadata that has been matched with a data quality inspection rule and the field metadata to be matched that has not been matched with any data quality inspection rule;
F. For each piece of field metadata to be matched, perform the following steps F1, F2, F3 and F4:
- F1. Determine whether there is candidate field metadata whose text similarity to the field metadata to be matched is greater than a preset threshold and whose data type is the same; if so, display that candidate field metadata and its matched data quality inspection rule for the user to select;
- F2. Obtain the candidate field metadata selected by the user and its matched data quality inspection rule, and replace the field name information and field source information contained in that data quality inspection rule with the name information and source information of the field metadata to be matched;
- F3. Obtain a new condition parameter entered by the user, and replace the condition parameter contained in the data quality inspection rule selected by the user with the new condition parameter, thereby obtaining a new data quality inspection rule;
- F4. Match the field metadata to be matched with the new data quality inspection rule.
Preferably, in step D, if the text similarity between the field name information of a data quality inspection rule and the name information of a piece of field metadata reaches a first preset value, and the text similarity between the field source information of that rule and the source information of that field metadata reaches a second preset value, the degree of association between the data quality inspection rule and the field metadata meets the standard.
Preferably, in step F1, the text similarity between the field metadata to be matched and each piece of candidate field metadata is first calculated from their respective name information, and it is determined whether there is candidate field metadata whose text similarity to the field metadata to be matched is greater than the preset threshold. If so, the data type information of the field metadata to be matched is then compared with the data type information of that candidate field metadata to determine whether the two have the same data type.
Preferably, in step F1, if there are multiple pieces of candidate field metadata whose text similarity to the field metadata to be matched is greater than the preset threshold and whose data type is the same, these pieces of candidate field metadata are sorted in descending order of text similarity and displayed for the user to select.
Preferably, in step F1, only the candidate field metadata whose text similarity ranks above a predetermined position is selected for display.
Preferably, in step F2, an SQL engine is first used to decompose the data quality inspection rule into a select clause containing the field name information, a from clause containing the field source information, and a where clause containing the condition parameter. The field name information in the select clause is then replaced with the name information of the field metadata to be matched, and the field source information in the from clause is replaced with the source information of the field metadata to be matched. In step F3, the condition parameter in the where clause is replaced with the new condition parameter entered by the user, and the replaced select clause, from clause and where clause are then combined to obtain the new data quality inspection rule.
The present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the data quality inspection rule matching method described above.
The present invention further provides a data quality inspection rule matching system comprising a computer-readable storage medium and a processor connected to each other, the computer-readable storage medium being as described above.
The present invention has the following beneficial effects. For field metadata to be matched that has not been matched with any data quality inspection rule, it is determined whether there is candidate field metadata whose text similarity to the field metadata to be matched is greater than a preset threshold and whose data type is the same. If such candidate field metadata exists, the template of the data quality inspection rule matched to it is suitable for the field metadata to be matched, so the candidate field metadata and its matched data quality inspection rule are displayed for the user to select. The candidate field metadata selected by the user and its matched data quality inspection rule are then obtained, the field name information and field source information contained in that rule are replaced with the name information and source information of the field metadata to be matched, and the condition parameter contained in that rule is replaced with a new condition parameter entered by the user. In other words, starting from the template of the original data quality inspection rule, the field name information, field source information and condition parameter are changed according to the name information and source information of the field metadata to be matched and the new condition parameter entered by the user, yielding a new data quality inspection rule. Because the field name information and field source information of the new rule are derived from the field metadata to be matched, the degree of association between the field metadata to be matched and the new data quality inspection rule meets the standard. The field metadata to be matched is therefore matched with the new data quality inspection rule, which can then be used to quality-check the business data described by that field metadata.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the data quality inspection rule matching method.
Detailed Description
The invention is described in further detail below with reference to specific embodiments.
This embodiment provides a data quality inspection rule matching system comprising a computer-readable storage medium and a processor connected to each other. A computer program is stored on the computer-readable storage medium; when executed by the processor, the computer program implements the data quality inspection rule matching method shown in FIG. 1, which comprises the following steps A, B, C, D, E and F.
A. Collect, from the business system, multiple pieces of field metadata used to describe business data, together with the name information, source information and data type information of each piece of field metadata.
In this embodiment, the business system stores a large amount of business data generated by the power grid system during operation, and this business data reflects the operating status of the power grid system. The data quality inspection rule matching system collects from the business system multiple pieces of field metadata used to describe the business data, along with data information such as the name information, source information and data type information of each piece of field metadata. For example, field metadata 1 has the name information name, the source information t_userinfo, and a text data type; field metadata 2 has the name information name, the source information t_admininfo, and a text data type; field metadata 3 has the name information age, the source information t_userinfo, and a numeric data type.
B. Obtain multiple preset data quality inspection rules, together with the field name information, field source information and condition parameters contained in each data quality inspection rule.
To check the quality of the business data in the business system, multiple data quality inspection rules are usually preset, and each data quality inspection rule contains parameter information such as field name information, field source information and a condition parameter. For example, data quality inspection rule 1 is preset as "Select name from t_userinfo where len(name)>8", whose field name information is "name", field source information is "t_userinfo", and condition parameter is "8"; data quality inspection rule 2 is preset as "Select date from t_userinfo where date is null", whose field name information is "date", field source information is "t_userinfo", and condition parameter is "null". The data quality inspection rule matching system obtains the multiple preset data quality inspection rules and the field name information, field source information and condition parameter contained in each of them.
C. Based on the name information and source information of each piece of field metadata and the field name information and field source information contained in each data quality inspection rule, determine whether the degree of association between each piece of field metadata and each data quality inspection rule meets the standard.
The system uses the Levenshtein Distance algorithm to calculate the text similarity between the name information of each piece of field metadata and the field name information of each data quality inspection rule, and determines whether that text similarity reaches a first preset value (80%). It also calculates the text similarity between the source information of each piece of field metadata and the field source information of each data quality inspection rule, and determines whether that text similarity reaches a second preset value (100%). The degree of association between a piece of field metadata and a data quality inspection rule is judged to meet the standard only when the name similarity reaches the first preset value and the source similarity reaches the second preset value; otherwise it is judged not to meet the standard. For example, given field metadata 1, field metadata 2, field metadata 3, data quality inspection rule 1 and data quality inspection rule 2, the system evaluates the degree of association for all six pairs (each of field metadata 1, 2 and 3 against each of data quality inspection rules 1 and 2) and then determines whether each meets the standard, as follows:
The system calculates the text similarity between the name information "name" of field metadata 1 and the field name information "name" of data quality inspection rule 1. The result is 100%, which reaches the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata 1 and the field source information "t_userinfo" of data quality inspection rule 1. The result is 100%, which reaches the second preset value (100%). The system therefore judges that the degree of association between field metadata 1 and data quality inspection rule 1 meets the standard.
The system calculates the text similarity between the name information "name" of field metadata 1 and the field name information "date" of data quality inspection rule 2. The result is 50%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata 1 and the field source information "t_userinfo" of data quality inspection rule 2. The result is 100%, which reaches the second preset value (100%). The system therefore judges that the degree of association between field metadata 1 and data quality inspection rule 2 does not meet the standard.
The system calculates the text similarity between the name information "name" of field metadata 2 and the field name information "name" of data quality inspection rule 1. The result is 100%, which reaches the first preset value (80%). It then calculates the text similarity between the source information "t_admininfo" of field metadata 2 and the field source information "t_userinfo" of data quality inspection rule 1. The result is 50%, which does not reach the second preset value (100%). The system therefore judges that the degree of association between field metadata 2 and data quality inspection rule 1 does not meet the standard.
The system calculates the text similarity between the name information "name" of field metadata 2 and the field name information "date" of data quality inspection rule 2. The result is 50%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_admininfo" of field metadata 2 and the field source information "t_userinfo" of data quality inspection rule 2. The result is 50%, which does not reach the second preset value (100%). The system therefore judges that the degree of association between field metadata 2 and data quality inspection rule 2 does not meet the standard.
The system calculates the text similarity between the name information "age" of field metadata 3 and the field name information "name" of data quality inspection rule 1. The result is 30%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata 3 and the field source information "t_userinfo" of data quality inspection rule 1. The result is 100%, which reaches the second preset value (100%). The system therefore judges that the degree of association between field metadata 3 and data quality inspection rule 1 does not meet the standard.
The system calculates the text similarity between the name information "age" of field metadata 3 and the field name information "date" of data quality inspection rule 2. The result is 30%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata 3 and the field source information "t_userinfo" of data quality inspection rule 2. The result is 100%, which reaches the second preset value (100%). The system therefore judges that the degree of association between field metadata 3 and data quality inspection rule 2 does not meet the standard.
It should be noted that the Levenshtein Distance algorithm, also known as the Edit Distance algorithm, computes the minimum number of edit operations required to turn one string into another; this number is the edit distance between the two strings. The smaller the edit distance, the greater the text similarity of the two strings. The edit operations are replacing one character with another, inserting a character, and deleting a character.
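The edit distance described above can be sketched in Python as follows. The function names `levenshtein` and `text_similarity` are illustrative, and the text does not state how the distance is converted into a percentage; dividing by the length of the longer string is an assumption made here, so the resulting values need not agree with every percentage quoted in this embodiment.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


def text_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("name", "date"))      # two substitutions
print(text_similarity("name", "name"))  # identical strings
```

Under this normalization, "name" and "date" differ by two substitutions over four characters, giving a similarity of 0.5, which is below the 80% first preset value.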
D. Match each piece of field metadata with the data quality inspection rule whose degree of association with it meets the standard.
After determining whether the degree of association between each piece of field metadata and each data quality inspection rule meets the standard, the system establishes a mapping between the field metadata and the data quality inspection rule for every pair whose degree of association meets the standard, thereby matching them. No mapping is established for pairs whose degree of association does not meet the standard, so those pairs remain unmatched. Specifically, the system has judged that the degree of association meets the standard between field metadata 1 and data quality inspection rule 1, and does not meet the standard between field metadata 1 and rule 2, between field metadata 2 and rule 1, between field metadata 2 and rule 2, between field metadata 3 and rule 1, and between field metadata 3 and rule 2. The system therefore matches field metadata 1 with data quality inspection rule 1 and leaves the other five pairs unmatched. That is, field metadata 1 is matched with data quality inspection rule 1 "Select name from t_userinfo where len(name)>8", while field metadata 2 and field metadata 3 are not matched with any data quality inspection rule.
E. Among the multiple pieces of field metadata, identify the candidate field metadata that has been matched with a data quality inspection rule and the field metadata to be matched that has not been matched with any data quality inspection rule.
In this embodiment, the system marks field metadata 1, which has been matched with data quality inspection rule 1, as candidate field metadata, and marks field metadata 2 and 3, which have not been matched with any data quality inspection rule, as field metadata to be matched. The system thus distinguishes candidate field metadata 1 from field metadata 2 and 3 to be matched.
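Steps C to E can be illustrated with a minimal Python sketch built on the example data of this embodiment. The class names `FieldMeta` and `Rule`, the functions `match_rules` and `partition`, and the similarity normalization (one minus the edit distance divided by the longer string's length) are all assumptions for illustration, not details given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMeta:
    name: str
    source: str
    dtype: str


@dataclass(frozen=True)
class Rule:
    sql: str
    field_name: str
    field_source: str
    condition: str


def similarity(a: str, b: str) -> float:
    """Levenshtein-based similarity; the normalization is an assumption."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


def match_rules(fields, rules, name_threshold=0.8, source_threshold=1.0):
    """Steps C and D: map each field to the rules whose association meets the standard."""
    matches = {}
    for f in fields:
        for r in rules:
            if (similarity(f.name, r.field_name) >= name_threshold
                    and similarity(f.source, r.field_source) >= source_threshold):
                matches.setdefault(f, []).append(r)
    return matches


def partition(fields, matches):
    """Step E: split fields into candidates (matched) and fields still to be matched."""
    candidates = [f for f in fields if f in matches]
    pending = [f for f in fields if f not in matches]
    return candidates, pending


f1 = FieldMeta("name", "t_userinfo", "text")
f2 = FieldMeta("name", "t_admininfo", "text")
f3 = FieldMeta("age", "t_userinfo", "numeric")
r1 = Rule("Select name from t_userinfo where len(name)>8", "name", "t_userinfo", "8")
r2 = Rule("Select date from t_userinfo where date is null", "date", "t_userinfo", "null")

matches = match_rules([f1, f2, f3], [r1, r2])
candidates, pending = partition([f1, f2, f3], matches)
print(candidates, pending)
```

With these inputs only field metadata 1 is matched (to rule 1), so it becomes the sole candidate, while field metadata 2 and 3 remain to be matched, in line with the embodiment.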
F. For each piece of field metadata to be matched, perform the following steps F1, F2, F3 and F4.
(1) Steps F1, F2, F3 and F4 as performed on field metadata 2 to be matched are detailed below:
F1. Determine whether there is candidate field metadata whose text similarity to the field metadata to be matched is greater than a preset threshold and whose data type is the same; if so, display that candidate field metadata and its matched data quality inspection rule for the user to select.
The system first calculates the text similarity between the field metadata to be matched and each piece of candidate field metadata from their respective name information, and determines whether there is candidate field metadata whose text similarity to the field metadata to be matched is greater than a preset threshold (for example, 80%). If so, it compares the data type information of the field metadata to be matched with that of the candidate field metadata to determine whether the two have the same data type. If there is candidate field metadata whose text similarity is greater than the preset threshold and whose data type is the same, the system displays that candidate field metadata and its matched data quality inspection rule for the user to select.
In this embodiment, field metadata 2 to be matched has the name information "name" and a text data type, and there is one piece of candidate field metadata, namely candidate field metadata 1, whose name information is "name" and whose data type is also text. The text similarity between field metadata 2 to be matched and candidate field metadata 1, calculated from their name information "name" and "name", is 100%, which is greater than the preset threshold (80%). The system then compares the data type information of the two, both of which are text, and determines that their data types are the same. Since the text similarity is greater than the preset threshold and the data types are the same, the system displays candidate field metadata 1 and its matched data quality inspection rule 1 "Select name from t_userinfo where len(name)>8" for the user to select.
In other embodiments, if there are multiple pieces of candidate field metadata, and more than one of them has a text similarity to field metadata 2 to be matched that is greater than the preset threshold and has the same data type, these pieces of candidate field metadata are sorted in descending order of text similarity, and the three with the highest text similarity are displayed for the user to select.
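The filtering and ranking in step F1 might look like the following Python sketch. `Field`, `find_candidates`, the `top_n` parameter, and the similarity normalization are hypothetical names and choices introduced for illustration; the field names in the usage example are invented and do not come from the embodiment.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    source: str
    dtype: str


def similarity(a: str, b: str) -> float:
    """Levenshtein-based similarity; the normalization is an assumption."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


def find_candidates(pending, candidates, threshold=0.8, top_n=3):
    """Return up to top_n (score, candidate) pairs, highest name similarity first.

    A candidate qualifies only if its name similarity is strictly greater than
    the threshold and its data type equals that of the pending field.
    """
    scored = []
    for cand in candidates:
        score = similarity(pending.name, cand.name)
        if score > threshold and cand.dtype == pending.dtype:
            scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical field names, used only to exercise the ranking.
pending = Field("user_name", "t_admininfo", "text")
pool = [
    Field("user_names", "t_a", "text"),
    Field("user_name", "t_b", "text"),
    Field("username", "t_c", "text"),
    Field("usr_name", "t_d", "text"),
    Field("user_name", "t_e", "numeric"),  # excluded: data type differs
]
for score, cand in find_candidates(pending, pool):
    print(round(score, 3), cand.name, cand.source)
```

The name similarity is checked before the data type only for readability; since both conditions must hold, the order does not change the result.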
F2. Obtain the candidate field metadata selected by the user and its matched data quality inspection rule, and replace the field name information and field source information contained in that data quality inspection rule with the name information and source information of the field metadata to be matched.
In this embodiment, the system displays candidate field metadata 1 and its matched data quality inspection rule 1, "Select name from t_userinfo where len(name)>8". If, after reviewing it, the user considers rule 1 suitable, the user selects candidate field metadata 1 and its matched rule 1, and the system obtains that selection. The system then uses the SQL chunking module of the SQL engine to decompose data quality inspection rule 1, "Select name from t_userinfo where len(name)>8", into the select clause "Select name", the from clause "from t_userinfo" and the where clause "where len(name)>8". Next, it uses the parameter filling module of the SQL engine to replace the field name information "name" in the select clause with the name information "name" of field metadata 2 to be matched (the two values happen to be identical in this example), and to replace the field source information "t_userinfo" in the from clause with the source information "t_admininfo" of field metadata 2 to be matched.
F3. Obtain the new condition parameter entered by the user, and replace the condition parameter contained in the data quality inspection rule selected by the user with the new condition parameter, thereby obtaining a new data quality inspection rule.
After selecting candidate field metadata 1 and its matched data quality inspection rule 1, the user also enters, based on experience, a new condition parameter for the data quality inspection rule, for example "10". Once the system has obtained the new condition parameter "10", it uses the parameter filling module of the SQL engine to replace the condition parameter "8" in the where clause with "10", and then uses the SQL combination module of the SQL engine to combine the updated select, from and where clauses into the new data quality inspection rule "Select name from t_admininfo where len(name)>10".
As steps F2 and F3 show, the system uses the original data quality inspection rule 1, "Select name from t_userinfo where len(name)>8", as a template. It changes the rule's field name information "name", field source information "t_userinfo" and condition parameter "8" according to the name information "name" and source information "t_admininfo" of field metadata 2 to be matched and the new condition parameter "10" entered by the user, which yields the new data quality inspection rule "Select name from t_admininfo where len(name)>10".
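The split, fill and combine sequence of steps F2 and F3 can be sketched as follows. This is a simplified illustration under stated assumptions: the description names the chunking, parameter filling and combination modules but not their internals, so the functions `split_rule`, `fill_rule` and `combine_rule` and the regular expression are hypothetical, and they only handle rules of the single-table "select ... from ... where ..." shape used in this embodiment. The sketch also updates the field name inside the where clause, which the example needs in general even though the names coincide here.

```python
import re

# Assumed rule shape: one select clause, one from clause, one where clause.
RULE_PATTERN = re.compile(
    r"^\s*(select\s+.+?)\s+(from\s+.+?)\s+(where\s+.+?)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_rule(rule: str) -> dict:
    """Chunking: split a rule into its select, from and where clauses."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError(f"unsupported rule shape: {rule!r}")
    select_clause, from_clause, where_clause = match.groups()
    return {"select": select_clause, "from": from_clause, "where": where_clause}


def _replace_word(text: str, old: str, new: str) -> str:
    # Whole-word replacement; the lambda keeps `new` from being read as a
    # regex replacement template.
    return re.sub(rf"\b{re.escape(old)}\b", lambda _: new, text)


def fill_rule(clauses, old_field, new_field, old_source, new_source,
              old_param, new_param) -> dict:
    """Parameter filling: swap field name, field source and condition parameter."""
    where_clause = _replace_word(clauses["where"], old_field, new_field)
    where_clause = _replace_word(where_clause, old_param, new_param)
    return {
        "select": _replace_word(clauses["select"], old_field, new_field),
        "from": _replace_word(clauses["from"], old_source, new_source),
        "where": where_clause,
    }


def combine_rule(clauses: dict) -> str:
    """Combination: join the clauses back into one rule."""
    return " ".join([clauses["select"], clauses["from"], clauses["where"]])


clauses = split_rule("Select name from t_userinfo where len(name)>8")
new_rule = combine_rule(fill_rule(
    clauses,
    old_field="name", new_field="name",
    old_source="t_userinfo", new_source="t_admininfo",
    old_param="8", new_param="10",
))
print(new_rule)  # Select name from t_admininfo where len(name)>10
```

Plain text substitution is enough for this example; a production system would presumably validate the substituted identifiers and parameters before the rule is executed.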
It should be noted that SQL is the abbreviation of Structured Query Language, a computer language used to access, query, update and manage data in relational databases. The SQL engine is one of the key subsystems of a database: upstream, it receives the SQL statements sent by applications; downstream, it directs the executor to run the execution plan.
F4. Match the field metadata to be matched with the new data quality inspection rule.
Once the new data quality inspection rule "Select name from t_admininfo where len(name)>10" has been obtained, the degree of association between field metadata 2 to be matched and this new rule meets the standard. The system therefore establishes a mapping between field metadata 2 and the new rule "Select name from t_admininfo where len(name)>10", so that field metadata 2 is matched with the new rule. The new rule can then be used to perform quality inspection on the business data described by field metadata 2.
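The mapping established in step F4, and the subsequent use of the new rule, can be illustrated with a small self-contained example. Everything here beyond the rule text is an assumption made for the sketch: the in-memory SQLite database, the sample rows, the `rule_mapping` dictionary keyed by source table and field name, and the interpretation that rows returned by the rule are the ones flagged. SQLite has no built-in `len()` function, so one is registered to let the rule run unchanged; the patent does not state which database the rule targets.

```python
import sqlite3

NEW_RULE = "Select name from t_admininfo where len(name)>10"

# F4: mapping from (field source, field name) to the matched rule.
rule_mapping = {("t_admininfo", "name"): NEW_RULE}


def run_rule(conn, source, field):
    """Look up the rule mapped to a field and return the values it selects."""
    rule = rule_mapping[(source, field)]
    return [row[0] for row in conn.execute(rule)]


conn = sqlite3.connect(":memory:")
# Register len() so the rule text from the description executes as written.
conn.create_function("len", 1, lambda s: None if s is None else len(s))
conn.execute("CREATE TABLE t_admininfo (name TEXT)")
conn.executemany(
    "INSERT INTO t_admininfo (name) VALUES (?)",
    [("alice",), ("administrator01",), (None,)],  # hypothetical sample data
)

flagged = run_rule(conn, "t_admininfo", "name")
print(flagged)  # ['administrator01']
```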
(2) Steps F1, F2, F3 and F4 are performed on field metadata 3 to be matched as follows:
Step F1 is executed first. In this embodiment, field metadata 3 to be matched has the name information "age" and the data type information "numeric". There is one candidate field metadata, namely candidate field metadata 1, whose field name information is "name" and whose data type information is "text". Computed from the two name values ("age" and "name"), the text similarity between field metadata 3 to be matched and candidate field metadata 1 is 30%, which is not greater than the preset threshold (80%); that is, no candidate field metadata has a text similarity to field metadata 3 that exceeds the preset threshold. There is therefore no need to compare the data types of field metadata 3 and candidate field metadata 1: the pair already fails the condition that the text similarity be greater than the preset threshold and the data types be consistent. In this case, the system does not display candidate field metadata 1 and its matched data quality inspection rule 1, "Select name from t_userinfo where len(name)>8", for the user to choose.
Consequently, the system does not need to execute steps F2, F3 and F4.
The above are merely embodiments of the invention and do not limit the scope of patent protection. Insubstantial changes or substitutions made by those skilled in the art on the basis of the invention still fall within the scope of patent protection.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211049853.5A (CN115328902B) | 2022-08-30 | 2022-08-30 | A data quality inspection rule matching method, storage medium and system |
| Publication Number | Publication Date |
|---|---|
| CN115328902A | 2022-11-11 |
| CN115328902B | 2025-05-16 |
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |