CN109344145A - Data standard specification-based data cleaning method, device and system - Google Patents

Info

Publication number
CN109344145A
Authority
CN
China
Prior art keywords
data
work order
standard specification
problem report
report work
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811040620.2A
Other languages
Chinese (zh)
Other versions
CN109344145B (en)
Inventor
刘汉亮
邓强
宋勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIMING SOFTWARE Co Ltd
Original Assignee
BEIMING SOFTWARE Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIMING SOFTWARE Co Ltd
Priority to CN201811040620.2A
Publication of CN109344145A
Application granted
Publication of CN109344145B
Active
Anticipated expiration

Abstract

The invention discloses a data cleaning method, device and system based on a data standard specification. The method comprises the following steps: obtaining data standard specification information and a data source; performing quality detection on the data source according to the data standard specification information, generating a problem report work order, and sending the problem report work order to a first processing account; and, after the problem report work order has been processed, storing the processed problem report work order in a knowledge base. Based on the data standard specification information, the invention performs quality detection on the data source to be cleaned, generates a problem report work order and sends it to the relevant processing account. After the handler has finished processing the problem report work order, the work order is stored in the knowledge base, so that handlers can refer to the solutions of processed problem report work orders in subsequent data cleaning, thereby improving the efficiency of data cleaning. The invention can be widely applied in the field of data processing.

Description

Data cleaning method, device and system based on a data standard specification
Technical field
The present invention relates to the field of data processing, and in particular to a data cleaning method, device and system based on a data standard specification.
Background
With the rapid development of society, the data generated by devices such as portable computers grows by hundreds of millions of items every day, and data cleaning technology is being applied ever more widely as a result. Effectively extracting useful information from massive data is therefore of great importance.
Data cleaning, taken literally, means washing away "dirty data". It refers to the final procedure for finding and correcting identifiable errors in a data file. "Dirty data" falls into four main categories: missing data, duplicate data, erroneous data and unusable data. At present, different types of data are cleaned in different ways and therefore require different data standard specifications.
Existing data cleaning methods do not consolidate problem report work orders, so the problem phenomena and solutions recorded in those work orders cannot be reused in subsequent cleaning. To that extent, the efficiency of the prior art still has room for improvement.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a data cleaning method, device and system based on a data standard specification that can improve efficiency.
The first technical solution adopted by the present invention is:
A data cleaning method based on a data standard specification, comprising the following steps:
obtaining data standard specification information and a data source;
performing quality detection on the data source according to the data standard specification information, generating a problem report work order, and sending the problem report work order to a first processing account;
after the problem report work order has been processed, storing the processed problem report work order in a knowledge base.
Further, the step of performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account specifically comprises:
associating each field in the data source with its data standard specification according to the data standard specification information;
adding a data quality detection task, configuring the first processing account and executing task scheduling, to obtain a quality detection result for each field in the data source;
generating the problem report work order according to the quality detection results of the fields in the data source, and sending the problem report work order to the first processing account.
Further, the method also comprises the following step:
according to the data standard specification information, querying the knowledge base for processed problem report work orders that use the same data standard specification.
Further, the method also comprises the following step:
obtaining first information input by a user, and searching the knowledge base, according to the first information, for processed problem report work orders that contain the first information.
The second technical solution adopted by the present invention is:
A data cleaning device based on a data standard specification, comprising:
a memory for storing a program;
a processor for loading the program to execute the data cleaning method based on a data standard specification.
The third technical solution adopted by the present invention is:
A data cleaning system based on a data standard specification, comprising:
an acquisition module for obtaining a data source;
a data standard specification information management module for adding, modifying and deleting data standard specification information;
a quality detection module for performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to a first processing account;
a problem report work order processing module for processing the problem report work order;
a knowledge base for querying and storing processed problem report work orders.
Further, the quality detection module comprises:
a mapping configuration unit for associating each field in the data source with its data standard specification according to the data standard specification information;
a task execution scheduling unit for adding a data quality detection task, configuring the first processing account and executing task scheduling, to obtain a quality detection result for each field in the data source;
a work order management unit for generating the problem report work order according to the quality detection results of the fields in the data source and sending the problem report work order to the first processing account.
Further, the system also comprises:
a query module for querying the knowledge base, according to the data standard specification information, for processed problem report work orders that use the same data standard specification.
Further, the system also comprises:
a search module for obtaining first information input by a user and searching the knowledge base, according to the first information, for processed problem report work orders that contain the first information.
Further, the work order management unit is also used for:
obtaining second information input by a user and reassigning the problem report work order from the first processing account to a second processing account;
or
obtaining third information input by a user and sending the problem report work order to a configured external system.
The beneficial effects of the present invention are as follows: based on the data standard specification information, the invention performs quality detection on the data source to be cleaned, generates a problem report work order and sends it to the relevant processing account. After the handler has finished processing the problem report work order, the work order is stored in the knowledge base, so that handlers can refer to the solutions of processed problem report work orders in subsequent data cleaning, thereby improving the efficiency of data cleaning.
Brief description of the drawings
Fig. 1 is a flow chart of a data cleaning method based on a data standard specification according to a specific embodiment of the present invention.
Detailed description
The present invention is further described below with reference to the accompanying drawing and specific embodiments.
Referring to Fig. 1, a data cleaning method based on a data standard specification is provided. The method can be implemented by a computer.
The method comprises the following steps:
S1. Obtain data standard specification information and a data source. The data standard specification information may include a plurality of rules, and the handler can add, delete and modify rules in the data standard specification information as actually needed.
S2. Perform quality detection on the data source according to the data standard specification information, generate a problem report work order, and send the problem report work order to a first processing account. During quality detection, problems in the data source may be found, that is, cases where the data source does not conform to the rules in the data standard specification information. The problem report work order records these problems, for example that the M-th data item of the N-th field is problematic. The problem report work order recording the data problems of the data source is then transferred to the handler's account, that is, the first processing account. The first processing account may be fixed, or it may be set for each data cleaning run.
S3. After the problem report work order has been processed, store the processed problem report work order in a knowledge base. The processed problem report work order records the handler's solution. For example, if the M-th data item of the N-th field is problematic, the solution to this problem may be to delete, merge or replace the data item, or to perform some other operation. In this way, if the handler encounters a similar problem in a subsequent data cleaning run, the earlier solution can be looked up, which helps to improve the efficiency of data cleaning.
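By way of non-limiting illustration, steps S1 to S3 might be sketched in Python as follows. The names `Rule`, `WorkOrder`, `quality_check`, `open_work_order` and `close_work_order`, the account name `handler_a` and the sample rules are assumptions made for this sketch and do not appear in the original disclosure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """One rule of the data standard specification (hypothetical shape)."""
    field_name: str
    description: str
    check: Callable[[object], bool]  # returns True when a value conforms


@dataclass
class WorkOrder:
    """Problem report work order: the recorded problems and, later, the solution."""
    spec_name: str
    assignee: str
    problems: list  # tuples of (row index, field name, rule description)
    solution: str = ""
    processed: bool = False


def quality_check(rows, rules):
    """S2, detection: list every value that violates a rule."""
    problems = []
    for i, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row.get(rule.field_name)):
                problems.append((i, rule.field_name, rule.description))
    return problems


def open_work_order(rows, spec_name, rules, first_account):
    """S2, reporting: wrap the detected problems in a work order for the handler."""
    return WorkOrder(spec_name, first_account, quality_check(rows, rules))


def close_work_order(order, solution, knowledge_base):
    """S3: record the handler's solution and archive the processed work order."""
    order.solution = solution
    order.processed = True
    knowledge_base.append(order)


# S1: the specification (a list of rules) and the data source (a list of rows).
rules = [
    Rule("age", "age must be an integer from 0 to 150",
         lambda v: isinstance(v, int) and 0 <= v <= 150),
    Rule("name", "name must not be empty", lambda v: bool(v)),
]
rows = [{"name": "Li", "age": 30}, {"name": "", "age": 200}]

knowledge_base = []
order = open_work_order(rows, "person-v1", rules, "handler_a")
close_work_order(order, "replaced the out-of-range age and filled in the name",
                 knowledge_base)
```

In this sketch the knowledge base is a plain list; any persistent store could take its place without changing the flow of the three steps.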
As a preferred embodiment, step S2 specifically comprises:
S21. Associate each field in the data source with its data standard specification according to the data standard specification information; that is, each field in the data source is associated with its corresponding data standard specification by means of a mapping.
S22. Add a data quality detection task, configure the first processing account and execute task scheduling, to obtain a quality detection result for each field in the data source. The method of this embodiment can execute several data cleaning tasks at the same time, which is why a task scheduling function is needed.
S23. Generate the problem report work order according to the quality detection results of the fields in the data source, and send the problem report work order to the first processing account. In this embodiment, the problem report work order contains the data problems found in each field.
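Steps S21 to S23 might be sketched as follows, assuming that a field's standard specification can be expressed as a predicate and that concurrent tasks are scheduled with a thread pool. The disclosure does not specify a scheduler; `ThreadPoolExecutor` from the Python standard library is a stand-in, and `FIELD_SPECS`, `run_task` and `make_work_order` are hypothetical names.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# S21: mapping from each data-source field to its standard specification,
# expressed here as a predicate (an assumption of this sketch).
FIELD_SPECS = {
    "phone": lambda v: bool(re.fullmatch(r"\d{11}", str(v))),
    "email": lambda v: "@" in str(v),
}


def run_task(task):
    """S22: one quality detection task; returns failing row indices per field."""
    results = {name: [] for name in task["mapping"]}
    for i, row in enumerate(task["rows"]):
        for name, check in task["mapping"].items():
            if not check(row.get(name)):
                results[name].append(i)
    return task["account"], results


def make_work_order(account, results):
    """S23: keep only the fields with problems and address the order to the account."""
    return {"assignee": account,
            "problems": {f: idx for f, idx in results.items() if idx}}


tasks = [
    {"account": "handler_a", "mapping": FIELD_SPECS,
     "rows": [{"phone": "13800000000", "email": "a@x.cn"},
              {"phone": "123", "email": "bad"}]},
    {"account": "handler_b", "mapping": {"email": FIELD_SPECS["email"]},
     "rows": [{"email": "ok@x.cn"}]},
]

# Several cleaning tasks run at the same time; map() returns results in task order.
with ThreadPoolExecutor(max_workers=2) as pool:
    orders = [make_work_order(*result) for result in pool.map(run_task, tasks)]
```

Each task carries its own first processing account, which matches the statement that the account may be set for each data cleaning run.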
As a preferred embodiment, to make it convenient for the handler to refer to the solutions of past problem report work orders, this embodiment further comprises the following step:
S4. According to the data standard specification information, query the knowledge base for processed problem report work orders that use the same data standard specification. Based on the data standard specification information selected by the handler, this embodiment can automatically match cases in the knowledge base that use the same data standard specification and present them to the user. The user can thus easily find solutions to refer to, which improves the efficiency of data cleaning.
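The query of step S4 might look like the following sketch, which assumes the knowledge base is kept in a relational table. The table name `work_order`, its columns and the sample rows are hypothetical; an in-memory SQLite database is used only so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE work_order (
           id INTEGER PRIMARY KEY,
           spec TEXT,          -- data standard specification used
           problem TEXT,       -- problem recorded in the work order
           solution TEXT,      -- handler's solution, empty until processed
           processed INTEGER   -- 1 once the handler has finished
       )"""
)
conn.executemany(
    "INSERT INTO work_order (spec, problem, solution, processed) VALUES (?, ?, ?, ?)",
    [
        ("person-v1", "age out of range", "replace with null", 1),
        ("person-v1", "name empty", "", 0),
        ("address-v2", "postcode has 5 digits", "left-pad with zero", 1),
    ],
)


def find_by_spec(conn, spec):
    """S4: processed work orders that used the same data standard specification."""
    cursor = conn.execute(
        "SELECT problem, solution FROM work_order "
        "WHERE spec = ? AND processed = 1 ORDER BY id",
        (spec,),
    )
    return cursor.fetchall()


matches = find_by_spec(conn, "person-v1")
```

Unprocessed work orders are filtered out because they do not yet carry a solution the handler could reuse.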
As a preferred embodiment, the method further comprises the following step:
S5. Obtain first information input by a user, and search the knowledge base, according to the first information, for processed problem report work orders that contain the first information. In this embodiment, the user can search by entering the first information, which may be the name of a relevant field, the format of the data being handled, and so on. When no past data cleaning case uses the same data standard specification, this embodiment can use keywords to search the processed problem report work orders for similar data cleaning schemes, so that the handler can refer to the solutions of past data cleaning cases and the efficiency of data cleaning is improved.
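A minimal sketch of the keyword search in step S5 follows, assuming each processed work order is a record with `field`, `data_format`, `problem` and `solution` text. These keys, the function name `search_knowledge_base` and the case-insensitive substring matching are assumptions of the sketch; the disclosure does not prescribe a matching algorithm.

```python
def search_knowledge_base(knowledge_base, first_info):
    """S5: return processed work orders whose text contains the user's input."""
    needle = first_info.strip().lower()
    if not needle:
        return []
    hits = []
    for order in knowledge_base:
        if not order["processed"]:
            continue  # only processed work orders carry a reusable solution
        text = " ".join([order["field"], order["data_format"],
                         order["problem"], order["solution"]]).lower()
        if needle in text:
            hits.append(order)
    return hits


knowledge_base = [
    {"field": "birth_date", "data_format": "YYYY-MM-DD", "problem": "two-digit year",
     "solution": "expand to four digits", "processed": True},
    {"field": "phone", "data_format": "11 digits", "problem": "contains spaces",
     "solution": "strip whitespace", "processed": True},
    {"field": "birth_date", "data_format": "YYYY-MM-DD", "problem": "month > 12",
     "solution": "", "processed": False},
]

hits = search_knowledge_base(knowledge_base, "Birth_Date")
```

Searching by field name or by data format both work here because the two are concatenated into the searched text.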
A data cleaning device based on a data standard specification comprises:
a memory for storing a program; the memory may be a storage device such as a USB flash drive, a hard disk or an optical disc;
a processor for loading the program to execute the data cleaning method based on a data standard specification of any of the above embodiments.
This embodiment discloses a data cleaning system based on a data standard specification, comprising:
An acquisition module for obtaining a data source. The data source may come from a data interface of an external system, a local database or a storage medium.
A data standard specification information management module for adding, modifying and deleting data standard specification information. The data standard specification information may include a plurality of rules, and the handler can add, delete and modify rules in the data standard specification information as actually needed.
A quality detection module for performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to a first processing account. During quality detection, problems in the data source may be found, that is, cases where the data source does not conform to the rules in the data standard specification information. The problem report work order records these problems, for example that the M-th data item of the N-th field is problematic. The problem report work order recording the data problems of the data source is then transferred to the handler's account, that is, the first processing account. The first processing account may be fixed, or it may be set for each data cleaning run.
A problem report work order processing module for processing the problem report work order. In this module, the handler can log in to his or her own account and process the problem report work order, for example by deleting, adding or modifying data in response to the problems pointed out in the work order. The final solution is stored in the knowledge base together with the problem report work order.
A knowledge base for querying and storing processed problem report work orders. The handler can search the knowledge base for past problem report work orders that dealt with similar situations and for their solutions, thereby improving the efficiency of data cleaning.
This system makes it convenient for the handler to manage data standard specification information, which improves the flexibility of data cleaning, and it makes full use of existing problem report work orders as reference cases, which improves the efficiency of data cleaning.
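The add, modify and delete operations of the data standard specification information management module might be sketched as follows. The class `SpecRegistry`, its method names and the use of regular-expression patterns as rules are assumptions of this sketch; the disclosure does not fix how rules are represented.

```python
class SpecRegistry:
    """In-memory store of data standard specification rules, keyed by rule id."""

    def __init__(self):
        self._rules = {}

    def add(self, rule_id, field_name, pattern):
        """Add a rule; refuse to overwrite an existing rule id."""
        if rule_id in self._rules:
            raise KeyError(f"rule {rule_id!r} already exists")
        self._rules[rule_id] = {"field": field_name, "pattern": pattern}

    def modify(self, rule_id, **changes):
        """Change attributes of an existing rule (raises KeyError if absent)."""
        self._rules[rule_id].update(changes)

    def delete(self, rule_id):
        """Remove a rule (raises KeyError if absent)."""
        del self._rules[rule_id]

    def rules_for(self, field_name):
        """Rules that currently apply to one field of the data source."""
        return [r for r in self._rules.values() if r["field"] == field_name]


registry = SpecRegistry()
registry.add("r1", "postcode", r"\d{6}")
registry.add("r2", "phone", r"\d{11}")
registry.modify("r1", pattern=r"\d{5,6}")  # the handler relaxes the postcode rule
registry.delete("r2")                      # the phone rule is no longer needed
```

Keeping the rules in a registry separate from the detection code is what lets the handler adjust the specification between cleaning runs without touching the quality detection module.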
As a preferred embodiment, the quality detection module comprises:
A mapping configuration unit for associating each field in the data source with its data standard specification according to the data standard specification information. The mapping configuration unit associates each field in the data source with its corresponding data standard specification by means of a mapping.
A task execution scheduling unit for adding a data quality detection task, configuring the first processing account and executing task scheduling, to obtain a quality detection result for each field in the data source. The system of this embodiment can execute several data cleaning tasks at the same time, which is why a task scheduling function is needed.
A work order management unit for generating the problem report work order according to the quality detection results of the fields in the data source and sending the problem report work order to the first processing account. In this embodiment, the problem report work order contains the data problems found in each field.
As a preferred embodiment, to make it convenient for the handler to refer to the solutions of past problem report work orders, this embodiment further comprises:
A query module for querying the knowledge base, according to the data standard specification information, for processed problem report work orders that use the same data standard specification. Based on the data standard specification information selected by the handler, this embodiment can automatically match cases in the knowledge base that use the same data standard specification and present them to the user. The user can thus easily find solutions to refer to, which improves the efficiency of data cleaning.
As a preferred embodiment, the system further comprises:
A search module for obtaining first information input by a user and searching the knowledge base, according to the first information, for processed problem report work orders that contain the first information. In this embodiment, the user can search by entering the first information, which may be the name of a relevant field, the format of the data being handled, and so on. When no past data cleaning case uses the same data standard specification, this embodiment can use keywords to search the processed problem report work orders for similar data cleaning schemes, so that the handler can refer to the solutions of past data cleaning cases and the efficiency of data cleaning is improved.
As a preferred embodiment, to make it easy to transfer a problem report work order for processing, the work order management unit is also used for:
obtaining second information input by a user and reassigning the problem report work order from the first processing account to a second processing account;
or
obtaining third information input by a user and sending the problem report work order to a configured external system.
This embodiment can flexibly assign problem report work orders to different handlers for processing, and can also send problem report work orders to an external system.
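The two transfer options might be sketched as follows, assuming a work order is a dictionary and the external system accepts a JSON payload through some transport supplied by the caller. The function names `reassign` and `forward`, the `history` key and the JSON encoding are assumptions of the sketch; the disclosure does not describe the interface of the external system.

```python
import json


def reassign(order, second_account):
    """Move the work order from its current account to a second processing account."""
    order["history"].append(order["assignee"])  # keep a trail of earlier handlers
    order["assignee"] = second_account
    return order


def forward(order, send):
    """Serialize the work order and hand it to a caller-supplied transport."""
    payload = json.dumps(order, ensure_ascii=False, sort_keys=True)
    send(payload)
    return payload


order = {"id": 7, "assignee": "handler_a", "history": [],
         "problems": ["phone: row 1"]}
reassign(order, "handler_b")

outbox = []                    # stands in for the configured external system
forward(order, outbox.append)
```

Passing the transport in as a callable keeps the sketch independent of any particular external system; in practice it could wrap an HTTP client or a message queue.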
The step numbers in the above method embodiment are provided only for ease of explanation and do not limit the order of the steps; the execution order of the steps in the embodiment may be adapted as understood by those skilled in the art.
The above describes preferred implementations of the present invention, but the invention is not limited to the above embodiments. Those skilled in the art may make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims (10)

CN201811040620.2A | 2018-09-07 | 2018-09-07 | Data standard specification-based data cleaning method, device and system | Active | CN109344145B (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201811040620.2A | CN109344145B (en) | 2018-09-07 | 2018-09-07 | Data standard specification-based data cleaning method, device and system

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201811040620.2A | CN109344145B (en) | 2018-09-07 | 2018-09-07 | Data standard specification-based data cleaning method, device and system

Publications (2)

Publication Number | Publication Date
CN109344145A | 2019-02-15
CN109344145B | 2022-12-27

Family

ID=65304922

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201811040620.2A | Active | CN109344145B (en) | 2018-09-07 | 2018-09-07 | Data standard specification-based data cleaning method, device and system

Country Status (1)

Country | Link
CN (1) | CN109344145B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113032669A (en) * | 2021-03-09 | 2021-06-25 | 国轩高科美国研究院 | Product problem processing method, device and equipment
CN114066170A (en) * | 2021-10-22 | 2022-02-18 | 广西贵港市中科曙光云计算有限公司 | Government data open sharing-oriented problem feedback processing system and method





Also Published As

Publication number | Publication date
CN109344145B (en) | 2022-12-27

CN110704605B (en)Automatic generation method, system, equipment and readable storage medium for article abstract

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
