A data cleaning method for big data
Technical field
This application relates to the field of big data technology, and in particular to a data cleaning method for big data.
Background art
With the continuous development and popularization of computer technology, today's society has moved from the information age into the big data era. Every action a person takes generates data, and these data are collected by different information systems. Enterprises need to analyze users' behaviors and preferences from the collected data in order to provide better services; however, when the volume of collected data reaches the TB, PB or even EB level, most information systems cannot guarantee that the quality of the data they collect meets users' needs. The main factors affecting data quality are: missing data, outdated data, erroneous data, duplicated data, data conflicts, and so on. To improve data quality, data cleaning technology is essential. Data cleaning provides high-quality data services for enterprise operations and also provides a reliable data basis for data mining.
Data cleaning refers to detecting and transforming data to eliminate errors or redundancy, so as to obtain high-quality data that meets the requirements. In the prior art, the main means used for data cleaning include: (1) processing data based on constraints over data properties; this approach requires constraint functions to be designed in advance, and if a constraint function is not comprehensive, useful information may be deleted by mistake, causing data loss. To improve the validity of this approach, human intervention is introduced, i.e., cases that the constraint functions cannot handle during cleaning are resolved through human feedback. Although human intervention raises the accuracy of data cleaning, it also greatly increases the time consumed and relies excessively on subjective human judgment. (2) Applying machine learning during data cleaning, i.e., training a machine learning model for data cleaning in advance and letting it continue to accumulate learning during subsequent cleaning. This approach eliminates human intervention and improves cleaning efficiency, but its accuracy declines, the demands on the model are high, and when data content and formats become more diverse, the cleaning quality suffers further.
The prior art therefore has at least the following technical problems: cleaning efficiency is low, cleaning accuracy is low, and the methods are not suitable for data cleaning targeted at big data.
Summary of the invention
The purpose of the embodiments of the present application is to provide a data cleaning method for big data, so as to improve cleaning efficiency and cleaning accuracy, and to give the data cleaning method versatility and adaptability, making it suitable for cleaning big data.
The embodiments of the present application provide a data cleaning method for big data, which is achieved as follows:
A data cleaning method for big data, the method comprising:
building a Spark cluster and configuring the required components;
establishing a data cleaning rule base, the data cleaning rule base including at least business rules;
preprocessing the data to be cleaned according to the cleaning rule base to obtain preprocessed data to be cleaned;
performing Job division according to the business rules and the data to be processed, to obtain multiple cleaning Jobs of the Spark cluster, each cleaning Job being mapped to a specific business requirement;
assigning the data cleaning tasks of the preprocessed data to be cleaned to the multiple cleaning Jobs of the Spark cluster, each cleaning Job cleaning the preprocessed data to be cleaned using a tree-like cleaning structure according to the requirements of the data cleaning rule base, to obtain the final cleaning result.
In a preferred embodiment, the manner of cleaning the preprocessed data to be cleaned using the tree-like cleaning structure to obtain the final cleaning result comprises:
each cleaning Job preliminarily completing its corresponding data cleaning task according to the data cleaning rule base;
after the data cleaning tasks of the multiple cleaning Jobs are preliminarily completed, generating a different data table corresponding to each cleaning Job;
if, after the data cleaning tasks of the multiple cleaning Jobs are preliminarily completed, the resulting resilient distributed dataset does not need to be cleaned again, storing the resilient distributed dataset in the different data tables according to business requirements, to obtain the final cleaning result;
if, after the data cleaning tasks of the multiple cleaning Jobs are preliminarily completed, the resulting resilient distributed dataset needs to be cleaned again, reassigning the resilient distributed dataset to the corresponding cleaning Jobs for further cleaning, until a new resilient distributed dataset requiring no further processing is obtained, and then storing the new resilient distributed dataset in the different data tables according to business requirements, to obtain the final cleaning result.
In a preferred embodiment, preprocessing the data to be cleaned according to the cleaning rule base comprises:
formatting the data to be cleaned according to the data formatting rules in the cleaning rule base;
performing approximately-duplicated-data detection and cleaning on the data to be cleaned;
performing error-value detection and cleaning on the data to be cleaned.
In a preferred embodiment, performing approximately-duplicated-data detection and cleaning on the data to be cleaned comprises:
selecting keywords from the attributes of each record of the data to be cleaned;
sorting the data to be cleaned according to the importance of the keywords;
using a sliding window of a predefined size, sliding over the data to be cleaned and computing the similarity of the data within the window, to detect approximately duplicated data;
cleaning the approximately duplicated data.
In a preferred embodiment, the manner of cleaning the approximately duplicated data comprises:
sorting the approximately duplicated data according to their update times and assigning different weights, a more recent time corresponding to a larger weight;
merging the multiple duplicated records into the result with the highest confidence.
In a preferred embodiment, performing error-value detection and cleaning on the data to be cleaned comprises:
detecting error values in the data to be cleaned using the data attribute constraints defined by conditional functional dependencies, and correcting the error values.
In a preferred embodiment, the required components include:
an open-source distributed memory system, an open-source distributed resource management framework, a resource manager, and a large-scale parallel query engine.
In a preferred embodiment, the cleaning rule base further includes merge cleaning rules and missing-data processing rules.
In a preferred embodiment, the manner of formatting the data to be cleaned comprises:
performing data normalization;
standardizing the data format by replacing forbidden characters.
With the data cleaning method for big data provided by the embodiments of the present application, firstly, the data cleaning rule base is defined before big data cleaning begins, and the data cleaning result conforms to the definitions in the rule base. When business requirements, data formats and the like change, it is only necessary to change the corresponding cleaning rules to adapt to the new data requirements, which gives the method strong versatility and adaptability. Secondly, the present invention uses the open-source distributed framework Spark for data cleaning; Spark is a memory-based data processing framework, and because data can be accessed faster in memory than on disk, data cleaning efficiency can be effectively improved. Meanwhile, since a data cleaning rule base is established and cleaning follows fixed rules, no human intervention is needed; this reduces the time spent on manual processing and can further improve data cleaning efficiency. In addition, the data cleaning modules are mutually independent: modifying or extending the functions of one part does not affect the operation of the other modules, giving good scalability.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments recorded in the present application, and for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a method flow chart of a data cleaning method for big data provided by an embodiment of the present application;
Fig. 2 is a flow chart of the data cleaning process provided by an embodiment of the present application;
Fig. 3 is a structural schematic diagram of the data cleaning unit division provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of the cleaning structure, using a tree-like cleaning structure, provided by an embodiment of the present application.
Specific embodiment
The embodiments of the present application provide a data cleaning method for big data.
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below in combination with the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of, rather than all of, the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of an embodiment of the data cleaning method for big data described herein. Although the present application provides the method operation steps or device structures shown in the following embodiments or drawings, the method or device may, based on conventional practice or without creative effort, include more or fewer operation steps or module units. For steps or structures that have no necessary causal relationship in logic, the execution order of these steps or the module structure of the device is not limited to the execution order or module structure shown in the embodiments or drawings of the present application. When the method or module structure is applied in an actual device or end product, it can be executed sequentially or in parallel according to the method or module structure shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment, or even in a distributed processing implementation environment).
Specifically, as shown in Fig. 1, an embodiment of the data cleaning method for big data provided by the present application may include:
S1: building a Spark cluster and configuring the required components.
In this example, a Spark cluster of the scale required by the system is first built, and the required components are configured. The required components include the open-source distributed memory system Tachyon, the open-source distributed resource management framework Mesos, the resource manager YARN, the large-scale parallel query engine BlinkDB, and the like. Among them, Tachyon is a memory-based distributed file system that makes it convenient for tasks to share data and reduces the load on the JVM during computation; Mesos is a cluster manager that uses the coordination service Zookeeper to achieve cluster fault tolerance; BlinkDB is a large-scale parallel query engine that allows query response time to be improved by trading off data precision. In this way, the cluster can meet the processing demands of large-scale data.
S2: establishing a data cleaning rule base, the data cleaning rule base including at least business rules.
The data cleaning rule base is established to cope with the diversity of the data. The data cleaning rule base firstly includes business rules: the result of data cleaning must ultimately meet business requirements, such as the range of a numerical value or the category of an attribute. Secondly, it includes merge cleaning rules, which are used to handle approximately duplicated data; the present application sorts such similar data by update time and assigns different weights, with more recently updated data receiving larger weights, so that multiple duplicated records are merged into the result information with the highest confidence. Finally, it includes missing-data processing rules: the handling of missing data is mainly divided into business-related data and non-business-related data, where business-related data needs to be processed according to the business rules, while non-business-related data is filled by inheriting the same attribute from similar records and is marked for convenient subsequent updating.
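The business-rule part of the rule base described above can be sketched as a small lookup structure of value ranges and allowed categories checked against each record. This is a minimal illustrative sketch, not the patent's actual implementation; the attribute names (`age`, `gender`) and rule shapes are assumptions.

```python
def check_business_rules(record, rules):
    """Return the attributes of `record` that violate a business rule:
    either a numeric value outside its allowed range, or a categorical
    value outside its allowed set."""
    violations = []
    for attr, (lo, hi) in rules.get("ranges", {}).items():
        if attr in record and not (lo <= record[attr] <= hi):
            violations.append(attr)
    for attr, allowed in rules.get("categories", {}).items():
        if attr in record and record[attr] not in allowed:
            violations.append(attr)
    return violations

# Hypothetical rule base: numeric ranges and categorical domains.
rule_base = {
    "ranges": {"age": (0, 120)},
    "categories": {"gender": {"M", "F"}},
}

print(check_business_rules({"age": 150, "gender": "M"}, rule_base))  # ['age']
```

Keeping the rules in a data structure rather than in code is what lets a changed business requirement be handled by editing the rule base alone, as the summary above emphasizes.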
S3: preprocessing the data to be cleaned according to the cleaning rule base, to obtain preprocessed data to be cleaned.
In this example, preprocessing the data to be cleaned according to the cleaning rule base may include:
S301: formatting the data to be cleaned according to the data formatting rules in the cleaning rule base.
The manner of formatting the data to be cleaned may include: performing data normalization according to the data formatting rules, and standardizing the data format by replacing forbidden characters, among other operations.
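The formatting step in S301 can be sketched as follows. The particular forbidden-character set and the date layout being unified are illustrative assumptions, not rules taken from the patent:

```python
import re

def format_record(raw):
    """Normalize one raw field: trim whitespace, replace characters that
    are forbidden in the target schema with spaces, collapse repeated
    spaces, and standardize the date layout."""
    value = raw.strip()
    value = re.sub(r"[\t\r\n|;]", " ", value)   # replace forbidden characters
    value = re.sub(r"\s+", " ", value)          # collapse repeated spaces
    m = re.fullmatch(r"(\d{4})/(\d{1,2})/(\d{1,2})", value)
    if m:                                       # unify e.g. 2024/1/5 -> 2024-01-05
        value = "{}-{:0>2}-{:0>2}".format(*m.groups())
    return value

print(format_record("  2024/1/5 \n"))  # 2024-01-05
```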
S302: performing approximately-duplicated-data detection and cleaning on the data to be cleaned.
S303: performing error-value detection and cleaning on the data to be cleaned.
Performing approximately-duplicated-data detection and cleaning on the data to be cleaned comprises:
S3031: selecting keywords from the attributes of each record of the data to be cleaned.
S3032: sorting the data to be cleaned according to the importance of the keywords.
S3033: using a sliding window of a predefined size, sliding over the data to be cleaned and computing the similarity of the data within the window, to detect approximately duplicated data.
S3034: cleaning the approximately duplicated data.
The manner of cleaning the approximately duplicated data may include:
S30341: sorting the approximately duplicated data according to their update times and assigning different weights, a more recent time corresponding to a larger weight;
S30342: merging the multiple duplicated records into the result with the highest confidence.
In a specific implementation, a series of keywords is selected from the attributes of each record, and the data in the data set are sorted according to the importance of these keywords, so that similar data are brought together as much as possible. Using a sliding window of fixed size N, the similarity of the records within the window is computed, and the sliding window is moved along until all data have been compared.
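Steps S3031 to S3033 can be sketched as a sorted-neighborhood pass: sort by the chosen keys so similar records land near each other, then compare each record only with its neighbors inside a sliding window. The similarity measure (a `difflib` ratio over the joined fields) and the threshold are illustrative choices, not ones specified by this application:

```python
import difflib

def detect_near_duplicates(records, keys, window=3, threshold=0.85):
    """Sort records by the key attributes, then compare each record with
    at most `window - 1` following neighbors; pairs whose joined-field
    similarity reaches `threshold` are reported as near-duplicates."""
    ordered = sorted(records, key=lambda r: [r[k] for k in keys])
    joined = [" ".join(str(v) for v in r.values()) for r in ordered]
    pairs = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            sim = difflib.SequenceMatcher(None, joined[i], joined[j]).ratio()
            if sim >= threshold:
                pairs.append((ordered[i], ordered[j]))
    return pairs

records = [
    {"name": "Zhang San", "city": "Beijing"},
    {"name": "Li Si", "city": "Shanghai"},
    {"name": "Zhang  San", "city": "Beijing"},  # near-duplicate of the first
]
print(len(detect_near_duplicates(records, keys=["name"])))  # 1
```

The window keeps the comparison cost linear in the data size rather than quadratic, which is what makes this step viable at big-data scale.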
After the approximately-duplicated-data detection is completed, the detected approximately duplicated data need to be cleaned. They are first sorted according to their update times and assigned different weights, with a more recent time corresponding by default to a larger weight, where R = {v1, v2, ..., vn} denotes the set of entity attributes and n denotes the number of attributes.
The final result is:
Vresult = max(λi · vi), 1 ≤ i ≤ m,
where m denotes the number of approximately duplicated records and λi denotes the different weights; the attribute value with the maximum weighted value is kept as the cleaning result.
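The recency-weighted merge of S30341–S30342 can be sketched by iterating over the duplicates from oldest to newest, so that for every attribute the value from the most recently updated (highest-weight) record that carries it wins. This is equivalent to assigning weights λi that grow with update time and keeping the maximum-weight value; the record layout is an illustrative assumption:

```python
from datetime import date

def merge_duplicates(records):
    """Merge m near-duplicate records into one result. Iterating in
    increasing update-time order lets later (heavier) records overwrite
    earlier ones, while None values never overwrite real ones."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for attr, value in rec.items():
            if attr != "updated" and value is not None:
                merged[attr] = value
    return merged

dups = [
    {"updated": date(2023, 1, 1), "phone": "110", "email": None},
    {"updated": date(2024, 6, 1), "phone": "120", "email": "a@b.com"},
]
print(merge_duplicates(dups))  # {'phone': '120', 'email': 'a@b.com'}
```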
In this example, performing error-value detection and cleaning on the data to be cleaned may include:
detecting error values in the data to be cleaned using the data attribute constraints defined by conditional functional dependencies, and correcting the error values.
An error value means that an obvious error appears in some attribute of a record. Such errors are often caused by input mistakes, have a large impact on data quality, and are very common in the big data field. In this example, conditional functional dependencies are used to handle them. A conditional functional dependency reflects the association between attributes; it is a constraint on the data under certain conditions and can effectively correct erroneous data.
Let R be a relational model; a conditional functional dependency on R is expressed as (Ra: X → Y, Tp),
where Ra denotes the attributes of R; X and Y are subsets of Ra; X → Y is a functional dependency; and Tp is the constraint on the attributes corresponding to X and Y.
The error values in the data can be corrected using the attribute constraints defined by the above conditional functional dependencies.
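A single conditional functional dependency (X → Y, Tp) can be sketched as follows: when a record matches the condition pattern on X, its Y attribute is forced to the value the pattern prescribes. The concrete attributes and the `country_code → country` pattern are illustrative assumptions:

```python
def apply_cfd(record, cfd):
    """Apply one conditional functional dependency (X -> Y, Tp):
    if the record's X attribute matches the pattern value, the Y
    attribute must equal the prescribed value; violations are
    corrected. Returns (possibly corrected record, corrected_flag)."""
    x_attr, x_value, y_attr, y_value = cfd
    if record.get(x_attr) == x_value and record.get(y_attr) != y_value:
        return dict(record, **{y_attr: y_value}), True
    return record, False

# Hypothetical pattern: if country_code == "CN" then country must be "China".
cfd = ("country_code", "CN", "country", "China")
fixed, changed = apply_cfd({"country_code": "CN", "country": "Chnia"}, cfd)
print(fixed["country"], changed)  # China True
```

Unlike an ordinary functional dependency, the constraint only fires when the condition pattern matches, which is what allows it to encode business-specific corrections.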
S4: performing Job division according to the business rules and the data to be processed, to obtain multiple cleaning Jobs of the Spark cluster, each cleaning Job being mapped to a specific business requirement.
Fig. 3 is a structural schematic diagram of the data cleaning unit division provided by an embodiment of the present application. As shown in Fig. 3, starting from the raw data, Job division is performed according to the business rules; each cleaning Job unit is mapped to a specific business requirement, generating a relationship/object mapping that meets the business requirements. The preliminarily processed data generate a new data table. Subsequent operations are based on the new data tables, with one corresponding Job processing each data table, which reduces the number of Jobs in the entire cleaning process.
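The Job division of S4 can be sketched as grouping raw records by the business requirement they belong to, one group per cleaning Job, each of which later writes its own table. The `job_of` mapping and the record shapes are illustrative assumptions; in the actual system each group would become a Spark Job over a partition of the data:

```python
def divide_jobs(raw_records, job_of):
    """Group raw records into cleaning Jobs, one Job per business
    requirement; `job_of` maps a record to its requirement key."""
    jobs = {}
    for rec in raw_records:
        jobs.setdefault(job_of(rec), []).append(rec)
    return jobs

records = [
    {"type": "order", "id": 1},
    {"type": "user", "id": 2},
    {"type": "order", "id": 3},
]
jobs = divide_jobs(records, job_of=lambda r: r["type"])
print(sorted(jobs), [len(jobs[k]) for k in sorted(jobs)])  # ['order', 'user'] [2, 1]
```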
S5: assigning the data cleaning tasks of the preprocessed data to be cleaned to the multiple cleaning Jobs of the Spark cluster, each cleaning Job cleaning the preprocessed data to be cleaned using a tree-like cleaning structure according to the requirements of the data cleaning rule base, to obtain the final cleaning result.
In this example, the manner of cleaning the preprocessed data to be cleaned using the tree-like cleaning structure to obtain the final cleaning result may include:
S501: each cleaning Job preliminarily completing its corresponding data cleaning task according to the data cleaning rule base.
S502: after the data cleaning tasks of the multiple cleaning Jobs are preliminarily completed, generating a different data table corresponding to each cleaning Job.
S503: if, after the data cleaning tasks of the multiple cleaning Jobs are preliminarily completed, the resulting resilient distributed dataset does not need to be cleaned again, storing the resilient distributed dataset in the different data tables according to business requirements, to obtain the final cleaning result.
S504: if, after the data cleaning tasks of the multiple cleaning Jobs are preliminarily completed, the resulting resilient distributed dataset needs to be cleaned again, reassigning the resilient distributed dataset to the corresponding cleaning Jobs for further cleaning, until a new resilient distributed dataset requiring no further processing is obtained, and then storing the new resilient distributed dataset in the different data tables according to business requirements, to obtain the final cleaning result.
Fig. 4 is a schematic diagram of the cleaning structure, using a tree-like cleaning structure, provided by an embodiment of the present application. As shown in Fig. 4, the cleaning task is first assigned to several cleaning Jobs, and each Job satisfies the requirements of the cleaning rule base; however, new dirty data may be generated after each Job has finished processing. In order to further clean these data while reducing processing time, the earlier resilient distributed dataset (RDD) is reused and assigned to the corresponding Job for further cleaning; if the data do not need further cleaning, they are directly stored as the final result. This continues until no data need further cleaning, and the data are then stored in different data tables according to business requirements.
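The tree-like re-cleaning loop of S501–S504 can be sketched as an iteration in which each round a cleaning Job returns records that are clean plus records that are still dirty, and the dirty ones are fed back into a Job until nothing needs re-cleaning. `clean_job` here is a plain-Python stand-in for a Spark Job over an RDD, and the toy cleanliness criterion is purely illustrative:

```python
def tree_clean(dataset, clean_job, max_rounds=10):
    """Repeatedly apply `clean_job`, which returns
    (clean_records, still_dirty_records), re-cleaning the dirty
    remainder each round until it is empty (or max_rounds is hit)."""
    results = []
    pending = list(dataset)
    for _ in range(max_rounds):
        if not pending:
            break
        cleaned, pending = clean_job(pending)
        results.extend(cleaned)
    return results

# Toy Job: values below 10 count as clean; larger ones are halved and
# re-submitted, so several rounds are needed — mimicking Jobs whose
# output may itself contain new dirty data.
def job(values):
    cleaned = [v for v in values if v < 10]
    dirty = [v // 2 for v in values if v >= 10]
    return cleaned, dirty

print(sorted(tree_clean([3, 40, 7], job)))  # [3, 5, 7]
```

Reusing the intermediate dataset for the next round, instead of restarting from the raw data, is what the text above credits for the reduced processing time.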
Fig. 2 is a flow chart of the data cleaning process provided by an embodiment of the present application. The flow chart shown in Fig. 2 corresponds to steps S3 to S5 in the above embodiment. The data cleaning rule base is defined in advance, before data cleaning starts, and cleaning is performed according to this predefined data cleaning rule base throughout the data cleaning process. As shown in Fig. 2, the data cleaning process includes data preprocessing, dividing the cleaning Jobs, and the different cleaning Jobs performing their corresponding data cleaning. The cleaning rules in the whole process all come from the predefined data cleaning rule base; when business requirements, data formats and the like change, it is only necessary to change the corresponding cleaning rules to suit the new data requirements, which gives strong versatility and adaptability.
With the embodiments of the data cleaning method for big data provided above, firstly, the data cleaning rule base is defined before big data cleaning begins, and the data cleaning result conforms to the definitions in the rule base. When business requirements, data formats and the like change, it is only necessary to change the corresponding cleaning rules to adapt to the new data requirements, which gives strong versatility and adaptability. Secondly, the present invention uses the open-source distributed framework Spark for data cleaning; Spark is a memory-based data processing framework, and because data can be accessed faster in memory than on disk, data cleaning efficiency can be effectively improved. Meanwhile, since a data cleaning rule base is established and cleaning follows fixed rules, no human intervention is needed; this reduces the time spent on manual processing and can further improve data cleaning efficiency. In addition, the data cleaning modules are mutually independent: modifying or extending the functions of one part does not affect the operation of the other modules, giving good scalability.
The devices or modules described in the above embodiments can specifically be implemented by computer chips or entities, or by products having certain functions. For convenience of description, the above device is described by dividing its functions into various modules. Of course, when implementing the present application, the functions of the modules can be realized in one or more pieces of software and/or hardware, and modules realizing the same function can also be realized by a combination of multiple sub-modules, etc. The device embodiments described above are only illustrative; for example, the division into modules is only a division by logical function, and other divisions are possible in actual implementation: multiple modules or components can be combined or integrated into another system, or some features can be ignored or not executed.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform specific tasks or implement specific abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
As can be seen from the description of the above embodiments, those skilled in the art can clearly understand that the present application can be realized by means of software plus the necessary general hardware platform. Based on this understanding, the technical solution of the present application, or in other words the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions to make a computer device (which can be a personal computer, a mobile terminal, a server, a network device, etc.) execute the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
The embodiments in this specification are described in a progressive manner; for the same or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the others. The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like.
Although the present application has been described through embodiments, those of ordinary skill in the art will appreciate that the present application has many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of the present application.