CN109753496A - A kind of data cleaning method for big data - Google Patents

A kind of data cleaning method for big data
Info

Publication number
CN109753496A
Authority
CN
China
Prior art keywords
data
cleaning
need
job
clean
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811424289.4A
Other languages
Chinese (zh)
Inventor
李阳
左磊
尹熙
张良晖
蔡劼
桑晓龙
陆世龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tian He Jie (suzhou) Data Ltd By Share Ltd
Original Assignee
Tian He Jie (suzhou) Data Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tian He Jie (suzhou) Data Ltd By Share Ltd
Priority to CN201811424289.4A
Publication of CN109753496A
Legal status: Pending

Abstract

The application provides a data cleaning method for big data. The method comprises: building a Spark cluster and configuring the required components; establishing a data cleaning rule base that includes business rules; preprocessing the data to be cleaned according to the cleaning rule base, to obtain preprocessed data to be cleaned; dividing Jobs according to the business rules and the data to be processed, to obtain multiple cleaning Jobs of the Spark cluster, each cleaning Job being mapped to a specific business requirement; and assigning the data cleaning tasks to the corresponding cleaning Jobs of the Spark cluster, each cleaning Job cleaning the preprocessed data with a tree-like cleaning structure according to the requirements of the data cleaning rule base, to obtain the final cleaning result. The embodiments of the application can improve cleaning efficiency and accuracy and give the data cleaning method greater generality and adaptability.

Description

A data cleaning method for big data
Technical field
This application relates to the field of big data technology, and in particular to a data cleaning method for big data.
Background
With the continued development and spread of computer technology, society has moved from the information age into the age of big data. Everything a person does generates large amounts of data, which are collected by different information systems. Enterprises need to analyze users' behaviors and preferences from the collected data in order to serve them better, but once the collected data reach the TB, PB, or even EB scale, most information systems cannot guarantee that the quality of the data they collect meets users' needs. The main factors that affect data quality are missing data, outdated data, erroneous data, duplicate data, and conflicting data. Data cleaning technology is therefore essential for improving data quality. Data cleaning provides high-quality data services for enterprise operations and a reliable data foundation for data mining.
Data cleaning means detecting and transforming data to eliminate the errors or redundancy in them, so as to obtain high-quality data that meet requirements. In the prior art, the main approaches to data cleaning include the following. (1) Processing data based on constraints defined over the data. This approach requires the constraint functions to be designed in advance, and if the constraint functions are not comprehensive, useful information may be deleted by mistake, causing data loss. Improving the effectiveness of this approach requires human intervention: when the constraint functions cannot handle a case during cleaning, the data are cleaned through human feedback. Human intervention raises cleaning accuracy, but it also greatly increases the time cleaning takes and relies heavily on subjective human judgment. (2) Using machine learning during data cleaning: a machine learning model for data cleaning is trained in advance and keeps learning cumulatively during subsequent cleaning. This approach removes human intervention and improves cleaning efficiency, but accuracy declines and the demands on the model are high; when the content and format of the data are more diverse, cleaning quality suffers considerably.
The prior art has at least the following technical problems: cleaning efficiency is low, cleaning accuracy is low, and the methods are not suitable for cleaning big data.
Summary of the invention
The purpose of the embodiments of this application is to provide a data cleaning method for big data that improves cleaning efficiency and cleaning accuracy and gives the data cleaning method generality and adaptability, so that it is suitable for cleaning big data.
The embodiments of this application provide a data cleaning method for big data, implemented as follows:
A data cleaning method for big data, the method comprising:
building a Spark cluster and configuring the required components;
establishing a data cleaning rule base, the data cleaning rule base including at least business rules;
preprocessing the data to be cleaned according to the cleaning rule base, to obtain preprocessed data to be cleaned;
dividing Jobs according to the business rules and the data to be processed, to obtain multiple cleaning Jobs of the Spark cluster, each cleaning Job being mapped to a specific business requirement;
assigning the data cleaning tasks for the preprocessed data to the corresponding cleaning Jobs of the Spark cluster, each cleaning Job cleaning the preprocessed data with a tree-like cleaning structure according to the requirements of the data cleaning rule base, to obtain the final cleaning result.
In a preferred embodiment, cleaning the preprocessed data with a tree-like cleaning structure to obtain the final cleaning result comprises:
each cleaning Job performing a preliminary pass over its data cleaning task according to the data cleaning rule base;
after the preliminary pass of the multiple cleaning Jobs is complete, generating a separate data table for each cleaning Job;
if the resilient distributed dataset (RDD) obtained after the preliminary pass needs no further cleaning, storing it in the separate data tables according to business requirements, to obtain the final cleaning result;
if the RDD obtained after the preliminary pass needs further cleaning, assigning it to the corresponding cleaning Job again and cleaning it again, until the new RDD obtained needs no further processing, and then storing the new RDD in the separate data tables according to business requirements, to obtain the final cleaning result.
In a preferred embodiment, preprocessing the data to be cleaned according to the cleaning rule base comprises:
formatting the data to be cleaned according to the data formatting rules in the cleaning rule base;
detecting and cleaning near-duplicate data in the data to be cleaned;
detecting and cleaning erroneous values in the data to be cleaned.
In a preferred embodiment, detecting and cleaning near-duplicate data in the data to be cleaned comprises:
selecting keywords from the attributes of each record of the data to be cleaned;
sorting the data to be cleaned according to the importance of the keywords;
sliding a window of predefined size over the data to be cleaned and computing the similarity of the data within the window, to detect near-duplicate data;
cleaning the near-duplicate data.
In a preferred embodiment, cleaning the near-duplicate data comprises:
sorting the near-duplicate data by update time and assigning them different weights, the more recent the time the larger the weight;
merging the multiple duplicate records into the result with the highest confidence.
In a preferred embodiment, detecting and cleaning erroneous values in the data to be cleaned comprises:
using the data attribute constraints defined by conditional functional dependencies to detect erroneous values in the data to be cleaned and to correct them.
In a preferred embodiment, the required components include:
an open-source distributed memory system, an open-source distributed resource management framework, a resource manager, and a large-scale parallel query engine.
In a preferred embodiment, the cleaning rule base further includes merge cleaning rules and missing-data processing rules.
In a preferred embodiment, formatting the data to be cleaned comprises:
data normalization;
formatting and standardizing the data by illegal character replacement.
With the data cleaning method for big data provided by the embodiments of this application, first, the data cleaning rule base is defined before big data cleaning begins, and the cleaning result conforms to the definitions in the rule base. When business requirements, data formats, and the like change, only the corresponding cleaning rules need to be changed to suit the new data requirements, so the method has greater generality and adaptability. Second, the invention uses the open-source distributed framework Spark for data cleaning. Spark is a memory-based data processing framework, and because accessing data in memory is faster than accessing it on disk, data cleaning efficiency can be improved effectively. In addition, because a data cleaning rule base is established and cleaning follows fixed rules, no human intervention is needed, which cuts the time spent on manual processing and further improves cleaning efficiency. Finally, the data cleaning modules are independent of one another; modifying or adding functionality in one part does not affect the operation of the other modules, so the method is easy to extend.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data cleaning method for big data provided by one embodiment of this application;
Fig. 2 is a flowchart of the data cleaning process provided by one embodiment of this application;
Fig. 3 is a structural diagram of the division into data cleaning units provided by one embodiment of this application;
Fig. 4 is a diagram of the tree-like cleaning structure used for cleaning, provided by one embodiment of this application.
Detailed description of the embodiments
The embodiments of this application provide a data cleaning method for big data.
So that those skilled in the art can better understand the technical solutions in this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.
Fig. 1 is a schematic flowchart of one embodiment of the data cleaning method for big data described in this application. Although this application provides the method steps or apparatus structures shown in the following embodiments or drawings, the method or apparatus may include more or fewer steps or module units based on conventional practice or without creative effort. For steps or structures that have no logically necessary causal relationship, the execution order of the steps or the module structure of the apparatus is not limited to that shown in the embodiments or drawings of this application. When the method or module structure is applied in an actual apparatus or end product, it may be executed sequentially or in parallel according to the embodiments or drawings (for example, in an environment with parallel processors or multi-threaded processing, or even in a distributed processing environment).
Specifically, as shown in Fig. 1, one embodiment of the data cleaning method for big data provided by this application may include:
S1: Build a Spark cluster and configure the required components.
In this example, a Spark cluster of the scale the system requires is built first, and the required components are configured. The required components include the open-source distributed memory system Tachyon, the open-source distributed resource management framework Mesos, the resource manager YARN, the large-scale parallel query engine BlinkDB, and so on. Tachyon is a memory-based distributed file system that makes it easy for tasks to share data and reduces the load on the JVM during computation; Mesos is a cluster manager that uses the coordination service ZooKeeper to make the cluster fault-tolerant; BlinkDB is a large-scale parallel query engine that allows query response time to be improved by trading off data precision. This enables the cluster to meet the processing demands of large-scale data.
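A minimal sketch of how such a cluster might be parameterized, assuming Mesos is chosen as the cluster manager and its master is located through ZooKeeper. The host names, ports, and resource sizes are placeholders and do not come from the application; the `spark.*` keys are standard Spark configuration properties, and `to_submit_args` is a hypothetical helper.

```python
# Hypothetical cluster settings; hosts, ports and sizes are placeholders.
spark_settings = {
    "spark.app.name": "big-data-cleaning",
    # Mesos master discovered through ZooKeeper (assumed deployment choice).
    "spark.master": "mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
}


def to_submit_args(settings):
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(settings):
        args += ["--conf", f"{key}={settings[key]}"]
    return args
```

Keeping the settings in one mapping makes it straightforward to swap the master URL if YARN rather than Mesos manages the cluster.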
S2: Establish a data cleaning rule base, the data cleaning rule base including at least business rules.
The data cleaning rule base is established to cope with data diversity. First, it contains business rules: the result of data cleaning must ultimately meet business requirements, such as the range of numeric values and the categories of attributes. Second, it contains merge cleaning rules, which handle near-duplicate data: this application sorts such similar records by update time and assigns them different weights, with more recently updated records weighted more heavily, so that multiple duplicate records are merged into the result with the highest confidence. Finally, it contains missing-data processing rules. Missing data are divided into business-related data and non-business-related data: business-related data are handled according to the business rules, while non-business-related data are filled by inheriting the value of the same attribute and are marked to make later updates easier.
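The rule base described above can be pictured as a small data structure. The sketch below is an illustration under assumptions, not the application's implementation: the class name `RuleBase`, the field names, the use of a default value for business-related gaps, and the `_inherited` marker are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


@dataclass
class RuleBase:
    # Business rules: attribute name -> predicate a cleaned value must satisfy.
    business_rules: Dict[str, Callable[[Any], bool]] = field(default_factory=dict)
    # Assumed handling of missing business-related attributes: a default value.
    business_defaults: Dict[str, Any] = field(default_factory=dict)

    def violations(self, record: Record) -> List[str]:
        """Return the attributes whose present values break a business rule."""
        return [attr for attr, ok in self.business_rules.items()
                if record.get(attr) is not None and not ok(record[attr])]

    def fill_missing(self, record: Record, related: Optional[Record]) -> Record:
        """Fill missing values; mark inherited ones so they can be updated later."""
        filled = dict(record)
        inherited = []
        for attr, value in record.items():
            if value is not None:
                continue
            if attr in self.business_defaults:
                filled[attr] = self.business_defaults[attr]
            elif related is not None and related.get(attr) is not None:
                # Non-business data: inherit the same attribute from a related record.
                filled[attr] = related[attr]
                inherited.append(attr)
        filled["_inherited"] = inherited
        return filled
```

Because the rules are data rather than code paths, a change in business requirements only means editing the mappings passed to `RuleBase`.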
S3: Preprocess the data to be cleaned according to the cleaning rule base, to obtain preprocessed data to be cleaned.
In this example, preprocessing the data to be cleaned according to the cleaning rule base may include:
S301: Format the data to be cleaned according to the data formatting rules in the cleaning rule base.
Formatting the data to be cleaned may include, according to the data formatting rules, normalizing the data and formatting and standardizing the data by illegal character replacement, among other steps.
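As a sketch of what formatting step S301 could look like for a single text field: the choice of Unicode NFKC normalization, the definition of "illegal" as ASCII control characters, and the function name `format_value` are assumptions made for illustration; the application does not specify them.

```python
import re
import unicodedata

# Assumption: "illegal" characters are ASCII control characters other than
# tab, newline and carriage return.
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SPACES = re.compile(r"\s+")


def format_value(text: str, replacement: str = "") -> str:
    """Normalize a text field and replace illegal characters."""
    # Fold full-width and other compatibility forms into canonical ones.
    text = unicodedata.normalize("NFKC", text)
    # Replace illegal characters with the configured replacement.
    text = _ILLEGAL.sub(replacement, text)
    # Collapse runs of whitespace and trim the ends.
    return _SPACES.sub(" ", text).strip()
```

For example, `format_value("Ｓｕｚｈｏｕ\x00  City\t")` yields `"Suzhou City"`.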
S302: Detect and clean near-duplicate data in the data to be cleaned.
S303: Detect and clean erroneous values in the data to be cleaned.
Detecting and cleaning near-duplicate data in the data to be cleaned comprises:
S3031: Select keywords from the attributes of each record of the data to be cleaned.
S3032: Sort the data to be cleaned according to the importance of the keywords.
S3033: Slide a window of predefined size over the data to be cleaned and compute the similarity of the data within the window, to detect near-duplicate data.
S3034: Clean the near-duplicate data.
Cleaning the near-duplicate data may include:
S30341: Sort the near-duplicate data by update time and assign them different weights, the more recent the time the larger the weight;
S30342: Merge the multiple duplicate records into the result with the highest confidence.
In a specific implementation, a series of keywords is selected from the attributes of each record, and the records in the data set are sorted successively by the importance of these keywords so that similar records end up as close together as possible. A sliding window of fixed size N is then used, and the similarity of the records within the window is computed. The window is moved until all the data have been processed.
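The sort-then-slide procedure can be sketched in a few lines. This is an illustration under assumptions: the application does not name a similarity measure, so `difflib.SequenceMatcher` from the standard library is swapped in, and the function names, the default window size, and the threshold of 0.85 are hypothetical.

```python
from difflib import SequenceMatcher


def record_similarity(a, b, keys):
    """Average string similarity of two records over the selected keys."""
    scores = [SequenceMatcher(None, str(a[k]), str(b[k])).ratio() for k in keys]
    return sum(scores) / len(scores)


def find_near_duplicates(records, keys, window=3, threshold=0.85):
    """Return index pairs of records judged near-duplicates.

    `keys` is ordered from most to least important; records are sorted on it
    so that similar records become neighbours, then each record is compared
    only with the next `window - 1` records in sorted order.
    """
    order = sorted(range(len(records)),
                   key=lambda i: tuple(str(records[i][k]) for k in keys))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if record_similarity(records[i], records[j], keys) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs
```

Restricting comparisons to the window keeps the number of similarity computations proportional to N times the number of records, rather than quadratic in the number of records.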
After near-duplicate detection is complete, the detected near-duplicate data must be cleaned. The data are first sorted by update time and assigned different weights; by default, the more recent the time, the larger the weight. Here R = {v_1, v_2, ..., v_n} denotes the set of attributes of an entity, and n is the number of attributes.
The final result is:
[formula not reproduced in the source text]
where m is the number of near-duplicate records and λ denotes the weights; the attribute with the maximum value is kept as the cleaning result.
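Since the formula itself is not available in this text, the following is only one possible reading of the description: each of the m near-duplicate records gets a weight λ that grows with recency, and for every attribute the value with the largest accumulated weight is kept. The rank-based weighting, the `updated_at` field, and the function name are assumptions.

```python
from collections import defaultdict


def merge_duplicates(records, time_key="updated_at"):
    """Merge near-duplicate records into a single record.

    Assumed weighting: records are ranked by update time and the k-th oldest
    of m records gets weight k / (1 + 2 + ... + m), so newer records count more.
    Attribute values must be hashable.
    """
    ordered = sorted(records, key=lambda r: r[time_key])
    total = len(ordered) * (len(ordered) + 1) / 2
    weights = [(rank + 1) / total for rank in range(len(ordered))]

    attrs = []
    for rec in ordered:
        for attr in rec:
            if attr != time_key and attr not in attrs:
                attrs.append(attr)

    merged = {}
    for attr in attrs:
        score = defaultdict(float)
        for rec, weight in zip(ordered, weights):
            value = rec.get(attr)
            if value is not None:
                score[value] += weight
        if score:
            # Keep the value with the highest accumulated weight.
            merged[attr] = max(score, key=score.get)
    merged[time_key] = ordered[-1][time_key]
    return merged
```

Under this reading, a value that appears only in older records can still win if enough records agree on it, while a value present in the newest record wins any close contest.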
In this example, detecting and cleaning erroneous values in the data to be cleaned may include:
using the data attribute constraints defined by conditional functional dependencies to detect erroneous values in the data to be cleaned and to correct them.
An erroneous value is an obvious error in some attribute of a record. Such errors are usually caused by input mistakes; they are very common in the big data field and have a large effect on data quality. In this example they are handled with conditional functional dependencies. A conditional functional dependency reflects the association between attributes; it is a data constraint that holds under certain conditions and can correct erroneous data effectively.
Let R be a relational schema. A conditional functional dependency on R is written (R_a: X → Y, T_P),
where R_a denotes the attributes of R; X and Y are subsets of the attributes R_a; X → Y is a functional dependency; and T_P is the constraint of X and Y on the corresponding attributes.
The attribute constraints defined by such conditional functional dependencies can be used to correct erroneous values in the data.
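A small worked sketch of detection and correction with a conditional functional dependency, assuming the constraint T_P is given as a list of pattern rows over the attributes in X and Y. Only constant patterns are repaired here; the wildcard `_`, the attribute names, and the sample postal codes are illustrative assumptions rather than content of the application.

```python
WILDCARD = "_"


def matches(record, pattern, attrs):
    """True if the record agrees with the pattern on every listed attribute."""
    return all(pattern[a] == WILDCARD or record.get(a) == pattern[a]
               for a in attrs)


def repair_with_cfd(records, lhs, rhs, tableau):
    """Detect and correct values that violate constant patterns of a CFD.

    lhs / rhs are the attribute lists X and Y; tableau is the pattern
    table T_P. Records are corrected in place, and each correction is
    returned as (row index, attribute, old value, new value).
    """
    corrections = []
    for idx, rec in enumerate(records):
        for pattern in tableau:
            if not matches(rec, pattern, lhs):
                continue
            for attr in rhs:
                expected = pattern[attr]
                if expected != WILDCARD and rec.get(attr) != expected:
                    corrections.append((idx, attr, rec.get(attr), expected))
                    rec[attr] = expected
    return corrections
```

For instance, with X = [country, zip], Y = [city] and a pattern row (CN, 215000, Suzhou), a record (CN, 215000, Suzhuo) is flagged and its city corrected to Suzhou. Patterns whose Y side is a wildcard would require comparing pairs of records and are not covered by this sketch.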
S4: Divide Jobs according to the business rules and the data to be processed, to obtain multiple cleaning Jobs of the Spark cluster, each cleaning Job being mapped to a specific business requirement.
Fig. 3 is a structural diagram of the division into data cleaning units provided by one embodiment of this application. As shown in Fig. 3, starting from the raw data, Jobs are divided according to the business rules; each cleaning Job unit is mapped to a specific business requirement, generating a relation/object mapping that meets the business requirement. The data that have completed preliminary processing produce a new data table. Subsequent operations are based on the new data tables, with one corresponding Job processing each data table, which reduces the number of Jobs in the whole cleaning process.
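The division step can be illustrated as routing each record to the cleaning Job whose business rule it satisfies. In this sketch plain Python lists stand in for the Spark data, and the rule format (Job name mapped to a predicate), the first-match policy, and the `unassigned` bucket are assumptions.

```python
from collections import defaultdict


def divide_into_jobs(records, business_rules):
    """Group records by the first business rule they satisfy.

    business_rules maps a Job name to a predicate over a record; the
    insertion order of the mapping decides which rule wins when several match.
    """
    jobs = defaultdict(list)
    for rec in records:
        for job_name, belongs in business_rules.items():
            if belongs(rec):
                jobs[job_name].append(rec)
                break
        else:
            # No business rule claimed this record.
            jobs["unassigned"].append(rec)
    return dict(jobs)
```

Each resulting group corresponds to one data table and one cleaning Job, which mirrors the one-Job-per-table arrangement described for Fig. 3.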
S5: Assign the data cleaning tasks for the preprocessed data to the corresponding cleaning Jobs of the Spark cluster, each cleaning Job cleaning the preprocessed data with a tree-like cleaning structure according to the requirements of the data cleaning rule base, to obtain the final cleaning result.
In this example, cleaning the preprocessed data with a tree-like cleaning structure to obtain the final cleaning result may include:
S501: Each cleaning Job performs a preliminary pass over its data cleaning task according to the data cleaning rule base.
S502: After the preliminary pass of the multiple cleaning Jobs is complete, a separate data table is generated for each cleaning Job.
S503: If the resilient distributed dataset (RDD) obtained after the preliminary pass needs no further cleaning, it is stored in the separate data tables according to business requirements, to obtain the final cleaning result.
S504: If the RDD obtained after the preliminary pass needs further cleaning, it is assigned to the corresponding cleaning Job again and cleaned again, until the new RDD obtained needs no further processing; the new RDD is then stored in the separate data tables according to business requirements, to obtain the final cleaning result.
Fig. 4 is a diagram of the tree-like cleaning structure used for cleaning, provided by one embodiment of this application. As shown in Fig. 4, the cleaning task is first assigned to several cleaning Jobs. Each Job meets the requirements of the cleaning rule base, but new dirty data are likely to be produced after each Job finishes. To clean these data further while reducing processing time, the earlier resilient distributed dataset (RDD) is reused and assigned to the corresponding Job to be cleaned again; data that need no further cleaning are stored directly as the final result. This continues until no data need further cleaning, and the data are then stored in different data tables according to business requirements.
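The repeat-until-clean behaviour of steps S501 to S504 can be sketched as a loop for a single Job. Plain Python lists stand in for RDDs here, and the function name, the `is_dirty` predicate, and the `max_rounds` safety limit are assumptions added for illustration.

```python
def clean_until_stable(dataset, job, is_dirty, max_rounds=10):
    """Re-apply a cleaning Job to the records that are still dirty.

    Returns (clean records, records still dirty, rounds used). Records that
    pass the check are set aside as final; only the rest are cleaned again.
    """
    clean, pending = [], list(dataset)
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        result = [job(rec) for rec in pending]
        clean += [rec for rec in result if not is_dirty(rec)]
        pending = [rec for rec in result if is_dirty(rec)]
    return clean, pending, rounds
```

As a usage example, with a Job that removes one double space per pass and a check for remaining double spaces, the input `["a b", "a  b", "a   b  c"]` is fully clean after three rounds. Only the records that are still dirty are reprocessed, which corresponds to branching further down the tree only where needed.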
Fig. 2 is a flowchart of the data cleaning process provided by one embodiment of this application. The flowchart in Fig. 2 corresponds to steps S3 to S5 of the above embodiment. The data cleaning rule base is defined before data cleaning begins, and cleaning is carried out according to that predefined rule base throughout the cleaning process. As shown in Fig. 2, the data cleaning process comprises data preprocessing, division into cleaning Jobs, and the corresponding data cleaning performed by the different cleaning Jobs. The cleaning rules used throughout the process all come from the predefined data cleaning rule base, so when business requirements, data formats, and the like change, only the corresponding cleaning rules need to be changed to suit the new data requirements, which gives the method greater generality and adaptability.
With the embodiments of the data cleaning method for big data described above, first, the data cleaning rule base is defined before big data cleaning begins, and the cleaning result conforms to the definitions in the rule base. When business requirements, data formats, and the like change, only the corresponding cleaning rules need to be changed to suit the new data requirements, so the method has greater generality and adaptability. Second, the invention uses the open-source distributed framework Spark for data cleaning. Spark is a memory-based data processing framework, and because accessing data in memory is faster than accessing it on disk, data cleaning efficiency can be improved effectively. In addition, because a data cleaning rule base is established and cleaning follows fixed rules, no human intervention is needed, which cuts the time spent on manual processing and further improves cleaning efficiency. Finally, the data cleaning modules are independent of one another; modifying or adding functionality in one part does not affect the operation of the other modules, so the method is easy to extend.
The devices or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. For convenience of description, the above apparatus is described in terms of separate modules divided by function. When implementing this application, the functions of the modules may of course be implemented in one or more pieces of software and/or hardware, and a module that implements a given function may be implemented by a combination of several submodules. The apparatus embodiments described above are merely illustrative; for example, the division into modules is only a division by logical function, and other divisions are possible in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed.
This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art will clearly understand that this application can be implemented by software plus the necessary general-purpose hardware platform. On that understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.
The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the others. This application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
Although this application has been described by way of embodiments, those of ordinary skill in the art will appreciate that it admits many variations and changes that do not depart from its spirit, and the appended claims are intended to cover those variations and changes.

Claims (9)

Application number
CN201811424289.4A
Priority date
2018-11-27
Filing date
2018-11-27
Title
A kind of data cleaning method for big data
Legal status
Pending
Publication
CN109753496A (en), published 2019-05-14
Family ID
66403332
Country status
CN (1): CN109753496A (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
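Salima Benbernou et al.: "Enhancing data quality by cleaning inconsistent big RDF data", 2017 IEEE International Conference on Big Data (Big Data) *
金翰伟: "Design and Implementation of a Spark-Based Big Data Cleaning Framework", China Excellent Master's Theses Full-text Database, Information Science and Technology *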

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110555019B (en) * | 2019-09-12 | 2023-03-24 | 成都中科大旗软件股份有限公司 | Data cleaning method based on service end
CN110555019A (en) * | 2019-09-12 | 2019-12-10 | 成都中科大旗软件股份有限公司 | Data cleaning method based on service end
CN110837459A (en) * | 2019-11-07 | 2020-02-25 | 广东省科技基础条件平台中心 | Big data-based operation performance analysis method and system
CN111026744A (en) * | 2019-12-11 | 2020-04-17 | 新奥数能科技有限公司 | Data management method and device based on energy station system model framework
CN111522806A (en) * | 2020-04-26 | 2020-08-11 | 陈文海 | Big data cleaning and processing method, device, server and readable storage medium
CN111522806B (en) * | 2020-04-26 | 2023-07-07 | 上海聚均科技有限公司 | Big data cleaning processing method, device, server and readable storage medium
CN112257995A (en) * | 2020-10-09 | 2021-01-22 | 贵州省产品质量检验检测院 | An Internet-based product quality risk monitoring sampling method and system
CN112631755A (en) * | 2020-12-30 | 2021-04-09 | 上海高顿教育科技有限公司 | Data cleaning method and device based on event stream driving
CN113836131A (en) * | 2021-09-29 | 2021-12-24 | 平安科技(深圳)有限公司 | Big data cleaning method and device, computer equipment and storage medium
CN113836131B (en) * | 2021-09-29 | 2024-02-02 | 平安科技(深圳)有限公司 | Big data cleaning method and device, computer equipment and storage medium
CN113868237A (en) * | 2021-09-30 | 2021-12-31 | 杭州数梦工场科技有限公司 | Data cleaning method and device
CN113868237B (en) * | 2021-09-30 | 2025-03-28 | 杭州数梦工场科技有限公司 | Data cleaning method and device
CN114328495A (en) * | 2021-12-31 | 2022-04-12 | 陕西优百信息技术有限公司 | Enterprise material cleaning service system and data cleaning method

Similar Documents

Publication | Title
CN109753496A (en) | A kind of data cleaning method for big data
US11327935B2 (en) | Intelligent data quality
EP2184680B1 (en) | An infrastructure for parallel programming of clusters of machines
Van der Aalst | Decomposing Petri nets for process mining: A generic approach
CN105335814B (en) | Online big data intelligent cloud auditing method and system
CN105976109A (en) | Intelligent auditing method and system based on big data
CA2386272A1 (en) | Collaborative design
CN107301205A (en) | A kind of distributed Query method in real time of big data and system
CN105469204A (en) | Reassembling manufacturing enterprise integrated evaluation system based on deeply integrated big data analysis technology
Stahl et al. | Computationally efficient induction of classification rules with the PMCRI and J-PMCRI frameworks
CN115221337B (en) | Data weaving processing method, device, electronic device and readable storage medium
CN104820708A (en) | Cloud computing platform based big data clustering method and device
US20150236934A1 (en) | Metrics management and monitoring system for service transition and delivery management
CN110825526A (en) | Distributed scheduling method and device based on ER relationship, equipment and storage medium
Chen et al. | NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
CN115328918A (en) | A flexible report generation method, device, electronic device and storage medium
Martinez et al. | Deep learning evolutionary optimization for regression of rotorcraft vibrational spectra
CN118069359A (en) | Data processing method, device, computer equipment and storage medium
CN116151632A (en) | Data architecture method
CN115689463A (en) | Enterprise standing book database management system in rare earth industry
CN109857593B (en) | Data center log missing data recovery method
Lagerström et al. | A methodology for operationalizing enterprise IT architecture and evaluating its modifiability
Betke et al. | Classifying temporal characteristics of job i/o using machine learning techniques
Glimcher et al. | A performance prediction framework for grid-based data mining applications
Ju et al. | Data Cleaning Optimization for Grain Big Data Processing using Task Merging

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190514
