CN106599244B - General original log cleaning device and method - Google Patents

General original log cleaning device and method

Info

Publication number
CN106599244B
Authority
CN
China
Prior art keywords
log
cleaning
metadata
storage
storing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611183585.0A
Other languages
Chinese (zh)
Other versions
CN106599244A (en)
Inventor
张亚军
田文宝
夏鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Feihu Information Technology Tianjin Co Ltd
Original Assignee
Feihu Information Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Feihu Information Technology Tianjin Co Ltd
Priority to CN201611183585.0A
Publication of CN106599244A
Application granted
Publication of CN106599244B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a general original log cleaning device comprising: a variable storage module, used for storing the metadata corresponding to each type of log, together with the regular expression and the matched fields corresponding to each set of metadata; a configuration module; and a cleaning module, used for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and storing the result as preconfigured. The invention is managed through metadata: a set of metadata is established for each type of log, the logs and variables are stored and configured accordingly, and this information is configured in a management backend. The regular expression is used to select the logs that satisfy the rule and to extract the important parameters, which are finally mapped to the variables held in the variable storage.

Description

General original log cleaning device and method
Technical Field
The invention relates to the technical field of big data processing, in particular to a general original log cleaning device and method.
Background
When log analysis is performed, the log data is unstructured, and not all of it is needed. The data therefore has to be cleaned, that is, the character strings in it are filtered and structured.
Some large internet companies produce many kinds of logs, all of which need to be cleaned. The volume of log data is huge, occupying several terabytes of storage per day, and this raises two problems. First, there are many log types and each one needs to be cleaned; handling each log type with its own dedicated processing wastes time. Second, the large volume of logs occupies a great deal of storage, and the network I/O consumed when the logs are read again is high.
Disclosure of Invention
The invention aims to overcome the technical defects of the prior art and provides a general original log cleaning method that completes the cleaning of different logs through flexible, customizable configuration of the device.
The technical scheme adopted for realizing the purpose of the invention is as follows:
A general original log cleaning device comprises:
a variable storage module, used for storing the metadata corresponding to each type of log, together with the regular expression and the matched fields corresponding to each set of metadata;
a configuration module, used for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format, and compression format of the log before and after cleaning, wherein the cleaning tasks correspond one-to-one with the metadata;
and a cleaning module, used for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and storing the result as preconfigured.
The configuration is stored in ZooKeeper.
A general original log cleaning method comprises:
establishing metadata corresponding to each type of log, and storing the regular expression and matched fields corresponding to each set of metadata;
configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format, and compression format corresponding to each cleaning task;
and identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result as preconfigured.
The configuration is stored in ZooKeeper.
In the cleaning step, the MapReduce program automatically determines the number of reduce tasks from the size of the input data.
The data to be cleaned is stored in an HDFS directory.
Compared with the prior art, the invention has the beneficial effects that:
The invention is managed through metadata: a set of metadata is established for each type of log, the logs and variables are stored and configured accordingly, and this information is configured in a management backend. The regular expression is used to select the logs that satisfy the rule and to extract the important parameters, which are finally mapped to the variables held in the variable storage. In addition, a MapReduce program is used: the required number of reduce tasks is calculated from the size of the original log file, and the cleaning logic is driven by the variable storage and the configuration to complete the cleaning flow.
Drawings
Fig. 1 is a flow chart of a general original log cleaning method according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
To reduce the size of data files, the most commonly used compression formats at present are LZO and Snappy. Hadoop is a big data platform architecture for distributed storage and distributed computation. On this platform, irregular logs are structured by a MapReduce program and then stored in HDFS for later use, in a custom storage format and compression format. This overcomes the drawback of the prior art, in which logs synchronized for different business requirements are each processed differently and the code repetition rate is high, and it reduces the workload of developers.
The general original log cleaning device comprises a variable storage module, a configuration module, and a cleaning module, wherein:
the variable storage module is used for storing the metadata corresponding to each type of log, together with the regular expression and the matched fields corresponding to each set of metadata;
the regular expression is kept in the variable storage module, stored separately from the variables. Its function is to capture the required fields, so its correctness must be ensured. For example:
^([0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3})--\\[(.*)\\]\"GET/hdpb.gif\\?(.*)HTTP.*\"[0-9]{3}[0-9]{1,5}\"(.*)\"$"
Each parenthesized group represents a field to be extracted, and the groups are classified, for example, as an IP field, a time field, a parameter field, and a UA (user agent) field. Several sets of metadata are defined according to the type of data to be cleaned and the cleaning target, and the appropriate metadata is selected according to the business requirements. Using metadata simplifies the model of the data to be cleaned and allows similar or nearly identical cleaning jobs to be configured quickly.
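As a minimal sketch of how such an expression can yield named fields, the following Python snippet applies a whitespace-restored variant of the example pattern to one sample line. The restored spaces and escaped dots, the sample log line, and the field names ip, time, params, and ua are assumptions made for illustration; the description only states that the groups correspond to fields of these kinds.

```python
import re

# Assumed, whitespace-restored variant of the example expression (hypothetical).
LOG_PATTERN = re.compile(
    r'^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) - - \[(.*)\] '
    r'"GET /hdpb\.gif\?(.*) HTTP.*" [0-9]{3} [0-9]{1,5} "(.*)"$'
)

# Hypothetical field names, one per capture group, in group order.
FIELDS = ["ip", "time", "params", "ua"]


def extract_fields(line):
    """Return {field name: captured value}, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # lines that do not satisfy the rule are screened out
    return dict(zip(FIELDS, match.groups()))


# Made-up sample line in the shape the assumed pattern expects.
sample = ('10.0.0.1 - - [20/Dec/2016:10:00:00 +0800] '
          '"GET /hdpb.gif?uid=1&vid=2 HTTP/1.1" 200 43 "Mozilla/5.0"')
record = extract_fields(sample)
```

Pairing the group order with a separate list of names keeps the expression and the field names independently configurable, which is in the spirit of storing them separately.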
The configuration module is used for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format, and compression format of the log before and after cleaning; the cleaning tasks correspond one-to-one with the metadata. Each cleaning requirement is turned directly into a concrete task and stored, along with the storage format, compression format, and other necessary settings for that task, so that the corresponding cleaning process can be carried out simply by invoking the matching task.
The cleaning module is used for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and storing the result as preconfigured.
Both the configuration and the variable storage module are stored in ZooKeeper, with the following directory layout:
(1) First-level directory: /etl
(2) Second-level directory: /etl/$task, where $task is the name of a specific cleaning task; each cleaning task name is unique.
(3) Third-level directories: /etl/$task/field, /etl/$task/outformat, and /etl/$task/regexp. For example, regular expressions are stored in /etl/$task/field and variables are stored in /etl/$task/regexp. The storage format and the compression format are set in /etl/$task/outformat.
The third-level directories are the most important: they hold the variable values and the format configuration.
The content of regexp matches the number of groups in the regular expression, in one-to-one correspondence, and stores the type of each group. There are three types: ip, str, and common.
The content of field takes a form such as uid, vid, pid, with the fields separated by commas.
The content of outformat consists of exactly two class names, separated by a comma, which represent the compression format and the storage format respectively.
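The node contents described above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory dictionary stands in for the ZooKeeper tree, the task name demo_task and every node value are hypothetical, the type spellings ip, str, and common are assumed, and the two Hadoop class names are merely plausible example values for the compression and storage entries.

```python
from dataclasses import dataclass

# In-memory stand-in for the ZooKeeper tree; "demo_task" and all values are hypothetical.
ZNODES = {
    "/etl/demo_task/field": "uid,vid,pid",
    "/etl/demo_task/regexp": "ip,str,common",
    "/etl/demo_task/outformat": (
        "org.apache.hadoop.io.compress.SnappyCodec,"
        "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
    ),
}

# Assumed spellings of the three group types named in the description.
ALLOWED_TYPES = {"ip", "str", "common"}


@dataclass
class TaskConfig:
    fields: list            # comma-separated field names from .../field
    group_types: list       # one type per regular-expression group from .../regexp
    compression_class: str  # first class name in .../outformat
    storage_class: str      # second class name in .../outformat


def load_task_config(znodes, task):
    """Parse the three third-level nodes of one cleaning task."""
    base = f"/etl/{task}"
    fields = znodes[f"{base}/field"].split(",")
    group_types = znodes[f"{base}/regexp"].split(",")
    unknown = set(group_types) - ALLOWED_TYPES
    if unknown:
        raise ValueError(f"unknown group types: {sorted(unknown)}")
    parts = znodes[f"{base}/outformat"].split(",")
    if len(parts) != 2:
        raise ValueError("outformat must hold exactly two class names")
    return TaskConfig(fields, group_types, parts[0], parts[1])


config = load_task_config(ZNODES, "demo_task")
```

Validating the node values when they are loaded means a misconfigured task fails before any job is submitted rather than partway through cleaning.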
The invention is managed through metadata: a set of metadata is established for each type of log, the logs and variables are stored and configured accordingly, and this information is configured in a management backend. The regular expression is used to select the logs that satisfy the rule and to extract the important parameters, which are finally mapped to the variables held in the variable storage. In addition, a MapReduce program is used: the required number of reduce tasks is calculated from the size of the original log file, and the cleaning logic is driven by the variable storage and the configuration to complete the cleaning flow.
The invention also discloses a general original log cleaning method, comprising the following steps:
establishing metadata corresponding to each type of log, and storing the regular expression and matched fields corresponding to each set of metadata;
configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format, and compression format corresponding to each cleaning task; the configuration is stored in ZooKeeper;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result as preconfigured. At the same time, the MapReduce program automatically determines the number of reduce tasks from the size of the input data, so that output files of a suitable size are produced.
Number of reduce tasks = (total size of input data / (HDFS block size × 3)) + 1, where the default block size is 128 MB.
Here 3 is the compression ratio. This value can be adjusted according to the compression algorithm so that the file produced by each reduce task is as close as possible to, but slightly smaller than, the block size.
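A minimal sketch of this calculation follows. It reads the formula as the total input size divided by the product of the block size and the compression ratio, plus one; that reading, the use of integer division, and the function and parameter names are assumptions.

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size stated above


def reduce_count(total_input_bytes, block_size=DEFAULT_BLOCK_SIZE, compression_ratio=3):
    """Number of reduce tasks so that each compressed output stays just under one block."""
    # Each reduce task takes roughly block_size * compression_ratio bytes of raw input;
    # the extra task keeps every output file slightly below the block size.
    return total_input_bytes // (block_size * compression_ratio) + 1


# Example: 10 GiB of raw input gives 10240 MiB // 384 MiB + 1 = 27 reduce tasks.
tasks_for_10_gib = reduce_count(10 * 1024 ** 3)
```

With 27 tasks, each one handles about 379 MiB of raw input, which at a ratio of 3 compresses to roughly 126 MiB, just under the 128 MB block.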
Before the cleaning job is submitted, the configuration in ZooKeeper is read according to the cleaning task information; the compression and storage configuration is looked up first and used to submit the job. In the MapReduce stage, the field configuration and the regular-expression matching relation are read from ZooKeeper according to the cleaning task information, and the data is cleaned accordingly. The principles of MapReduce are not discussed here, since they are not a key technical point of the invention, which merely makes use of them. The invention improves development efficiency, automatically tunes output file size to the HDFS block size, and reduces the chance of mistakes by new staff.
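One way the per-record work in the map stage might look is sketched below: a captured parameter string is projected onto the configured field list and emitted as one structured line. The query-string shape of the parameters, the tab-separated output, the placeholder for missing fields, and the function name are all assumptions; the description does not specify the output layout.

```python
from urllib.parse import parse_qs


def clean_record(params, fields, missing="-"):
    """Project a captured query string onto the configured fields, in field order."""
    parsed = parse_qs(params, keep_blank_values=True)
    # Take the first value of each configured field; unknown parameters are dropped.
    return "\t".join(parsed.get(name, [missing])[0] for name in fields)


# Hypothetical field list in the comma-separated form uid,vid,pid.
configured_fields = "uid,vid,pid".split(",")
line = clean_record("uid=1&vid=2&extra=x", configured_fields)
```

Because the field list comes from configuration rather than code, the same function can serve any cleaning task whose parameters follow this shape.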
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and these are also intended to fall within the scope of the present invention.

Claims (2)

CN201611183585.0A | 2016-12-20 | 2016-12-20 | General original log cleaning device and method | Active | CN106599244B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201611183585.0A (CN106599244B, en) | 2016-12-20 | 2016-12-20 | General original log cleaning device and method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201611183585.0A (CN106599244B, en) | 2016-12-20 | 2016-12-20 | General original log cleaning device and method

Publications (2)

Publication Number | Publication Date
CN106599244A | 2017-04-26
CN106599244B | 2024-01-05

Family

ID=58600257

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201611183585.0A (Active, CN106599244B, en) | General original log cleaning device and method | 2016-12-20 | 2016-12-20

Country Status (1)

Country | Link
CN (1) | CN106599244B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109359103A (en)* | 2018-09-04 | 2019-02-19 | 河南智云数据信息技术股份有限公司 | A kind of data aggregate cleaning method and system
CN115509851A (en)* | 2022-09-14 | 2022-12-23 | 易纳购科技(北京)有限公司 | Page monitoring method, device and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1908935A (en)* | 2006-08-01 | 2007-02-07 | 华为技术有限公司 | Search method and system of a natural language
CN1983952A (en)* | 2005-12-14 | 2007-06-20 | 中兴通讯股份有限公司 | Method and system for synchronizing network administration data in network optimizing system
CN104298771A (en)* | 2014-10-30 | 2015-01-21 | 南京信息工程大学 | Massive web log data query and analysis method
CN104391989A (en)* | 2014-12-16 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Distributed ETL (extract transform load) all-in-one machine system
CN105447099A (en)* | 2015-11-11 | 2016-03-30 | 中国建设银行股份有限公司 | Log structured information extraction method and apparatus
CN105706045A (en)* | 2013-07-19 | 2016-06-22 | 泰必高软件公司 | Semantics-oriented analysis of log message content
CN106021554A (en)* | 2016-05-30 | 2016-10-12 | 北京奇艺世纪科技有限公司 | Log analysis method and device
CN106227862A (en)* | 2016-07-29 | 2016-12-14 | 浪潮软件集团有限公司 | E-commerce data integration method based on distribution

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR20140042428A (en)* | 2012-09-28 | 2014-04-07 | 삼성전자주식회사 | Computing system and data management method thereof

Also Published As

Publication number | Publication date
CN106599244A (en) | 2017-04-26

Similar Documents

Publication | Title
US11068439B2 (en) | Unsupervised method for enriching RDF data sources from denormalized data
CN102982075B (en) | Support to access the system and method for heterogeneous data source
CN108616419B (en) | Data packet acquisition and analysis system and method based on Docker
CN109710703A (en) | A kind of generation method and device of genetic connection network
CN102567312A (en) | Machine translation method based on distributive parallel computation framework
US9418241B2 (en) | Unified platform for big data processing
CN112115113B (en) | Data storage system, method, device, equipment and storage medium
CN103927331A (en) | Data querying method, data querying device and data querying system
KR20150092586A (en) | Method and Apparatus for Processing Exploding Data Stream
CN106294745A (en) | Big data cleaning method and device
CN106777142A (en) | Service layer's system and method based on mobile Internet mass data
Kim et al. | A study on utilization of spatial information in heterogeneous system based on apache nifi
Bala et al. | P-ETL: Parallel-ETL based on the MapReduce paradigm
CN114817389A (en) | Data processing method, data processing device, storage medium and electronic equipment
CN105786941B (en) | Information mining method and device
CN110825453A (en) | Data processing method and device based on big data platform
CN106599244B (en) | General original log cleaning device and method
CN105468770A (en) | Data processing method and system
CN110019152A (en) | A kind of big data cleaning method
CN111324608A (en) | Model multiplexing method, device, equipment and storage medium
CN106599241A (en) | Big data visual management method for GIS software
US11061736B2 (en) | Multiple parallel reducer types in a single map-reduce job
Chen et al. | Related technologies
CN106843822B (en) | Execution code generation method and equipment
CN112860954A (en) | Real-time computing method and real-time computing system

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
