Disclosure of Invention
The invention aims to overcome the technical defects of the prior art by providing a general original log cleaning method that completes the cleaning of different logs through flexible, custom device configuration.
The technical scheme adopted for realizing the purpose of the invention is as follows:
a general original log cleaning device comprises,
the variable storage module is used for storing the metadata corresponding to each type of log, the regular expressions corresponding to the metadata, and the matched fields;
the configuration module is used for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning, wherein the cleaning tasks are in one-to-one correspondence with the metadata;
and the cleaning module is used for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and performing the preset storage.
The configuration is stored using ZooKeeper.
A general original log cleaning method, comprising,
establishing metadata corresponding to each type of log, and storing regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format and compression format corresponding to each cleaning task;
and identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and performing the preset storage.
The configuration is stored using ZooKeeper.
In the cleaning step, the MapReduce program automatically determines the number of reduce tasks according to the size of the input data.
The data to be cleaned is stored in an HDFS directory.
Compared with the prior art, the invention has the beneficial effects that:
the invention manages logs through metadata: a set of metadata is established for each type of log, the logs and variables are stored and reasonably configured, and this information is configured in a management background. Regular expressions are used to screen out the logs that satisfy the rule, intercept the important parameters, and finally establish a correspondence with the variables in the variable storage. At the same time, a MapReduce program is adopted, the number of required reduce tasks is calculated according to the size of the original log file, and the cleaning logic is written from the variable storage and the configuration, finally completing the cleaning flow.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
To reduce the volume of data files, the compression formats currently in widest use are LZO and Snappy. Hadoop is a big data platform architecture for distributed storage and distributed computation; by means of this platform, irregular logs are structured through a MapReduce program and then stored in HDFS, according to a custom storage format and compression format, for later use. The method overcomes the defect of the prior art, in which logs synchronized according to service requirements undergo different, ad hoc processing with a high code repetition rate, and it reduces the workload of developers.
The general original log cleaning device comprises a variable storage module, a configuration module and a cleaning module, wherein,
the variable storage module is used for storing the metadata corresponding to each type of log, the regular expressions corresponding to the metadata, and the matched fields;
the regular expression is stored in the variable storage module, separately from the variables; its function is to acquire the required fields, so its correctness must be ensured, for example:
^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) - - \[(.*)\] "GET /hdpb\.gif\?(.*) HTTP.*" [0-9]{3} [0-9]{1,5} "(.*)"$
the parenthesized groups represent the fields to be extracted and are classified, for example as the IP field, time field, parameter field, and UA field. Several sets of metadata are defined according to the type of the data to be cleaned and the cleaning targets, and suitable metadata is selected according to the service requirements. Adopting metadata simplifies the data model to be cleaned and enables quick configuration of similar or near-identical cleaning jobs.
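By way of illustration, the following is a minimal Java sketch of how such a regular expression can extract the classified fields from one log line; the class name, sample log line and printed labels are hypothetical and not part of the stored configuration:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogFieldExtractor {

        // The regular expression above, written as a Java string literal.
        private static final Pattern LOG_PATTERN = Pattern.compile(
            "^([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}) - - " +
            "\\[(.*)\\] \"GET /hdpb\\.gif\\?(.*) HTTP.*\" [0-9]{3} [0-9]{1,5} \"(.*)\"$");

        public static void main(String[] args) {
            // Illustrative log line matching the pattern.
            String line = "10.0.0.1 - - [21/Mar/2019:10:15:32 +0800] "
                + "\"GET /hdpb.gif?uid=1&vid=2&pid=3 HTTP/1.1\" 200 43 \"Mozilla/5.0\"";

            Matcher m = LOG_PATTERN.matcher(line);
            if (m.matches()) {
                System.out.println("ip     = " + m.group(1)); // IP field
                System.out.println("time   = " + m.group(2)); // time field
                System.out.println("params = " + m.group(3)); // parameter field
                System.out.println("ua     = " + m.group(4)); // UA field
            }
        }
    }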
The configuration module is used for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning; the cleaning tasks are in one-to-one correspondence with the metadata. Each cleaning requirement is turned directly into a specific stored task, together with the storage and compression format and other necessary factors for that task, so that the corresponding cleaning process can be realized simply by invoking the matching task.
The cleaning module is used for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and performing the preset storage.
Both the configuration module and the variable storage module use ZooKeeper for storage; the directory format of the stored configuration is as follows:
(1) First-level directory: /etl
(2) Second-level directory: /etl/$task, where $task represents a specific cleaning task name; each cleaning task is unique.
(3) Third-level directories: /etl/$task/regexp, /etl/$task/field and /etl/$task/outformat. The regular expression configuration is stored at /etl/$task/regexp, the field variables are stored at /etl/$task/field, and the storage format and compression format are set in /etl/$task/outformat.
the third-level directories are the most important: they store the variable values and the format configuration.
The content stored under regexp corresponds one-to-one with the capture groups of the regular expression: the type of each group is stored, and there are three types: ip, str and common.
The content stored under field takes the form uid,vid,pid, with the field names separated by commas.
The content stored under outformat contains only two values, separated by a comma, which represent the compression format and the storage format respectively.
Meanwhile, the invention also discloses a general original log cleaning method, which comprises the following steps,
establishing metadata corresponding to each type of log, and storing regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format and compression format corresponding to each cleaning task; the configuration is stored using ZooKeeper.
The corresponding metadata is identified according to the log type, the cleaning step is completed with a MapReduce program according to the cleaning task configuration, and the preset storage is performed. At the same time, the MapReduce program automatically determines the number of reduce tasks according to the size of the input data, so as to produce block files of appropriate size.
number of reduce tasks = total size of input data / (HDFS block size × 3) + 1, where the default block size is 128 MB.
Here 3 is the compression ratio; according to the compression algorithm, this value can be adjusted so that the file produced by each reduce task is slightly smaller than the block size.
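By way of illustration, the following minimal Java sketch computes this formula with the standard Hadoop FileSystem API; the class and method names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReduceCountEstimator {

        // number of reduces = total input size / (block size * compression ratio) + 1,
        // so that each compressed reduce output stays slightly below one block.
        public static int estimateReduces(Configuration conf, String inputDir,
                                          long compressionRatio) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            Path input = new Path(inputDir);

            long totalSize = fs.getContentSummary(input).getLength(); // bytes of raw input
            long blockSize = fs.getDefaultBlockSize(input);           // typically 128 MB

            return (int) (totalSize / (blockSize * compressionRatio)) + 1;
        }
    }

The resulting value would then be applied when the job is submitted, e.g. job.setNumReduceTasks(estimateReduces(conf, inputDir, 3)).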
Before the cleaning job is submitted, the ZooKeeper configuration is read according to the cleaning task information: the compression and storage configuration is found first and used to submit the cleaning job, and in the MapReduce stage the field configuration and the regular matching relation are read from ZooKeeper according to the cleaning task information, completing the cleaning of the data. The principles of MapReduce are not discussed here; they are not a key technical point of this invention, which merely uses these techniques. The invention improves development efficiency, automatically optimizes the size of HDFS block files, and reduces the probability of mistakes by new staff.
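By way of illustration, the cleaning logic in the map stage can be sketched as follows in Java; the configuration key etl.regexp is hypothetical, standing in for the values read from ZooKeeper as described above, and the tab-separated output record is one possible choice of structured format:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LogCleanMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        private Pattern pattern;
        private final Text out = new Text();

        @Override
        protected void setup(Context context) {
            // In a real job the pattern would be read from the ZooKeeper
            // configuration of the current cleaning task (see above); here
            // it is taken from the job configuration for simplicity.
            pattern = Pattern.compile(context.getConfiguration().get("etl.regexp"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = pattern.matcher(value.toString());
            if (!m.matches()) {
                return; // lines that do not satisfy the rule are filtered out
            }
            // Join the captured groups into one structured record.
            StringBuilder sb = new StringBuilder();
            for (int i = 1; i <= m.groupCount(); i++) {
                if (i > 1) sb.append('\t');
                sb.append(m.group(i));
            }
            out.set(sb.toString());
            context.write(NullWritable.get(), out);
        }
    }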
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.