Disclosure of Invention
The invention provides a data analysis method, a data analysis device and a storage medium, whose main purpose is to audit data before it is stored in a database and to improve the efficiency of storing data in the database.
In order to achieve the above object, the present invention provides a data parsing method, including the steps of:
an acquisition step: acquiring network data from a preset website in real time or at regular intervals using a pre-configured network data acquisition script, preprocessing the acquired network data, and storing the preprocessed network data, as data to be analyzed, in a specified network data file;
a caching step: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a different data identifier to each piece of data to be analyzed collected from each network data file, and caching the data to which the identifiers have been added;
a distribution step: uniformly distributing the successfully cached data to be analyzed to the cache partitions of the cache space; and
a matching step: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification, storing the analyzed data in a database.
Preferably, if the analyzed data does not pass the verification of the preset template, warning information in a preset format, together with a modification suggestion, is sent to a preset client; after a template creation request submitted by the preset client in response to the warning information is received, the data to be analyzed in the cache partition is analyzed again, a new template is created from the analyzed data, and the differences between the new template and the original template are written to a log file.
Preferably, the preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting date formats in the network data into a preset format, and deleting duplicate data from the network data.
Preferably, if the number of pieces of data to be analyzed in the cache partition is greater than or equal to a first preset number, new cache partitions are created according to a predetermined partition establishing rule; and
if the number of pieces of data to be analyzed in the cache partition is less than or equal to a second preset number, a corresponding number of cache partitions are logged out according to a predetermined partition logout rule.
Preferably, the partition establishing rule includes determining the number of new partitions corresponding to the current first differential speed according to a predetermined mapping between the first differential speed and the number of new partitions; and
the partition logout rule includes determining the number of partitions to log out corresponding to the current second differential speed according to a predetermined mapping between the second differential speed and the number of partitions to log out.
In addition, to achieve the above object, the present invention further provides an electronic device, which includes a memory and a processor, wherein the memory stores a data analysis program operable on the processor, and the data analysis program implements the following steps when executed by the processor:
an acquisition step: acquiring network data from a preset website in real time or at regular intervals using a pre-configured network data acquisition script, preprocessing the acquired network data, and storing the preprocessed network data, as data to be analyzed, in a specified network data file;
a caching step: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a different data identifier to each piece of data to be analyzed collected from each network data file, and caching the data to which the identifiers have been added;
a distribution step: uniformly distributing the successfully cached data to be analyzed to the cache partitions of the cache space; and
a matching step: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification, storing the analyzed data in a database.
Preferably, if the analyzed data does not pass the verification of the preset template, warning information in a preset format, together with a modification suggestion, is sent to a preset client; after a template creation request submitted by the preset client in response to the warning information is received, the data to be analyzed in the cache partition is analyzed again, a new template is created from the analyzed data, and the differences between the new template and the original template are written to a log file.
Preferably, the preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting date formats in the network data into a preset format, and deleting duplicate data from the network data.
Preferably, if the number of pieces of data to be analyzed in the cache partition is greater than or equal to a first preset number, new cache partitions are created according to a predetermined partition establishing rule; and
if the number of pieces of data to be analyzed in the cache partition is less than or equal to a second preset number, a corresponding number of cache partitions are logged out according to a predetermined partition logout rule.
Compared with the prior art, the present invention acquires network data from a preset website using a pre-configured network data acquisition script, adds a different data identifier to each piece of data to be analyzed, establishes a mapping between the data to be analyzed and the identifiers, and caches the data to which the identifiers have been added; uniformly distributes the successfully cached data to be analyzed to the cache partitions of the cache space; and analyzes the data to be analyzed in each cache partition, matches the analyzed data against a preset template, and, if the match succeeds, stores the analyzed data in a database. Data is thereby audited before it is stored in the database, and the efficiency of storing data in the database is improved.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention. The method may be performed by an electronic device, which may be implemented by software and/or hardware.
In this embodiment, the data analysis method includes:
Step S10: acquiring network data from a preset website in real time or at regular intervals using a pre-configured network data acquisition script, preprocessing the acquired network data, and saving the preprocessed network data, as data to be analyzed, in a specified network data file.
The preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting date formats in the network data into a preset format, and deleting duplicate data from the network data.
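The preprocessing operations listed above can be sketched in Python as follows. This is only an illustrative sketch: the set of characters treated as "special", the recognised input date layouts, the output date format, and all function names are assumptions and are not specified in this description.

```python
import json
import re
from datetime import datetime

# Assumed for illustration: which characters are "special" and which
# date layouts are recognised are not defined in the description.
SPECIAL_PUNCT = re.compile(r"[#$%^&*~`|\\<>]")
DATE_INPUT_FORMATS = ("%Y/%m/%d", "%d.%m.%Y")
DATE_OUTPUT_FORMAT = "%Y-%m-%d"


def to_half_width(text):
    """Convert full-width digits and Latin letters to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        # Full-width ASCII variants occupy U+FF01..U+FF5E.
        if 0xFF01 <= code <= 0xFF5E and chr(code - 0xFEE0).isalnum():
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def normalize_date(value):
    """Rewrite a recognised date into the preset format; else return as-is."""
    for fmt in DATE_INPUT_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime(DATE_OUTPUT_FORMAT)
        except ValueError:
            continue
    return value


def clean_value(value):
    """Apply the character-level cleaning steps to one field value."""
    if not isinstance(value, str):
        return value
    value = to_half_width(value)
    value = value.replace('"', "")       # remove double quotation marks
    value = SPECIAL_PUNCT.sub("", value)  # delete special punctuation
    return normalize_date(value.strip())


def preprocess(records):
    """Clean every record and drop duplicates, keeping first occurrences."""
    seen = set()
    result = []
    for record in records:
        cleaned = {key: clean_value(val) for key, val in record.items()}
        fingerprint = json.dumps(cleaned, sort_keys=True, ensure_ascii=False)
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(cleaned)
    return result
```

Deduplication is performed after cleaning, so two records that differ only in character width or quoting collapse into one.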
In an embodiment, the network data acquisition script may be written in Python or JavaScript. The network acquisition device may be a terminal, such as a desktop computer, that runs the network data acquisition script. The script acquires information from the website in real time or at regular intervals according to the website and the acquisition conditions provided by research personnel. The preprocessed data may be in JSON format and stored in the network data file as key-value pairs (Key-Value).
Step S20: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a different data identifier to each piece of data to be analyzed collected from each network data file, and caching the data to which the identifiers have been added.
In this embodiment, Flume may be used to collect the data and Kafka may be used to cache it. Flume is a highly available distributed system for collecting, aggregating and transmitting massive volumes of logs; it supports customizing various data senders (such as Kafka, HDFS and the like) in the log system, which makes data collection convenient. Kafka is a distributed message queue that can process large amounts of data in real time to meet a variety of demand scenarios, and offers high performance, persistence, multi-copy backup and horizontal scalability.
In another embodiment of the present invention, after the data to be analyzed is collected, a different data identifier is added to each piece of data to be analyzed collected from each network data file, and the data to be analyzed with its identifier is backed up to a preset storage space. A caching operation then caches the backed-up data to be analyzed in a specified cache space, and once the caching has succeeded, the backed-up data to be analyzed is deleted from the preset storage space.
For example, Flume collects data A to be analyzed, adds data identifier 1 to it to form data 1-A to be analyzed, stores data 1-A in Flume, and sends data 1-A to Kafka.
After receiving data 1-A, Kafka sends data identifier 1 back to Flume. Receipt of data identifier 1 indicates to Flume that the data to be analyzed with identifier 1 has been successfully cached by Kafka; Flume then looks up data 1-A by the received identifier and deletes it.
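The back-up, acknowledge and delete exchange described above can be illustrated with a small in-memory simulation. This sketch does not use the Flume or Kafka APIs; the `Collector` and `Cache` classes and their methods are hypothetical stand-ins for the collecting side and the caching side.

```python
class Cache:
    """Hypothetical stand-in for the caching side (Kafka in the text)."""

    def __init__(self, fail=False):
        self.fail = fail
        self.items = {}

    def store(self, data_id, payload):
        """Cache the payload and return its identifier as acknowledgement."""
        if self.fail:
            return None  # no acknowledgement: caching did not succeed
        self.items[data_id] = payload
        return data_id


class Collector:
    """Hypothetical stand-in for the collecting side (Flume in the text)."""

    def __init__(self, cache):
        self.cache = cache
        self.backup = {}   # the "preset storage space" for backups
        self.next_id = 1

    def collect(self, payload):
        data_id = self.next_id
        self.next_id += 1
        self.backup[data_id] = payload            # back up before sending
        ack = self.cache.store(data_id, payload)  # caching operation
        if ack is not None:
            self.backup.pop(ack, None)            # delete backup once cached
        return data_id


collector = Collector(Cache())
collector.collect("A")
print(collector.backup)       # {} because the backup was deleted after the ack
print(collector.cache.items)  # {1: 'A'}
```

If the cache never acknowledges, the backed-up copy stays in the collector's storage and remains available for a later resend.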
Step S30: uniformly distributing the successfully cached data to be analyzed to the cache partitions of the cache space.
In an embodiment of the present invention, Kafka uniformly distributes the data to be analyzed in its queue to the cache partitions of the cache space. The data to be analyzed stored in the data queue is first-in, first-out. For example, if the data queue receives data to be analyzed X1 first, then X2, and then X3, the data queue outputs the data to be analyzed in the order X1, X2, X3.
In the same embodiment of the present invention, the partition establishing rule is as follows: if the number of pieces of data to be analyzed in the data queue is greater than or equal to a first preset number, a first differential speed (P1 - P2) is calculated from the receiving speed (for example, P1 pieces/millisecond) and the output speed (for example, P2 pieces/millisecond) of the data to be analyzed in the queue, and the number of new partitions corresponding to the current first differential speed is determined according to a predetermined mapping between the first differential speed and the number of new partitions.
The number of new partitions is looked up in a table indexed by the first differential speed.
The partition logout rule is as follows: if the number of pieces of data to be analyzed in the queue is less than or equal to a second preset number, a second differential speed (P2 - P1) is calculated from the receiving speed (for example, P1 pieces/millisecond) and the output speed (for example, P2 pieces/millisecond) of the data to be analyzed in the queue, and the number of partitions to log out corresponding to the current second differential speed is determined according to a predetermined mapping between the second differential speed and the number of partitions to log out.
The number of partitions to log out is looked up in a table indexed by the second differential speed.
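The two rules can be sketched together in Python. All numeric values below (the two preset numbers and both lookup tables) are hypothetical placeholders chosen for illustration; the actual mappings are not given in this description.

```python
import bisect

# Hypothetical values, for illustration only.
FIRST_PRESET_NUMBER = 1000   # queue length that triggers partition creation
SECOND_PRESET_NUMBER = 100   # queue length that triggers partition logout

# Each entry: (lower bound of differential speed in pieces/ms, partition count).
CREATE_TABLE = [(0, 1), (10, 2), (50, 4)]
LOGOUT_TABLE = [(0, 1), (10, 2), (50, 3)]


def lookup(table, differential_speed):
    """Return the count for the highest lower bound not above the speed."""
    bounds = [bound for bound, _ in table]
    index = bisect.bisect_right(bounds, differential_speed) - 1
    return table[index][1] if index >= 0 else 0


def partition_delta(queue_length, p1, p2):
    """Positive result: partitions to create; negative: partitions to log out.

    p1 is the receiving speed and p2 the output speed of the queue.
    """
    if queue_length >= FIRST_PRESET_NUMBER:
        return lookup(CREATE_TABLE, p1 - p2)    # first differential speed
    if queue_length <= SECOND_PRESET_NUMBER:
        return -lookup(LOGOUT_TABLE, p2 - p1)   # second differential speed
    return 0


print(partition_delta(1500, p1=30, p2=10))  # 2
print(partition_delta(50, p1=5, p2=60))     # -3
```

In this sketch a negative differential speed falls below every table entry and yields no change, which is one possible reading of the rule rather than something the description states.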
Step S40: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification, storing the analyzed data in a database.
If the analyzed data does not pass the verification of the preset template, warning information in a preset format, together with a modification suggestion, is sent to an administrator. After a template creation request made by the administrator on the basis of the suggestion in the warning e-mail is received, the data to be analyzed in the cache partition is analyzed again, a new template is created from the analyzed data, and the differences between the new template and the original template are written to a log file.
The preset template defines a rule for storing the analyzed data into the database and is used for judging whether the analyzed data meet the condition of storing the analyzed data into the database.
In one embodiment of the invention, the preset template specifies the conditions that data must meet to be stored in the database; if the analyzed data meets those conditions, it is allowed to be stored in the database. For example, a preset template includes four keys (the keys of the key-value pairs, also called fields), namely id, title, url and city, and defines a rule for the content stored in each field.
For example, suppose the template specifies that the id field may contain only digits. The fields of the analyzed data are compared with the fields of the preset template; if a field of the analyzed data exists in the preset template, its content is checked against the rule for that field. If the content of the id field in the analyzed data consists only of digits, the id field passes the verification of the preset template.
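A minimal Python sketch of such a template check is given below. The field names id, title, url and city come from the example above; the regular expressions for title, url and city, and the function name `check`, are assumptions introduced only for illustration.

```python
import re

# Only the "digits only" rule for id comes from the text; the other
# three patterns are hypothetical examples of per-field rules.
TEMPLATE = {
    "id": re.compile(r"\d+"),
    "title": re.compile(r".+"),
    "url": re.compile(r"https?://\S+"),
    "city": re.compile(r"[^\d]+"),
}


def check(record, template=TEMPLATE):
    """Return a list of (field, reason) failures; an empty list means pass."""
    failures = []
    for field, value in record.items():
        rule = template.get(field)
        if rule is None:
            failures.append((field, "field not in template"))
        elif not rule.fullmatch(str(value)):
            failures.append((field, "content violates rule"))
    return failures


record = {"id": "42", "title": "News", "url": "https://example.com/a", "city": "Shenzhen"}
print(check(record))         # []
print(check({"id": "4a2"}))  # [('id', 'content violates rule')]
```

Returning the list of failures, rather than a bare boolean, gives the warning step enough detail to build a modification suggestion.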
In another embodiment, if a new field is found in the analyzed data after the data to be analyzed has been analyzed, the new field is added to the original template to form a new template. For example, after a certain piece of data to be analyzed is analyzed, it is found to contain a new name field that the template does not have. According to the administrator's settings, the name of the name field and the requirements on its content are added to the template to form a new template, and the differences between the new template and the original template are written to a log file.
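Forming the new template and recording its differences can be sketched as follows. The representation of a template as a dictionary of field names to rule strings, and the function name `extend_template`, are assumptions for illustration; the logger is expected to be configured by the caller, for example with a `logging.FileHandler`, so that the differences end up in a log file.

```python
import json
import logging


def extend_template(template, new_fields, logger):
    """Return (new_template, diff) without modifying the original template.

    template and new_fields map field names to rule strings (assumed format).
    Fields already present in the template keep their existing rule.
    """
    new_template = dict(template)
    diff = {}
    for name, rule in new_fields.items():
        if name not in template:
            new_template[name] = rule
            diff[name] = rule
    if diff:
        # Only the difference between new and original template is logged.
        logger.info("template difference: %s", json.dumps(diff, sort_keys=True))
    return new_template, diff


logger = logging.getLogger("template")
original = {"id": r"\d+", "title": ".+"}
updated, added = extend_template(original, {"name": ".+"}, logger)
print(sorted(updated))  # ['id', 'name', 'title']
print(added)            # {'name': '.+'}
```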
The invention also provides an electronic device. Fig. 2 is a schematic view of an internal structure of an electronic device according to an embodiment of the invention.
In this embodiment, the electronic device 1 includes at least a memory 11, a processor 12, a network interface 13, and a communication bus.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, hard disk, multi-media card, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, for example a hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic apparatus 1 in other embodiments, such as a plug-in hard disk provided on the electronic apparatus 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic apparatus 1. The memory 11 may be used to store not only application software installed in the electronic apparatus 1 and various types of data, such as codes of the data analysis program 10, but also temporarily store data that has been output or is to be output.
The processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data processing chip in some embodiments, and is used for executing program code stored in the memory 11 or processing data, such as executing the data analysis program 10.
The network interface 13 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface), and is typically used to establish a communication link between the electronic apparatus 1 and other electronic devices.
The communication bus is used to enable connection communication between these components.
Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
Fig. 2 only shows the electronic device 1 with the components 11-13 and the data analysis program 10. Those skilled in the art will understand that the structure shown in Fig. 2 does not limit the electronic device 1, which may comprise fewer or more components than shown, a combination of certain components, or a different arrangement of components.
In the embodiment of the electronic device 1 shown in fig. 2, the memory 11 stores the data analysis program 10, and the processor 12 implements the following steps when executing the data analysis program 10 stored in the memory 11:
an acquisition step: acquiring, by each predetermined network acquisition device, network data in real time or at regular intervals using a pre-configured network data acquisition script, preprocessing the acquired network data, and storing the preprocessed network data, as data to be analyzed, in a specified network data file;
a caching step: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a different data identifier to each piece of data to be analyzed collected from each network data file, and caching the data to which the identifiers have been added;
a distribution step: uniformly distributing the successfully cached data to be analyzed to the cache partitions of the cache space; and
a matching step: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification, storing the analyzed data in a database.
The detailed principle is described below with reference to Fig. 3, a block diagram of the data analysis program 10, and is not repeated here.
Alternatively, in other embodiments, the data parsing program may be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, which is a schematic diagram of program modules of a data parsing program in an embodiment of the electronic device 1, in this embodiment, the data parsing program 10 may be divided into an obtaining module 110, a caching module 120, an allocating module 130, and a matching module 140, and exemplarily:
The acquisition module 110: configured to acquire network data from a preset website in real time or at regular intervals using a pre-configured network data acquisition script, preprocess the acquired network data, and store the preprocessed network data, as data to be analyzed, in a specified network data file.
The cache module 120: configured to collect the data to be analyzed from the network data files in real time or at regular intervals, add a different data identifier to each piece of data to be analyzed collected from each network data file, and cache the data to which the identifiers have been added.
In this embodiment, Flume may be used to collect the data and Kafka may be used to cache it. Flume is a highly available distributed system for collecting, aggregating and transmitting massive volumes of logs; it supports customizing various data senders (such as Kafka, HDFS and the like) in the log system, which makes data collection convenient.
In this embodiment, each piece of data to be analyzed has a long-type digital data identifier for distinguishing different data, and the data to be analyzed are arranged together in the order of the corresponding data identifiers from small to large.
For example, if the data identifier corresponding to the data to be analyzed X1 is 1, the data identifier corresponding to the data to be analyzed X2 is 2, and the data identifier corresponding to the data to be analyzed X3 is 3, the data to be analyzed X1, the data to be analyzed X2, and the data to be analyzed X3 are arranged in the order of X1, X2, and X3.
Different pieces of data to be analyzed correspond to different data identifiers. The value of the long-type data identifier of each piece of data to be analyzed depends on the time at which the data was received: the earlier the receiving time, the smaller the identifier, and the later the receiving time, the larger the identifier.
For example, if the receiving time T1 of the data to be parsed X1 is earlier than the receiving time T2 of the data to be parsed X2, and the receiving time T2 of the data to be parsed X2 is earlier than the receiving time T3 of the data to be parsed X3, the rule for generating the corresponding long type data identifier for the data to be parsed X1, the data to be parsed X2, and the data to be parsed X3 is: the long type data identification corresponding to the data to be analyzed X1 is smaller than the long type data identification corresponding to the data to be analyzed X2, and the long type data identification corresponding to the data to be analyzed X2 is smaller than the long type data identification corresponding to the data to be analyzed X3.
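One way to obtain identifiers with this property is sketched below in Python. The bit layout (a millisecond timestamp shifted left by 12 bits) and the class name `IdGenerator` are assumptions made for illustration; the description only requires that identifiers be distinct and increase with receiving time.

```python
import time


class IdGenerator:
    """Produces strictly increasing integer identifiers ordered by receive time."""

    def __init__(self):
        self._last = 0

    def next_id(self, received_at_ms=None):
        if received_at_ms is None:
            received_at_ms = time.time_ns() // 1_000_000
        # Assumed layout: shifting by 12 bits leaves room to tell apart
        # items received within the same millisecond.
        candidate = received_at_ms << 12
        # Never go backwards, and never repeat an identifier.
        self._last = max(candidate, self._last + 1)
        return self._last


gen = IdGenerator()
x1 = gen.next_id(received_at_ms=1000)  # received at T1
x2 = gen.next_id(received_at_ms=1000)  # same millisecond, still larger
x3 = gen.next_id(received_at_ms=1001)  # received at T3
print(x1 < x2 < x3)  # True
```

With current millisecond timestamps the resulting values stay well within the range of a signed 64-bit long.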
The distribution module 130: configured to uniformly distribute the successfully cached data to be analyzed to the cache partitions of the cache space.
If a cache partition of Kafka fails while receiving data, the partition is restarted and it is checked whether the last piece of data to be analyzed was output successfully. If so, the data identifier of that last piece of data is obtained and sent to Flume; on receiving the identifier, Flume sends to the partition the data to be analyzed whose identifiers follow the received identifier.
For example, Flume sends the data to be analyzed with identifiers 10 to 100 to a first partition of Kafka. A failure occurs while the first partition is outputting the data to be analyzed, and the identifier of the last piece of data output by the first partition is 15. After the first partition is restarted, it sends the identifier 15 to Flume, and Flume sends the data to be analyzed with identifiers 16 to 100 to the first partition again.
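The resend decision in the example above reduces to selecting the items whose identifiers are greater than the last identifier reported by the restarted partition. A minimal Python sketch follows; the function name `resend_after` and the dictionary of sent items are hypothetical, and no Flume or Kafka API is involved.

```python
def resend_after(sent_items, last_output_id):
    """Return (identifier, payload) pairs that follow last_output_id, in order.

    sent_items maps data identifiers to payloads already sent to the partition.
    """
    return [
        (data_id, payload)
        for data_id, payload in sorted(sent_items.items())
        if data_id > last_output_id
    ]


# Identifiers 10 to 100 were sent; the partition reports 15 after restarting.
sent = {data_id: f"payload-{data_id}" for data_id in range(10, 101)}
to_resend = resend_after(sent, last_output_id=15)
print(to_resend[0][0], to_resend[-1][0], len(to_resend))  # 16 100 85
```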
In one embodiment of the invention, Kafka receives the data to be analyzed, stores it in a data queue of Kafka, and then uniformly distributes the data to be analyzed in the queue to the cache partitions of Kafka. The data to be analyzed stored in the data queue is first-in, first-out. For example, if the data queue receives data to be analyzed X1 first, then X2, and then X3, the data queue outputs the data to be analyzed in the order X1, X2, X3.
In another embodiment of the present invention, the step of uniformly distributing the data to be analyzed in the queue to the partitions of Kafka comprises:
if there are 100 pieces of data to be analyzed in the queue, with identifiers 0 to 99, dividing the data to be analyzed evenly into 5 segments, where the identifiers of the first segment are 0-19, the identifiers of the second segment are 20-39, and so on, and distributing each segment to a respective partition.
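The even split in this example can be sketched in Python as follows. The function name `split_evenly` is hypothetical, and the handling of a remainder (the first few segments get one extra item) is an assumption, since the description only covers the case where the count divides exactly.

```python
def split_evenly(identifiers, partition_count):
    """Split an ordered list into contiguous segments of near-equal size."""
    base, extra = divmod(len(identifiers), partition_count)
    segments = []
    start = 0
    for index in range(partition_count):
        # Assumption: leftover items go to the first `extra` segments.
        size = base + (1 if index < extra else 0)
        segments.append(identifiers[start:start + size])
        start += size
    return segments


segments = split_evenly(list(range(100)), 5)
print([(seg[0], seg[-1]) for seg in segments])
# [(0, 19), (20, 39), (40, 59), (60, 79), (80, 99)]
```

Keeping each segment contiguous preserves the ascending identifier order inside every partition, which the matching step relies on.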
The matching module 140: configured to analyze the data to be analyzed in each cache partition in ascending order of data identifier, verify the analyzed data against a preset template, and, if the analyzed data passes the verification, store the analyzed data in the database.
If the analyzed data does not pass the verification of the preset template, warning information in a preset format, together with a modification suggestion, is sent to an administrator. After a template creation request made by the administrator on the basis of the suggestion in the warning e-mail is received, the data to be analyzed in the cache partition is analyzed again, a new template is created from the analyzed data, and the differences between the new template and the original template are written to a log file.
The preset template defines a rule for storing the analyzed data into the database and is used for judging whether the analyzed data meet the condition of storing the analyzed data into the database.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a data analysis program, and the data analysis program is executable by one or more processors to implement the following steps:
an acquisition step: acquiring, by each predetermined network acquisition device, network data in real time or at regular intervals using a pre-configured network data acquisition script, preprocessing the acquired network data, and storing the preprocessed network data, as data to be analyzed, in a specified network data file;
a caching step: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a different data identifier to each piece of data to be analyzed collected from each network data file, and caching the data to which the identifiers have been added;
a distribution step: uniformly distributing the successfully cached data to be analyzed to the cache partitions of the cache space; and
a matching step: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification, storing the analyzed data in a database.
The embodiment of the storage medium of the present invention is substantially the same as the embodiments of the electronic device 1 and the method, and is not described again here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.