CN110704381A - Data analysis method, device and storage medium - Google Patents

Data analysis method, device and storage medium

Info

Publication number
CN110704381A
Authority
CN
China
Prior art keywords
data
analyzed
preset
network data
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910850992.XA
Other languages
Chinese (zh)
Inventor
陈万慧
苏雪婷
杨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Urban Construction Technology Shenzhen Co Ltd
Original Assignee
Ping An Urban Construction Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Urban Construction Technology Shenzhen Co Ltd
Priority to CN201910850992.XA
Publication of CN110704381A
Status: Pending


Abstract

The invention relates to a data acquisition technology and provides a data analysis method, an electronic device and a storage medium. The method comprises the following steps: acquiring network data from a preset website by using a pre-configured network data acquisition script, preprocessing the acquired network data, and storing the preprocessed network data serving as data to be analyzed into a specified network data file; collecting data to be analyzed from the network data files, adding different data identifiers to each data to be analyzed corresponding to each network data file, and caching the data added with the identifiers; uniformly distributing the successfully cached data to be analyzed to each cache partition of the cache space; and analyzing the data to be analyzed in the cache partition, verifying the analyzed data by using a preset template, and if the analyzed data passes the verification, storing the analyzed data into a database. By the method and the device, the data before being stored in the database can be audited, and the efficiency of storing the data in the database is improved.

Description

Data analysis method, device and storage medium
Technical Field
The present invention relates to the field of data acquisition technologies, and in particular, to a data analysis method, an apparatus, and a storage medium.
Background
With the rapid development of networks, the world wide web has become an important data source in the field of data analysis as a carrier of a large amount of information, and in the prior art, data is generally automatically acquired from the world wide web by using a data acquisition program or script.
Currently, after data is acquired with such programs or scripts, the relevant personnel usually have to be reminded to create a database table to store the data. Creating tables on manual prompting requires extensive manual intervention and time, cannot guarantee that the data is stored in real time, and is error-prone.
Disclosure of Invention
The invention provides a data analysis method, a data analysis device and a storage medium, and mainly aims to audit data before being stored in a database and improve the efficiency of storing the data in the database.
In order to achieve the above object, the present invention provides a data parsing method, including the steps of:
an acquisition step: respectively utilizing a pre-configured network data acquisition script to acquire network data from a preset website in real time or at regular time, preprocessing the acquired network data, and storing the preprocessed network data serving as data to be analyzed into a specified network data file;
a caching step: collecting data to be analyzed from the network data files in real time or at regular time, adding different data identifiers to each data to be analyzed corresponding to each collected network data file, and caching the data added with the identifiers;
a distribution step: uniformly distributing the successfully cached data to be analyzed to each cache partition of the cache space; and
a matching step: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification of the preset template, storing the analyzed data in a database.
Preferably, if the analyzed data does not pass the verification of the preset template, sending warning information with a preset format and a modification suggestion to a preset client, after receiving a template creating request submitted by the preset client in response to the warning information, re-analyzing the data to be analyzed in the cache partition, creating a new template according to the analyzed data, and writing a difference part between the new template and the original template into a log file.
Preferably, the preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting dates in the network data to a preset format, and deleting duplicate data from the network data.
Preferably, the partition establishing rule includes determining the number of new partitions corresponding to the current first differential speed according to a predetermined mapping between the first differential speed and the number of new partitions; and
the partition logout rule includes determining the number of partitions to log out corresponding to the current second differential speed according to a predetermined mapping between the second differential speed and the number of partitions to log out.
Preferably, if the number of data to be analyzed in the cache partition is greater than or equal to a first preset number, a new cache partition is created according to the predetermined partition establishing rule; and
if the number of data to be analyzed in the cache partitions is less than or equal to a second preset number, a corresponding number of cache partitions is logged out according to the predetermined partition logout rule.
In addition, to achieve the above object, the present invention further provides an electronic device, which includes a memory and a processor, wherein the memory stores a data analysis program operable on the processor, and the data analysis program implements the following steps when executed by the processor:
an acquisition step: respectively utilizing a pre-configured network data acquisition script to acquire network data from a preset website in real time or at regular time, preprocessing the acquired network data, and storing the preprocessed network data serving as data to be analyzed into a specified network data file;
a caching step: collecting data to be analyzed from the network data files in real time or at regular time, adding different data identifiers to each data to be analyzed corresponding to each collected network data file, and caching the data added with the identifiers;
a distribution step: uniformly distributing the successfully cached data to be analyzed to each cache partition of the cache space; and
a matching step: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification of the preset template, storing the analyzed data in a database.
Preferably, if the analyzed data does not pass the verification of the preset template, sending warning information with a preset format and a modification suggestion to a preset client, after receiving a template creating request submitted by the preset client in response to the warning information, re-analyzing the data to be analyzed in the cache partition, creating a new template according to the analyzed data, and writing a difference part between the new template and the original template into a log file.
Preferably, the preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting dates in the network data to a preset format, and deleting duplicate data from the network data.
Preferably, if the number of data to be analyzed in the cache partition is greater than or equal to a first preset number, a new cache partition is created according to a predetermined partition establishing rule; and
if the number of data to be analyzed in the cache partitions is less than or equal to a second preset number, a corresponding number of cache partitions is logged out according to a predetermined partition logout rule.
Compared with the prior art, the method and the device have the advantages that the network data are acquired from the preset website by using the pre-configured network data acquisition script, different data identifiers are respectively added to each data to be analyzed, the mapping relation between the data to be analyzed and the identifiers is established, and the data added with the identifiers are cached; uniformly distributing the successfully cached data to be analyzed to each cache partition of the cache space; and analyzing the data to be analyzed of each cache partition, matching the analyzed data with a preset template, and if the data is successfully matched with the preset template, storing the analyzed data in a database, thereby effectively realizing the examination of the data before being stored in the database and improving the efficiency of storing the data in the database.
Drawings
FIG. 1 is a flow chart of an embodiment of a data parsing method according to the present invention;
FIG. 2 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the invention;
FIG. 3 is a block diagram of an embodiment of the data analysis program in FIG. 2.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention. The method may be performed by an electronic device, which may be implemented by software and/or hardware.
In this embodiment, the data analysis method includes:
step S10, acquiring network data from a preset website in real time or at regular time by using a pre-configured network data acquisition script, preprocessing the acquired network data, and saving the preprocessed network data as data to be analyzed into a specified network data file.
The preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting dates in the network data to a preset format, and deleting duplicate data from the network data.
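The preprocessing operations listed above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the set of "special" punctuation characters, the accepted input date formats, and the target date format are all assumptions, since the text names the operations but not their parameters.

```python
import re
from datetime import datetime

# Assumed values: the source does not say which characters count as
# "special punctuation" or which date formats are involved.
SPECIAL_PUNCTUATION = re.compile("[★☆◆■※]")
INPUT_DATE_FORMATS = ("%Y/%m/%d", "%Y.%m.%d")
TARGET_DATE_FORMAT = "%Y-%m-%d"


def to_half_width(text):
    """Map full-width digits and Latin letters to their ASCII equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        is_digit = 0xFF10 <= code <= 0xFF19
        is_upper = 0xFF21 <= code <= 0xFF3A
        is_lower = 0xFF41 <= code <= 0xFF5A
        out.append(chr(code - 0xFEE0) if (is_digit or is_upper or is_lower) else ch)
    return "".join(out)


def normalize_date(value):
    """Rewrite a value in the target date format if it parses as a date."""
    for fmt in INPUT_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime(TARGET_DATE_FORMAT)
        except ValueError:
            continue
    return value  # not a recognized date: leave unchanged


def clean_value(value):
    value = to_half_width(value)
    value = value.replace('"', "")            # remove double quotation marks
    value = SPECIAL_PUNCTUATION.sub("", value)
    return normalize_date(value.strip())


def preprocess(records):
    """Clean every field of every record and drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        item = {key: clean_value(str(val)) for key, val in record.items()}
        fingerprint = tuple(sorted(item.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(item)
    return cleaned
```

Duplicates are detected only after cleaning, so two records that differ solely in quoting or character width collapse into one.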
In an embodiment, the network data acquisition script may be written in Python or JavaScript. The network acquisition device may be a terminal, such as a desktop computer, that runs the network data acquisition script. The script acquires information from the website in real time or at regular intervals according to the website and the acquisition conditions provided by the developers. The preprocessed data may be in JSON format and stored in the network data file as key-value pairs, for example:
[Figures BDA0002194492360000041, BDA0002194492360000042 and BDA0002194492360000051: example of preprocessed data in JSON key-value form; images not reproduced.]
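Because the original example survives only as images, the following Python sketch merely illustrates what storing preprocessed data as JSON key-value pairs in a network data file could look like. The one-object-per-line layout, the file name, and the sample values are assumptions. The field names id, title, url, and city are borrowed from the template example later in the description.

```python
import json
import os
import tempfile


def append_records(path, records):
    """Append each record as one JSON object per line (assumed layout)."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read the data to be analyzed back from the network data file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    # Hypothetical record; the values are illustrative only.
    sample = {
        "id": "1001",
        "title": "Example listing",
        "url": "https://example.com/item/1001",
        "city": "Shenzhen",
    }
    path = os.path.join(tempfile.mkdtemp(), "network_data.json")
    append_records(path, [sample])
    print(read_records(path))
```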
step S20, collecting data to be analyzed from the network data files in real time or at regular time, adding different data identifiers to each data to be analyzed corresponding to each collected network data file, and caching the data after the identifiers are added.
In this embodiment, Flume may be used to collect the data and Kafka may be used to cache it. Flume is a highly available, distributed system for collecting, aggregating, and transmitting massive volumes of log data; it supports customizing various data senders (such as Kafka and HDFS) in the logging system, which makes data collection convenient. Kafka is a distributed message queue that can process large volumes of data in real time to serve a variety of scenarios, and it offers high performance, persistence, multi-replica backup, and horizontal scalability.
In another embodiment of the present invention, after the data to be analyzed is collected, different data identifiers are added to each piece of data to be analyzed corresponding to each collected network data file, the data to be analyzed to which the data identifier is added is backed up to a preset storage space, the backed-up data to be analyzed is cached in a specified cache space by performing a caching operation, and after the caching of the data to be analyzed is successful, the backed-up data to be analyzed is deleted from the preset storage space.
For example, Flume collects data A to be analyzed, adds data identifier 1 to it to form data 1-A to be analyzed, stores data 1-A in Flume, and sends data 1-A to Kafka.
After receiving data 1-A, Kafka sends data identifier 1 back to Flume. Receipt of data identifier 1 indicates that Kafka has successfully cached the data carrying that identifier, so Flume looks up data 1-A by the received identifier and deletes it.
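The back-up, send, acknowledge, and delete sequence described above can be sketched with in-memory stand-ins. The classes `Cache` and `Collector` below are hypothetical and do not use the Flume or Kafka APIs; they only model the rule that a backup is deleted once the cache returns the data identifier.

```python
class Cache:
    """Stand-in for the cache side; returns the identifier as an acknowledgement."""

    def __init__(self, fail_ids=()):
        self.items = {}
        self.fail_ids = set(fail_ids)  # identifiers whose caching is made to fail

    def store(self, data_id, payload):
        if data_id in self.fail_ids:
            return None                # no acknowledgement: caching failed
        self.items[data_id] = payload
        return data_id


class Collector:
    """Stand-in for the collection side; keeps a backup until acknowledged."""

    def __init__(self, cache):
        self.cache = cache
        self.backup = {}
        self.next_id = 1

    def collect(self, payload):
        data_id = self.next_id
        self.next_id += 1
        self.backup[data_id] = payload             # back up before sending
        acked_id = self.cache.store(data_id, payload)
        if acked_id is not None:
            self.backup.pop(acked_id, None)        # delete only after the ack
        return data_id
```

Data whose identifier is never acknowledged stays in the backup, so it can be sent again later.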
And step S30, uniformly distributing the successfully cached data to be analyzed to each cache partition of the cache space.
In an embodiment of the present invention, Kafka uniformly distributes the data to be analyzed in its data queue to the cache partitions of the cache space. The data queue is first-in, first-out. For example, if the data queue receives data to be analyzed X1, then X2, and then X3, it outputs them in the order X1, X2, X3.
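The first-in, first-out behaviour of the data queue can be illustrated with Python's standard `collections.deque`. The helper name `drain` is hypothetical, and the sketch models only the ordering, not Kafka itself.

```python
from collections import deque


def drain(queue):
    """Output every queued item in arrival order (first in, first out)."""
    output = []
    while queue:
        output.append(queue.popleft())
    return output


queue = deque()
for item in ("X1", "X2", "X3"):   # arrival order
    queue.append(item)

output_order = drain(queue)        # same order as arrival
```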
In the same embodiment of the present invention, the partition establishing rule is: if the quantity of the data to be analyzed in the data queue is greater than or equal to a first preset quantity, calculating to obtain first differential speeds (P1-P2) according to the receiving speed (for example, P1 pieces/millisecond) and the output speed (for example, P2 pieces/millisecond) of the data to be analyzed in the queue, and determining the quantity of new partitions corresponding to the current first differential speed according to the mapping relation between the first differential speed and the quantity of the new partitions, which is determined in advance.
The lookup table mapping the first differential speed to the number of new partitions is as follows:
[Table image BDA0002194492360000061 not reproduced.]
The partition logout rule is as follows: if the number of data to be analyzed in the queue is less than or equal to a second preset number, a second differential speed (P2-P1) is calculated from the receiving speed (for example, P1 pieces/millisecond) and the output speed (for example, P2 pieces/millisecond) of the data to be analyzed in the queue, and the number of partitions to log out corresponding to the current second differential speed is determined according to a predetermined mapping between the second differential speed and the number of partitions to log out.
The lookup table mapping the second differential speed to the number of partitions to log out is as follows:
[Table image BDA0002194492360000062 not reproduced.]
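The two rules can be sketched as a single decision function. The lookup tables are not reproduced in this text, so every threshold and table entry below is a hypothetical placeholder; only the structure follows the description (compare the queue length with the preset numbers, compute P1-P2 or P2-P1, and look up a partition count).

```python
import bisect

# Hypothetical values: the source's preset numbers and lookup tables are unknown.
FIRST_PRESET_NUMBER = 10_000
SECOND_PRESET_NUMBER = 1_000
# Each entry: (lower bound of the differential speed in pieces/ms, partition count)
CREATE_TABLE = [(0, 1), (50, 2), (100, 3)]
LOGOUT_TABLE = [(0, 1), (50, 2), (100, 3)]


def lookup(table, differential_speed):
    """Return the count for the highest lower bound not exceeding the speed."""
    bounds = [bound for bound, _ in table]
    idx = bisect.bisect_right(bounds, differential_speed) - 1
    return table[idx][1] if idx >= 0 else 0


def partition_delta(queue_length, p1, p2):
    """Return +n partitions to create, -n partitions to log out, or 0."""
    if queue_length >= FIRST_PRESET_NUMBER:
        return lookup(CREATE_TABLE, p1 - p2)       # first differential speed
    if queue_length <= SECOND_PRESET_NUMBER:
        return -lookup(LOGOUT_TABLE, p2 - p1)      # second differential speed
    return 0
```

A negative differential speed falls below every table entry and yields no change, which is one possible reading of the rule; the source does not address that case.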
and step S40, analyzing the corresponding data to be analyzed according to the sequence of the identifiers of the data to be analyzed of each cache partition from small to large, checking the analyzed data by using a preset template, and if the analyzed data passes the check of the preset template, storing the analyzed data in a database.
If the analyzed data does not pass the verification of the preset template, warning information in a preset format, together with a modification suggestion, is sent to an administrator. After a template creation request that the administrator submits based on the suggestion in the warning mail is received, the data to be analyzed in the cache partition is analyzed again, a new template is created from the analyzed data, and the differences between the new template and the original template are written to a log file.
The preset template defines a rule for storing the analyzed data into the database and is used for judging whether the analyzed data meet the condition of storing the analyzed data into the database.
In one embodiment of the invention, the preset template specifies the condition of the data stored in the database, and if the parsed data meets the condition of the data stored in the database specified by the preset template, the parsed data is allowed to be stored in the database. For example, a preset template includes four fields of keys (Key values in Key-Value, also called fields) id, title, url, and city, and defines rules for the contents stored in each field.
For example, the content stored in the id field is defined to be only a number, the field in the analyzed data is compared with the field in the preset template, if the field in the analyzed data exists in the preset template, whether the content stored in the id field in the analyzed data is a number is judged, and if the content stored in the id field in the analyzed data is a number, the id field passes the check of the preset template.
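A minimal sketch of such a template check follows. The field names id, title, url, and city and the digits-only rule for id come from the description; the rules for the other three fields, the function names, and the list used as a stand-in database are assumptions for illustration.

```python
import re

# Only the "id must be a number" rule is stated in the source;
# the remaining rules are illustrative assumptions.
PRESET_TEMPLATE = {
    "id": re.compile(r"\d+"),
    "title": re.compile(r".+"),
    "url": re.compile(r"https?://\S+"),
    "city": re.compile(r".+"),
}


def verify(record, template=PRESET_TEMPLATE):
    """Check every field of the analyzed data against the template."""
    problems = []
    for field, value in record.items():
        rule = template.get(field)
        if rule is None:
            problems.append(f"unknown field: {field}")
        elif not rule.fullmatch(str(value)):
            problems.append(f"invalid content in field: {field}")
    return (not problems, problems)


def store_if_valid(record, database):
    """Append the record to the stand-in database only if it passes."""
    passed, problems = verify(record)
    if passed:
        database.append(record)
    return passed, problems
```

The list of problems returned on failure is the kind of information that could feed the warning message sent when verification fails.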
In another embodiment, if a new field is found in the analyzed data after the data to be analyzed is analyzed, the new field is added to the original template to form a new template. For example, if the analyzed data contains a new name field that the preset template lacks, then, according to the administrator's settings, the name of the field and the requirements on the content stored in it are added to the template to form a new template, and the differences between the new template and the original template are written to the log file.
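Extending the template and recording the difference can be sketched as below. Templates are modelled here as plain dictionaries from field name to a textual content rule, which is an assumption, as are the function names. The standard `logging` module stands in for whatever log mechanism the embodiment uses.

```python
import logging


def extend_template(template, new_rules):
    """Return a new template plus the fields that were actually added."""
    added = {field: rule for field, rule in new_rules.items() if field not in template}
    new_template = dict(template)      # the original template is left untouched
    new_template.update(added)
    return new_template, added


def log_difference(added, logger):
    """Write the difference between the new and original template to the log."""
    for field, rule in sorted(added.items()):
        logger.info("template field added: %s (rule: %s)", field, rule)
```

To send the difference to a log file, a caller could attach a `logging.FileHandler` to the logger and set its level to `INFO` before calling `log_difference`.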
The invention also provides an electronic device. Fig. 2 is a schematic view of an internal structure of an electronic device according to an embodiment of the invention.
In this embodiment, the electronic device 1 includes at least a memory 11, a processor 12, a network interface 13, and a communication bus.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, hard disk, multi-media card, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, for example a hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic apparatus 1 in other embodiments, such as a plug-in hard disk provided on the electronic apparatus 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic apparatus 1. The memory 11 may be used to store not only application software installed in the electronic apparatus 1 and various types of data, such as codes of the data analysis program 10, but also temporarily store data that has been output or is to be output.
The processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip in some embodiments, and is used for executing program codes stored in the memory 11 or Processing data, such as executing the data parser 10.
The network interface 13 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), and is typically used to establish a communication link between the electronic apparatus 1 and other electronic devices.
The communication bus is used to enable connection communication between these components.
Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
Fig. 2 only shows the electronic device 1 with the components 11-13 and the data parser 10, and it will be understood by those skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
In the embodiment of the electronic device 1 shown in fig. 2, the memory 11 stores the data analysis program 10, and the processor 12 implements the following steps when executing the data analysis program 10 stored in the memory 11:
an acquisition step: utilizing each predetermined network acquisition device to respectively utilize a pre-configured network data acquisition script to acquire network data in real time or at regular time, preprocessing the acquired network data, and storing the preprocessed network data serving as data to be analyzed into a specified network data file;
a caching step: collecting data to be analyzed from the network data files in real time or at regular time, adding different data identifiers to each data to be analyzed corresponding to each collected network data file, and caching the data added with the identifiers;
a distribution step: uniformly distributing the successfully cached data to be analyzed to each cache partition of the cache space; and
a matching step: analyzing the data to be analyzed in each cache partition in ascending order of data identifier, verifying the analyzed data against a preset template, and, if the analyzed data passes the verification of the preset template, storing the analyzed data in a database.
The detailed principle is described below with reference to FIG. 3, a block diagram of the data analysis program 10, and is not repeated here.
Alternatively, in other embodiments, the data parsing program may be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, which is a schematic diagram of program modules of a data parsing program in an embodiment of the electronic device 1, in this embodiment, the data parsing program 10 may be divided into an obtaining module 110, a caching module 120, an allocating module 130, and a matching module 140, and exemplarily:
The acquisition module 110 is configured to acquire network data from a preset website in real time or at regular intervals by using a pre-configured network data acquisition script, preprocess the acquired network data, and store the preprocessed network data, as data to be analyzed, in a specified network data file.
The cache module 120 is configured to collect data to be analyzed from the network data files in real time or at regular intervals, add a different data identifier to each piece of data to be analyzed corresponding to each collected network data file, and cache the data after the identifiers are added.
In this embodiment, Flume may be used to collect the data and Kafka may be used to cache it. Flume is a highly available, distributed system for collecting, aggregating, and transmitting massive volumes of log data; it supports customizing various data senders (such as Kafka and HDFS) in the logging system, which makes data collection convenient.
In this embodiment, each piece of data to be analyzed has a long-type digital data identifier for distinguishing different data, and the data to be analyzed are arranged together in the order of the corresponding data identifiers from small to large.
For example, if the data identifier corresponding to the data to be analyzed X1 is 1, the data identifier corresponding to the data to be analyzed X2 is 2, and the data identifier corresponding to the data to be analyzed X3 is 3, the data to be analyzed X1, the data to be analyzed X2, and the data to be analyzed X3 are arranged in the order of X1, X2, and X3.
Different data to be analyzed correspond to different data identifications, the size of the long type data identification corresponding to each piece of data to be analyzed is related to the corresponding data receiving time, the long type data corresponding to the data to be analyzed with the earlier receiving time is smaller, and the long type data corresponding to the data to be analyzed with the later receiving time is larger.
For example, if the receiving time T1 of the data to be parsed X1 is earlier than the receiving time T2 of the data to be parsed X2, and the receiving time T2 of the data to be parsed X2 is earlier than the receiving time T3 of the data to be parsed X3, the rule for generating the corresponding long type data identifier for the data to be parsed X1, the data to be parsed X2, and the data to be parsed X3 is: the long type data identification corresponding to the data to be analyzed X1 is smaller than the long type data identification corresponding to the data to be analyzed X2, and the long type data identification corresponding to the data to be analyzed X2 is smaller than the long type data identification corresponding to the data to be analyzed X3.
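One way to obtain identifiers that grow with receiving time and never repeat is sketched below. The class `IdAssigner`, the use of millisecond timestamps, and the tie-breaking rule for data received in the same millisecond are assumptions; the source only requires that earlier data receive smaller long-type identifiers.

```python
import time


class IdAssigner:
    """Assigns strictly increasing integer identifiers based on receiving time."""

    def __init__(self):
        self._last = 0

    def assign(self, received_at_ms=None):
        if received_at_ms is None:
            received_at_ms = time.time_ns() // 1_000_000
        # Use the timestamp when possible, but never repeat or go backwards,
        # so data received in the same millisecond still gets distinct ids.
        self._last = max(self._last + 1, received_at_ms)
        return self._last
```

Python integers are unbounded, so a real implementation targeting a 64-bit long type would also need to stay within that range.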
The distribution module 130 is configured to uniformly distribute the successfully cached data to be analyzed to the cache partitions of the cache space.
If a cache partition of Kafka fails while receiving data, the partition is restarted and checks whether the last data to be analyzed was output successfully. If so, the partition obtains the data identifier of that last data and sends the identifier to Flume; after receiving the identifier, Flume resends to the partition the data to be analyzed that follows the identified data.
For example, Flume sends the data to be analyzed with identifiers 10 to 100 to the first partition of Kafka. A failure occurs while the first partition is outputting the data, and the identifier of the last data it output is 15. After restarting, the first partition sends the identifier 15 to the collection module, and Flume resends the data to be analyzed with identifiers 16 to 100 to the first partition.
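The resend rule in this example reduces to selecting every pending item whose identifier is greater than the last one output. A minimal sketch follows, in which `pending` is a hypothetical in-memory map of identifier to data held by the collection side.

```python
def resend_after(pending, last_output_id):
    """Return (identifier, data) pairs that follow the last output identifier."""
    return [
        (data_id, pending[data_id])
        for data_id in sorted(pending)       # ascending identifier order
        if data_id > last_output_id
    ]


# Mirrors the example: identifiers 10 to 100 were sent, 15 was the last output.
pending = {data_id: f"item-{data_id}" for data_id in range(10, 101)}
to_resend = resend_after(pending, 15)
```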
In one embodiment of the invention, Kafka receives the data to be analyzed, stores it in a Kafka data queue, and then uniformly distributes the data to be analyzed in the queue to the cache partitions of Kafka. The data queue is first-in, first-out. For example, if the data queue receives data to be analyzed X1, then X2, and then X3, it outputs them in the order X1, X2, X3.
In another embodiment of the present invention, uniformly distributing the data to be analyzed in the queue to the partitions of Kafka includes the following:
if the queue holds 100 pieces of data to be analyzed, with identifiers 0 to 99, the data is divided evenly into 5 segments (identifiers 0-19 for the first segment, 20-39 for the second, and so on), and each segment is distributed to one partition.
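The segment-wise distribution in this example can be sketched as follows. The function name is hypothetical, and the handling of a count that does not divide evenly (earlier segments take one extra item) is an assumption, since the source only covers the evenly divisible case.

```python
def split_evenly(identifiers, partition_count):
    """Split identifiers into contiguous, near-equal segments, one per partition."""
    ordered = sorted(identifiers)
    base, extra = divmod(len(ordered), partition_count)
    segments, start = [], 0
    for index in range(partition_count):
        size = base + (1 if index < extra else 0)   # spread any remainder
        segments.append(ordered[start:start + size])
        start += size
    return segments


segments = split_evenly(range(100), 5)
# segments[0] holds identifiers 0-19, segments[1] holds 20-39, and so on.
```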
The matching module 140 is configured to analyze the data to be analyzed in each cache partition in ascending order of data identifier, verify the analyzed data against a preset template, and, if the analyzed data passes the verification of the preset template, store the analyzed data in the database.
If the analyzed data does not pass the verification of the preset template, warning information in a preset format, together with a modification suggestion, is sent to an administrator. After a template creation request that the administrator submits based on the suggestion in the warning mail is received, the data to be analyzed in the cache partition is analyzed again, a new template is created from the analyzed data, and the differences between the new template and the original template are written to a log file.
The preset template defines a rule for storing the analyzed data into the database and is used for judging whether the analyzed data meet the condition of storing the analyzed data into the database.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a data analysis program, and the data analysis program is executable by one or more processors to implement the following steps:
an acquisition step: having each predetermined network acquisition device use a pre-configured network data acquisition script to acquire network data in real time or at regular intervals, preprocessing the acquired network data, and storing the preprocessed network data, as data to be analyzed, in a specified network data file;
a caching step: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a distinct data identifier to each piece of data to be analyzed from each collected network data file, and caching the identified data;
a distribution step: evenly distributing the successfully cached data to be analyzed across the cache partitions of the cache space; and
a matching step: analyzing the corresponding data to be analyzed in ascending order of the cache partition identifiers, verifying the analyzed data against a preset template, and, if the analyzed data passes verification against the preset template, storing the analyzed data in a database.
The embodiment of the storage medium of the present invention is substantially the same as the embodiments of the electronic device 1 and the system, and is not repeated here.
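The preprocessing named in the acquisition step is enumerated in the claims as deleting special punctuation, converting full-width digits and letters to half-width, removing double quotation marks, normalizing dates, and deleting duplicates. The sketch below shows one way this might look in Python. The set of special punctuation characters, the accepted input date formats, and the target date format are all assumptions, since the document does not list them.

```python
import re
from datetime import datetime

# Assumptions: the document does not enumerate these values.
SPECIAL_PUNCTUATION = "★◆■※"
DOUBLE_QUOTES = '"\u201c\u201d'
DATE_FORMATS = ("%Y/%m/%d", "%Y.%m.%d")
TARGET_DATE_FORMAT = "%Y-%m-%d"
DATE_PATTERN = re.compile(r"\d{4}[/.]\d{1,2}[/.]\d{1,2}")
DELETE_TABLE = str.maketrans("", "", SPECIAL_PUNCTUATION + DOUBLE_QUOTES)


def to_half_width(text):
    """Convert full-width digits and Latin letters to their half-width forms."""
    converted = []
    for char in text:
        code = ord(char)
        if (0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A
                or 0xFF41 <= code <= 0xFF5A):
            # Full-width forms sit at a fixed offset from ASCII.
            char = chr(code - 0xFEE0)
        converted.append(char)
    return "".join(converted)


def normalize_dates(text):
    """Rewrite recognized dates in the target format; leave anything else unchanged."""
    def convert(match):
        raw = match.group(0)
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(raw, fmt).strftime(TARGET_DATE_FORMAT)
            except ValueError:
                continue
        return raw
    return DATE_PATTERN.sub(convert, text)


def preprocess(records):
    """Clean each record and drop duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for record in records:
        text = to_half_width(record)
        text = text.translate(DELETE_TABLE)
        text = normalize_dates(text).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


raw_records = ['"Ｒｏｏｍ １０１",2019/09/06★', "Room 101,2019.9.6"]
print(preprocess(raw_records))
```

Deduplication runs last so that records differing only in width, quoting, or date notation collapse into a single entry.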
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A data analysis method applied to an electronic device, characterized in that the method comprises:
an acquisition step: acquiring network data from a preset website in real time or at regular intervals by using a pre-configured network data acquisition script, preprocessing the acquired network data, and storing the preprocessed network data, as data to be analyzed, in a specified network data file;
a caching step: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a distinct data identifier to each piece of data to be analyzed from each collected network data file, and caching the identified data;
a distribution step: evenly distributing the successfully cached data to be analyzed across the cache partitions of the cache space; and
a matching step: analyzing the corresponding data to be analyzed in ascending order of the cache partition identifiers, verifying the analyzed data against a preset template, and, if the analyzed data passes verification against the preset template, storing the analyzed data in a database.
2. The data analysis method of claim 1, further comprising: if the analyzed data does not pass verification against the preset template, sending warning information that has a preset format and includes a modification suggestion to a preset client; and, after receiving a template creation request submitted by the preset client in response to the warning information, re-analyzing the data to be analyzed in the cache partition, creating a new template from the analyzed data, and writing the differences between the new template and the original template to a log file.
3. The data analysis method of claim 1, wherein the preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting date formats in the network data to a preset format, and deleting duplicate data from the network data.
4. The data analysis method of claim 1, wherein if the number of pieces of data to be analyzed in the cache partition is greater than or equal to a first preset number, a new cache partition is created according to a predetermined partition creation rule; and
if the number of pieces of data to be analyzed in the cache partitions is less than or equal to a second preset number, a corresponding number of cache partitions is cancelled according to a predetermined partition cancellation rule.
5. The data analysis method of claim 4, wherein the partition creation rule comprises determining the number of new partitions corresponding to the current first differential speed according to a predetermined mapping between the first differential speed and the number of new partitions; and
the partition cancellation rule comprises determining the number of partitions to cancel corresponding to the current second differential speed according to a predetermined mapping between the second differential speed and the number of partitions to cancel.
6. An electronic device comprising a memory and a processor, wherein the memory stores a data analysis program that, when executed by the processor, performs the following steps:
an acquisition step: acquiring network data from a preset website in real time or at regular intervals by using a pre-configured network data acquisition script, preprocessing the acquired network data, and storing the preprocessed network data, as data to be analyzed, in a specified network data file;
a caching step: collecting the data to be analyzed from the network data files in real time or at regular intervals, adding a distinct data identifier to each piece of data to be analyzed from each collected network data file, and caching the identified data;
a distribution step: evenly distributing the successfully cached data to be analyzed across the cache partitions of the cache space; and
a matching step: analyzing the corresponding data to be analyzed in ascending order of the cache partition identifiers, verifying the analyzed data against a preset template, and, if the analyzed data passes verification against the preset template, storing the analyzed data in a database.
7. The electronic device of claim 6, wherein the steps further comprise: if the analyzed data does not pass verification against the preset template, sending warning information that has a preset format and includes a modification suggestion to a preset client; and, after receiving a template creation request submitted by the preset client in response to the warning information, re-analyzing the data to be analyzed in the cache partition, creating a new template from the analyzed data, and writing the differences between the new template and the original template to a log file.
8. The electronic device of claim 6, wherein the preprocessing comprises: deleting special punctuation characters from the network data, converting full-width digits and letters to half-width, removing double quotation marks from the network data, converting date formats in the network data to a preset format, and deleting duplicate data from the network data.
9. The electronic device of any one of claims 6 to 8, wherein if the number of pieces of data to be analyzed in the cache partition is greater than or equal to a first preset number, new cache partitions are created according to a predetermined partition creation rule, so as to add partitions for storing the data to be analyzed; and
if the number of pieces of data to be analyzed in the cache partitions is less than or equal to a second preset number, cache partitions are cancelled according to a predetermined partition cancellation rule, so as to reduce the partitions for storing the data to be analyzed.
10. A computer-readable storage medium storing a data analysis program which, when executed by a processor, implements the steps of the data analysis method of any one of claims 1 to 5.
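The threshold-driven creation and cancellation of cache partitions in claims 4, 5 and 9 might be sketched as follows. Every number here is hypothetical: the two preset numbers, both mapping tables, and the reading that growth triggers when any one partition is over the first threshold while shrinking requires all partitions to be under the second. The claims in this section also do not define the differential speeds, so they are treated simply as numeric inputs.

```python
import bisect

FIRST_PRESET_NUMBER = 1000   # assumed upper threshold per partition
SECOND_PRESET_NUMBER = 100   # assumed lower threshold per partition

# Hypothetical mappings: (lower bound of a differential-speed band, partition count).
CREATE_BANDS = [(0, 1), (500, 2), (2000, 4)]
CANCEL_BANDS = [(0, 1), (500, 2)]


def lookup(bands, differential_speed):
    """Return the partition count for the band that contains the given speed."""
    bounds = [lower for lower, _ in bands]
    index = bisect.bisect_right(bounds, differential_speed) - 1
    return bands[max(index, 0)][1]


def plan_partitions(partition_sizes, first_differential_speed, second_differential_speed):
    """Return the change in partition count: positive to create, negative to cancel."""
    if any(size >= FIRST_PRESET_NUMBER for size in partition_sizes):
        return lookup(CREATE_BANDS, first_differential_speed)
    if all(size <= SECOND_PRESET_NUMBER for size in partition_sizes):
        # Assumption: at least one partition always remains.
        removable = len(partition_sizes) - 1
        return -min(lookup(CANCEL_BANDS, second_differential_speed), removable)
    return 0


print(plan_partitions([1200, 300], 600, 0))
print(plan_partitions([50, 20, 10], 0, 700))
print(plan_partitions([500, 300], 0, 0))
```

Keeping the mappings as sorted band tables makes the rule a lookup rather than a chain of conditionals, so the bands can be retuned without touching the logic.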
Application CN201910850992.XA, priority date 2019-09-06, filing date 2019-09-06: Data analysis method, device and storage medium (status: Pending; published as CN110704381A (en))

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201910850992.XA | CN110704381A (en) | 2019-09-06 | 2019-09-06 | Data analysis method, device and storage medium

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201910850992.XA | CN110704381A (en) | 2019-09-06 | 2019-09-06 | Data analysis method, device and storage medium

Publications (1)

Publication Number | Publication Date
CN110704381A (en) | 2020-01-17

Family

ID=69195927

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201910850992.XA | Pending | CN110704381A (en) | 2019-09-06 | 2019-09-06 | Data analysis method, device and storage medium

Country Status (1)

Country | Link
CN (1) | CN110704381A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103577513A (en) * | 2012-07-18 | 2014-02-12 | 德商赛克美国有限公司 | Systems and/or methods for caching xml information sets with delayed node instantiation
CN104408190A (en) * | 2014-12-15 | 2015-03-11 | 北京国双科技有限公司 | Spark based data processing method and device
CN104731859A (en) * | 2015-02-02 | 2015-06-24 | 厦门市美亚柏科信息股份有限公司 | Data processing method and device
CN106897344A (en) * | 2016-07-21 | 2017-06-27 | 阿里巴巴集团控股有限公司 | The data operation request treatment method and device of distributed data base
CN106598496A (en) * | 2016-12-08 | 2017-04-26 | 蓝信工场(北京)科技有限公司 | Method and device for constructing virtual magnetic disk and processing data
CN109039743A (en) * | 2018-08-03 | 2018-12-18 | 西安东美信息科技有限公司 | The centralized management method of distributed storage ceph cluster networks
CN109064345A (en) * | 2018-08-14 | 2018-12-21 | 中国平安人寿保险股份有限公司 | Message treatment method, system and computer readable storage medium
CN109657125A (en) * | 2018-12-14 | 2019-04-19 | 平安城市建设科技(深圳)有限公司 | Data processing method, device, equipment and storage medium based on web crawlers
CN110008266A (en) * | 2019-03-13 | 2019-07-12 | 平安信托有限责任公司 | Data interchange file analysis method and device

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111506573A (en) * | 2020-03-16 | 2020-08-07 | 中国平安人寿保险股份有限公司 | Database table partitioning method and device, computer equipment and storage medium
CN111506573B (en) * | 2020-03-16 | 2024-03-12 | 中国平安人寿保险股份有限公司 | Database table partitioning method, device, computer equipment and storage medium
CN111428132B (en) * | 2020-03-18 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Data verification method and device, computer storage medium and electronic equipment
CN111428132A (en) * | 2020-03-18 | 2020-07-17 | 腾讯科技(深圳)有限公司 | Data verification method and device, computer storage medium and electronic equipment
CN111797613A (en) * | 2020-07-10 | 2020-10-20 | 泰康保险集团股份有限公司 | Data file processing method and device
CN111797613B (en) * | 2020-07-10 | 2024-04-09 | 泰康保险集团股份有限公司 | Data file processing method and device
CN112347801A (en) * | 2020-10-27 | 2021-02-09 | 任玉海 | A kind of electronic chip information data analysis method
CN112764908B (en) * | 2021-01-26 | 2024-01-26 | 北京鼎普科技股份有限公司 | Network data acquisition processing method and device and electronic equipment
CN112764908A (en) * | 2021-01-26 | 2021-05-07 | 北京鼎普科技股份有限公司 | Network data acquisition processing method and device and electronic equipment
CN113055378B (en) * | 2021-03-11 | 2022-07-05 | 武汉虹信科技发展有限责任公司 | Protocol conversion platform for industrial internet identification analysis and data docking method
CN113055378A (en) * | 2021-03-11 | 2021-06-29 | 武汉虹信科技发展有限责任公司 | Protocol conversion platform for industrial internet identification analysis and data docking method
CN113297216A (en) * | 2021-05-17 | 2021-08-24 | 中国人民解放军63920部队 | Real-time storage method for space flight measurement and control data
CN113297216B (en) * | 2021-05-17 | 2024-06-11 | 中国人民解放军63920部队 | Real-time warehousing method for aerospace measurement and control data
CN114266259A (en) * | 2021-12-30 | 2022-04-01 | 中国民航信息网络股份有限公司 | Message processing method, system, electronic equipment and storage medium
CN116245089A (en) * | 2023-02-13 | 2023-06-09 | 四川神州行网约车服务有限公司 | Data template reverse analysis method, device, equipment and storage medium
CN116049293A (en) * | 2023-03-23 | 2023-05-02 | 北京沐融信息科技股份有限公司 | Method, device, equipment and medium for realizing analysis of CSV file based on database configuration
CN116049293B (en) * | 2023-03-23 | 2024-02-13 | 北京沐融信息科技股份有限公司 | Method, device, equipment and medium for realizing analysis of CSV file based on database configuration

Similar Documents

Publication | Title
CN110704381A (en) | Data analysis method, device and storage medium
CN108564339B (en) | Account management method, device, terminal equipment and storage medium
US9075893B1 (en) | Providing files with cacheable portions
CN109800207B (en) | Log parsing method, apparatus, device, and computer-readable storage medium
CN107453960B (en) | Method, device and system for processing test data in service test
CN108874558B (en) | Message subscription method of distributed transaction, electronic device and readable storage medium
WO2020177384A1 (en) | Method and apparatus for reporting and processing user message status of message pushing, and storage medium
EP3188051B1 (en) | Systems and methods for search template generation
CN110019350A (en) | Data query method and apparatus based on configuration information
CN102693242B (en) | Network comment information sharing method and system
KR20130066603A (en) | Initiating font subsets
CN113420032B (en) | A log classification storage method and device
US20180309842A1 (en) | Method, device, terminal, server and storage medium of processing network request and response
CN113449339A (en) | Log collection method, system, computer device and computer readable storage medium
CN113010542A (en) | Service data processing method and device, computer equipment and storage medium
CN112818026B (en) | Data integration method and device
CN113760242B (en) | Data processing method, device, server and medium
CN110737655A (en) | Method and device for reporting data
CN113886419B (en) | SQL sentence processing method, device, computer equipment and storage medium
CN108197465B (en) | A kind of website detection method and device
CN117193907B (en) | Page processing method and device
CN111782244A (en) | Configuration file update method, device, computer equipment and storage medium
CN113364848B (en) | File caching method and device, electronic equipment and storage medium
CN116450723A (en) | Data extraction method, device, computer equipment and storage medium
CN114861054A (en) | Information collection method, device, electronic device and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WD01 | Invention patent application deemed withdrawn after publication

Application publication date: 2020-01-17

