Disclosure of Invention
The invention aims to provide a multi-platform distributed data crawling method based on asynchronous aiohttp, so as to solve the problems described in the background art.
To achieve this purpose, the invention provides the following technical solution:
A multi-platform distributed data crawling method based on asynchronous aiohttp comprises the following steps:
step one, a service center distributes url tasks to clients;
step two, each client reads its configuration from a configuration center;
step three, a plurality of clients download the content of a plurality of url tasks;
step four, the downloaded content is parsed, cleaned and stored in a database;
step five, logs are collected by a log center;
and step six, a monitoring center is opened to check the resource status and the results.
As a preferred technical solution of the invention, the url task distribution in step one specifically operates as follows: message topics are created, with one topic assigned to each of the plurality of platforms; tasks are distributed across the platforms; the url tasks of each single platform are deduplicated; and the url tasks are sent on their respective topics.
As a preferred technical solution of the invention, the configuration read in step two covers: client agent, failure retry, custom request information, selection of synchronous or asynchronous operation mode, timeout control, request time, task whitelist, client middleware, database type selection, performance settings, and request type.
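By way of non-limiting illustration, the configuration items listed above could be held by a client in a structure such as the following Python sketch. Every field name and default value is a hypothetical choice made for this example; in particular, "client agent" is interpreted here as a proxy address and "request time" as a delay between requests, neither of which is specified in the text.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ClientConfig:
    # All names and defaults below are illustrative assumptions.
    proxy: Optional[str] = None                    # client agent (read as a proxy address)
    max_retries: int = 3                           # failure retry
    headers: dict = field(default_factory=dict)    # custom request information
    run_mode: str = "async"                        # "sync" or "async" operation mode
    timeout_s: float = 10.0                        # timeout control
    request_interval_s: float = 0.0                # request time (read as a delay)
    task_whitelist: list = field(default_factory=list)
    middleware: list = field(default_factory=list)  # client middleware names
    database: str = "sqlite"                       # database type selection
    max_concurrency: int = 500                     # performance setting
    method: str = "GET"                            # request type

    @classmethod
    def from_json(cls, text):
        """Build a config from a JSON object, rejecting unknown keys."""
        data = json.loads(text)
        unknown = set(data) - {f.name for f in fields(cls)}
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        return cls(**data)
```

Rejecting unknown keys is a design choice of this sketch: a misspelled item then fails at start-up instead of being silently ignored.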
As a preferred technical solution of the invention, the content download in step three specifically operates as follows: a request is sent for each url task using the aiohttp library, and the returned data is received.
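The download step can be sketched as follows, purely as an example and not as the implementation of the invention. The function names `fetch` and `download_all` and the 10-second timeout are assumptions; `aiohttp.ClientSession`, `aiohttp.ClientTimeout`, `session.get` and `resp.read` are existing aiohttp client calls.

```python
import asyncio


async def fetch(session, url):
    """Request one url task and return (url, HTTP status, raw body bytes)."""
    async with session.get(url) as resp:
        body = await resp.read()
        return url, resp.status, body


async def download_all(urls, timeout_s=10.0):
    """Download every url task concurrently on a single event loop."""
    import aiohttp  # third-party; imported lazily so fetch() can be tested without it

    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # return_exceptions=True reports a failed url in place of its result
        # instead of aborting the whole batch.
        return await asyncio.gather(
            *(fetch(session, u) for u in urls), return_exceptions=True
        )
```

A caller would run `asyncio.run(download_all(list_of_urls))` and inspect each returned item, which is either a `(url, status, body)` tuple or the exception raised for that url.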
As a preferred technical solution of the invention, the data cleaning in step four specifically operates as follows: the returned data is parsed, with a different library used for each type of content: data in json format is parsed with the json library, html is parsed with an xpath library, other text is extracted with the re regular-expression library, and picture and video byte streams are stored in binary form; the cleaned data is then stored in the database.
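One possible way to route returned data to a parser by type is sketched below. Selecting the parser from the Content-Type value, the function name `parse_body`, the default XPath expression and the default pattern are all assumptions of this example; lxml is used as the xpath library only because it is a common choice, not because the text names it.

```python
import json
import re


def parse_body(content_type, body, xpath_expr="//title/text()", pattern=r"\S+"):
    """Route a downloaded body (bytes) to a parser chosen by its content type.

    xpath_expr and pattern are hypothetical defaults for illustration.
    """
    kind = content_type.split(";")[0].strip().lower()
    if kind == "application/json":
        return json.loads(body.decode("utf-8"))
    if kind == "text/html":
        from lxml import html  # third-party; provides XPath queries over HTML
        return html.fromstring(body).xpath(xpath_expr)
    if kind.startswith("text/"):
        return re.findall(pattern, body.decode("utf-8", errors="replace"))
    # Pictures, video and anything else stay as raw bytes for binary storage.
    return body
```

The result of each branch would then be cleaned and written to the database selected in the configuration.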
As a preferred technical solution of the invention, the log collection in step five specifically operates as follows: the logging module is used to record the client's logs in different ways according to log level, for example the five levels DEBUG, INFO, WARNING, ERROR and CRITICAL; log information may be recorded to files or delivered via HTTP GET/POST, SMTP or Socket, and is generally stored as .log files.
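As an illustrative sketch of level-dependent log handling with the standard logging module, the following routes every record to a file and, optionally, only ERROR and above to a socket. The function name `build_logger`, the format string and the choice of which level goes to which destination are assumptions of this example.

```python
import logging
import logging.handlers


def build_logger(name, log_path, socket_addr=None):
    """Return a logger writing DEBUG+ to a file and, optionally, ERROR+ to a socket."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of the root logger's handlers

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(fmt)
    logger.addHandler(file_handler)

    if socket_addr is not None:
        host, port = socket_addr
        # SocketHandler connects lazily, on the first record it has to send.
        sock_handler = logging.handlers.SocketHandler(host, port)
        sock_handler.setLevel(logging.ERROR)
        logger.addHandler(sock_handler)
    return logger
```

The HTTP GET/POST and SMTP modes could be added in the same way with `logging.handlers.HTTPHandler` and `logging.handlers.SMTPHandler`, each given its own level.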
As a preferred technical solution of the invention, the monitoring in step six covers: the availability of client resources, with server problems recorded and notified when a system goes down; analysis of server resource trends and system activity; the volume of data crawled and stored in the database; the state of client logging; and a WEB interface for configuring the clients and viewing the results.
As a preferred technical solution of the invention, the principle of multi-platform task distribution is to use the publish-subscribe messaging model of kafka: a publisher sends a message to a topic, and only the subscribers of that topic receive it, so that the platforms can be kept apart; the principle of task deduplication is to use the set type of redis, which deduplicates its members inherently.
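The distribution and deduplication logic can be illustrated without a running broker by the in-memory stand-in below. It contacts neither kafka nor redis: a dictionary of lists plays the role of the topics and one Python set per platform plays the role of a redis set. The class name, the `crawl.<platform>` topic naming and the host-to-platform map are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class TaskDistributor:
    """In-memory stand-in for topic-based distribution with per-platform dedup."""

    def __init__(self, platform_by_host):
        self.platform_by_host = platform_by_host  # hypothetical host -> platform map
        self.topics = defaultdict(list)           # topic name -> queued url tasks
        self.seen = defaultdict(set)              # platform -> urls already sent

    def publish(self, url):
        """Queue url on its platform's topic; return False if it is a duplicate."""
        platform = self.platform_by_host.get(urlsplit(url).hostname)
        if platform is None:
            raise KeyError(f"no platform configured for {url}")
        if url in self.seen[platform]:
            return False
        self.seen[platform].add(url)
        self.topics[f"crawl.{platform}"].append(url)
        return True

    def consume(self, platform):
        """Return only the tasks published on the topic this platform subscribes to."""
        return list(self.topics[f"crawl.{platform}"])
```

With real services, `publish` would hand the url to a kafka producer, and the membership test would become a redis `SADD`, whose return value is the number of members newly added and therefore tells the caller whether the url was already present.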
As a preferred technical solution of the invention, the principle of aiohttp is to send requests asynchronously using async, so that waiting on network io does not block other requests, thereby achieving high concurrency and high availability.
As a preferred technical solution of the invention, the principle of the configuration center is that one service is started as the server, and every service that needs configuration acts as a client and obtains its configuration from that server; the configuration of tens of thousands of clients is thereby unified, the platform is uniform, the clients are highly available and the maintenance cost is low. aiohttp, xpath, json, logging, kafka and redis are all open source.
Compared with the prior art, the invention has the beneficial effects that:
According to the invention, multi-platform task distribution uses the publish-subscribe messaging model of kafka, in which only the subscribers of a topic receive the messages sent to it, so that the platforms can be kept apart, and task deduplication uses the set type of redis, which deduplicates its members inherently; requests are sent asynchronously using async, so that network io does not block, achieving high concurrency and high availability; one service is started as the server and every service that needs configuration obtains it from that server as a client, so that the configuration of tens of thousands of clients is unified, the platform is uniform, the clients are highly available and the maintenance cost is low. The method can thereby effectively increase the data crawling speed.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the following embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the equipment or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Examples
The technical solution provided by this embodiment is as follows:
a multi-platform distributed data crawling method based on asynchronous aiohttp, as shown in fig. 1, comprises the following steps:
step one, a service center distributes url tasks to clients;
step two, each client reads its configuration from a configuration center;
step three, a plurality of clients download the content of a plurality of url tasks;
step four, the downloaded content is parsed, cleaned and stored in a database;
step five, logs are collected by a log center;
and step six, a monitoring center is opened to check the resource status and the results.
As a preferred technical solution of this embodiment, as shown in fig. 2, the url task distribution in step one specifically operates as follows: message topics are created, with one topic assigned to each of the plurality of platforms; tasks are distributed across the platforms; the url tasks of each single platform are deduplicated; and the url tasks are sent on their respective topics.
As a preferred technical solution of this embodiment, the configuration read in step two covers: client agent, failure retry, custom request information, selection of synchronous or asynchronous operation mode, timeout control, request time, task whitelist, client middleware, database type selection, performance settings, and request type.
As a preferred technical solution of this embodiment, as shown in fig. 4, the content download in step three specifically operates as follows: a request is sent for each url task using the aiohttp library, and the returned data is received.
As a preferred technical solution of this embodiment, as shown in fig. 5, the data cleaning in step four specifically operates as follows: the returned data is parsed, with a different library used for each type of content: data in json format is parsed with the json library, html is parsed with an xpath library, other text is extracted with the re regular-expression library, and picture and video byte streams are stored in binary form; the cleaned data is then stored in the database.
As a preferred technical solution of this embodiment, as shown in fig. 6, the log collection in step five specifically operates as follows: the logging module is used to record the client's logs in different ways according to log level, for example the five levels DEBUG, INFO, WARNING, ERROR and CRITICAL; log information may be recorded to files or delivered via HTTP GET/POST, SMTP or Socket, and is generally stored as .log files.
As a preferred technical solution of this embodiment, as shown in fig. 7, the monitoring in step six covers: the availability of client resources, with server problems recorded and notified when a system goes down; analysis of server resource trends and system activity; the volume of data crawled and stored in the database; the state of client logging; and a WEB interface for configuring the clients and viewing the results.
As a preferred technical solution of this embodiment, the principle of multi-platform task distribution is to use the publish-subscribe messaging model of kafka: a publisher sends a message to a topic, and only the subscribers of that topic receive it, so that the platforms can be kept apart; the principle of task deduplication is to use the set type of redis, which deduplicates its members inherently.
As a preferred technical solution of this embodiment, the principle of aiohttp is to send requests asynchronously using async, so that waiting on network io does not block other requests, thereby achieving high concurrency and high availability.
As a preferred technical solution of this embodiment, as shown in fig. 3, the principle of the configuration center is that one service is started as the server, and every service that needs configuration acts as a client and obtains its configuration from that server, so that the configuration of tens of thousands of clients is unified, the platform is uniform, the clients are highly available and the maintenance cost is low.
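The server-and-client relationship of the configuration center can be sketched with the Python standard library alone, as below. This is a minimal illustration under stated assumptions: the `/config` path, the JSON payload, the contents of `SHARED_CONFIG` and all function names are hypothetical, and a production configuration center would add authentication, versioning and change notification.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical shared configuration held by the configuration service.
SHARED_CONFIG = {"max_retries": 3, "timeout_s": 10.0, "run_mode": "async"}


class ConfigHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/config":
            self.send_error(404)
            return
        payload = json.dumps(SHARED_CONFIG).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # suppress per-request output to keep the sketch quiet


def start_config_center(host="127.0.0.1", port=0):
    """Start the service in a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer((host, port), ConfigHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def read_config(base_url):
    """What each client does at start-up: pull the shared configuration."""
    with urlopen(f"{base_url}/config", timeout=5) as resp:
        return json.load(resp)
```

Because every client calls the same endpoint, changing `SHARED_CONFIG` on the server changes what all clients read on their next fetch, which is the unification described above.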
As a preferred technical solution of this embodiment, aiohttp, xpath, json, logging, kafka and redis are all open source.
The implementation environment of the asynchronous aiohttp-based multi-platform distributed data crawling method is as follows: operating system: Windows 10; CPU: i7-8700; memory: 24 GB; process: 1 process, 1 thread, with 500 coroutines issuing requests at the same time.
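How a single process and a single thread can keep 500 requests in flight is illustrated by the following sketch, which replaces real network requests with `asyncio.sleep` so that it runs without any server. The function names and the 0.05-second simulated latency are assumptions; the sketch does not reproduce the measurements of this embodiment.

```python
import asyncio
import time


async def fake_request(i, delay_s):
    # Stand-in for a network request: sleeping yields the event loop to other
    # coroutines, as awaiting a socket would.
    await asyncio.sleep(delay_s)
    return i


async def run_batch(n_tasks=500, limit=500, delay_s=0.05):
    """Run n_tasks simulated requests with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(i):
        async with sem:
            return await fake_request(i, delay_s)

    start = time.perf_counter()
    results = await asyncio.gather(*(bounded(i) for i in range(n_tasks)))
    return results, time.perf_counter() - start
```

Calling `asyncio.run(run_batch())` returns the results in task order together with the elapsed time; because the waits overlap, the elapsed time stays far below the 25 seconds that 500 sequential 0.05-second waits would take.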
Comparative example 1
A data crawling method based on single-platform synchronous operation comprises the following steps:
step one, a service center distributes url tasks to a client;
step two, the client reads its configuration from a configuration center;
step three, a single client downloads the content of a single url task;
step four, the downloaded content is parsed, cleaned and stored in a database;
step five, logs are collected by a log center;
and step six, a monitoring center is opened to check the resource status and the results.
Comparative example 2
A data crawling method based on single-platform asynchronous aiohttp operation comprises the following steps:
step one, a service center distributes url tasks to a client;
step two, the client reads its configuration from a configuration center;
step three, a single client downloads the content of a single url task;
step four, the downloaded content is parsed, cleaned and stored in a database;
step five, logs are collected by a log center;
and step six, a monitoring center is opened to check the resource status and the results.
The embodiment differs from comparative example 1 in that comparative example 1 does not employ the aiohttp principle, and its number of url tasks and number of processing clients differ from those of the embodiment; the rest is the same.
The embodiment differs from comparative example 2 in that the number of url tasks and the number of processing clients of comparative example 2 differ from those of the embodiment; the rest is the same.
An experimental comparison was performed between the asynchronous aiohttp-based multi-platform distributed data crawling method provided by the invention, the traditional data crawling method based on single-platform synchronous operation and the traditional data crawling method based on single-platform asynchronous operation; as shown in fig. 8, the following data were obtained:
According to the data in the table, the crawling speed of the asynchronous aiohttp-based multi-platform distributed data crawling method provided by the invention is greatly improved compared with the other two methods.
The foregoing shows and describes the general principles, essential features and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the above embodiments and the description merely illustrate preferred embodiments of the invention and are not intended to limit it. The scope of the invention is defined by the appended claims and their equivalents.