As the data in the table show, the asynchronous aiohttp-based multi-platform distributed data crawling method provided by the invention is substantially faster than the other two methods.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description set out only preferred embodiments and are not intended to limit the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A multi-platform distributed data crawling method based on asynchronous aiohttp, characterized in that the method comprises the following steps:

Step 1: a service center distributes url tasks to clients;

Step 2: each client reads its configuration from a configuration center;

Step 3: multiple clients download the content of multiple url tasks;

Step 4: the downloaded content is parsed, cleaned, and stored in a database;

Step 5: logs are collected by a log center;

Step 6: a monitoring center is opened to view resource status and results.

2. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that the url task distribution in step 1 specifically comprises: creating message topics, with the multiple platforms divided into multiple topics; distributing tasks across the platforms; deduplicating the url tasks of each single platform; and sending url tasks to the different topics.

3. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that the configuration read in step 2 covers: client proxy, retry on failure, custom request information, selection of synchronous or asynchronous run mode, timeout control, request time, task whitelist, client middleware, database type selection, performance settings, and request type.
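
For illustration only, the configuration items listed in claim 3 might be gathered into a single structure. The following is a minimal Python sketch; the class name `ClientConfig`, the helper `load_config`, the field names, and all default values are assumptions and do not appear in the claims.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ClientConfig:
    """Hypothetical container for the settings a client reads in step 2."""
    proxy: Optional[str] = None                  # client proxy
    max_retries: int = 3                         # retry on failure
    headers: Dict[str, str] = field(default_factory=dict)  # custom request information
    run_mode: str = "async"                      # "sync" or "async"
    timeout_seconds: float = 10.0                # timeout control
    request_interval_seconds: float = 0.0        # request time (assumed to mean a delay)
    task_whitelist: List[str] = field(default_factory=list)
    middlewares: List[str] = field(default_factory=list)   # client middleware names
    database_type: str = "sqlite"                # database type selection (assumed default)
    max_concurrency: int = 100                   # performance setting
    request_method: str = "GET"                  # request type

    def __post_init__(self) -> None:
        # Reject values that no client could act on.
        if self.run_mode not in ("sync", "async"):
            raise ValueError(f"unknown run_mode: {self.run_mode!r}")
        if self.max_retries < 0:
            raise ValueError("max_retries must be non-negative")
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")


def load_config(raw: dict) -> ClientConfig:
    """Build a ClientConfig from a raw mapping, ignoring unknown keys."""
    known = ClientConfig.__dataclass_fields__
    return ClientConfig(**{k: v for k, v in raw.items() if k in known})
```

A client could call `load_config` on whatever mapping the configuration center returns; keys it does not recognise are dropped rather than treated as errors.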

4. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that the content download in step 3 specifically comprises: using the aiohttp library to send a request for each url task and receiving the returned data.
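
The download operation of claim 4 could look roughly like the sketch below, which sends a GET request per url task through an aiohttp `ClientSession` and returns the raw bodies. The function names `fetch` and `download_all` and the 10-second default timeout are assumptions; the claims do not specify the request method or any timeout value.

```python
import asyncio
from typing import List


async def fetch(session, url: str) -> bytes:
    """Send a GET request for one url task and return the response body."""
    async with session.get(url) as response:
        response.raise_for_status()      # treat 4xx/5xx responses as failures
        return await response.read()     # raw bytes; decoding is left to the parser


async def download_all(urls: List[str], timeout_seconds: float = 10.0) -> List[bytes]:
    """Download every url concurrently over one shared session."""
    # aiohttp is third-party; it is imported lazily so that fetch() can be
    # exercised with a stand-in session where aiohttp is not installed.
    import aiohttp

    timeout = aiohttp.ClientTimeout(total=timeout_seconds)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

A client would run this with `asyncio.run(download_all(urls))`; sharing one session across requests lets aiohttp reuse connections.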

5. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that the data cleaning in step 4 specifically comprises: parsing the returned data, using a different library for each type of content, wherein JSON data is parsed with the json library, HTML is parsed with an xpath library, other text is extracted with the re regular-expression library, and image and video byte streams are saved in binary form; and storing the cleaned data in the database.
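
The per-type parsing of claim 5 can be sketched as a dispatch on the response's content type. In the example below, the function name `parse_body`, the use of the `Content-Type` value as the dispatch key, the choice of lxml as the xpath library, and the default XPath expression and regular expression are all assumptions made for illustration.

```python
import json
import re
from typing import Any


def parse_body(
    content_type: str,
    body: bytes,
    xpath: str = "//title/text()",   # hypothetical default expression
    pattern: str = r"\S+",           # hypothetical default pattern
) -> Any:
    """Parse a downloaded body with the library suited to its type."""
    if content_type.startswith("application/json"):
        return json.loads(body.decode("utf-8"))
    if content_type.startswith("text/html"):
        # lxml is a third-party package, assumed here as the xpath library.
        from lxml import html
        return html.fromstring(body).xpath(xpath)
    if content_type.startswith("text/"):
        return re.findall(pattern, body.decode("utf-8"))
    # Images, video, and other byte streams are kept as raw binary.
    return body
```

For example, `parse_body("text/plain", b"id=42 id=7", pattern=r"id=(\d+)")` returns `["42", "7"]`.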

6. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that the log collection in step 5 specifically comprises: using the logging module to handle the client's different log levels, namely the five levels DEBUG, INFO, WARNING, ERROR, and CRITICAL, and recording logs in different ways, for example writing log information to files or sending it via HTTP GET/POST, SMTP, or sockets, the log files generally being .log files.
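
As one possible reading of claim 6, the sketch below configures a client logger with Python's standard logging module so that records at INFO and above are written to a .log file. The helper name `build_client_logger`, the INFO threshold, and the format string are assumptions.

```python
import logging
from typing import Iterable


def build_client_logger(
    name: str,
    log_path: str,
    extra_handlers: Iterable[logging.Handler] = (),
) -> logging.Logger:
    """Create a logger that writes INFO and above to a .log file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)    # pass everything on; handlers filter by level
    logger.handlers.clear()           # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.INFO)   # DEBUG records are not written to the file
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    # Optional additional transports supplied by the caller.
    for handler in extra_handlers:
        logger.addHandler(handler)
    return logger
```

The standard library's `logging.handlers` module provides `HTTPHandler`, `SMTPHandler`, and `SocketHandler`, which could be passed through `extra_handlers` to cover the other transports named in the claim.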

7. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that the monitoring in step 6 comprises: monitoring the availability of client resources, including recording server problems and sending notifications on downtime; analyzing server resource trends and system activity; tracking the volume of crawled data stored in the database and the state of client logging; and providing a web interface for configuring clients and viewing results.

8. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 2, characterized in that multi-platform task distribution relies on the kafka publish-subscribe messaging model, in which a message sent by a publisher to a topic is received only by subscribers of that topic, so that the platforms can be kept separate; and task deduplication relies on the built-in deduplication of the redis set type.
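
The two mechanisms named in claim 8 can be illustrated without a running broker. The sketch below is an in-memory stand-in, not Kafka or Redis client code: a dictionary of Python sets plays the role of one Redis set per platform, and lists of callbacks play the role of topic subscribers. The class name `TaskDistributor` and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


class TaskDistributor:
    """In-memory stand-in for topic-based distribution with per-platform dedup."""

    def __init__(self) -> None:
        # One set of already-seen urls per topic (stand-in for a Redis set).
        self._seen: DefaultDict[str, Set[str]] = defaultdict(set)
        # Callbacks registered per topic (stand-in for topic subscribers).
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register a client callback for one platform's topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, url: str) -> bool:
        """Send a url task to a topic; return False if it is a duplicate there."""
        if url in self._seen[topic]:
            return False                  # already distributed for this platform
        self._seen[topic].add(url)
        for callback in self._subscribers[topic]:
            callback(url)                 # only this topic's subscribers are notified
        return True
```

Publishing the same url twice to one topic delivers it once, while publishing it to a second topic still delivers it there, which matches deduplication per single platform as stated in claim 2.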

9. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that aiohttp relies on async (asynchronous) features to send requests, so that network I/O does not block, thereby achieving high concurrency and high availability.
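
The non-blocking behaviour described in claim 9 can be illustrated with asyncio alone. In the sketch below, `asyncio.sleep` stands in for the time spent waiting on the network; it is an assumption used so the example runs without network access, and the names `fake_request` and `run_concurrently` are hypothetical. In a real client the awaited call would be an aiohttp request.

```python
import asyncio
from typing import List


async def fake_request(url: str, latency: float) -> str:
    """Simulate one request; awaiting yields control to the event loop."""
    await asyncio.sleep(latency)   # stand-in for waiting on network I/O
    return url


async def run_concurrently(urls: List[str], latency: float = 0.05) -> List[str]:
    """Start all simulated requests at once and collect results in input order."""
    return await asyncio.gather(*(fake_request(url, latency) for url in urls))
```

Because each coroutine yields while it waits, the waits overlap: the total time for many simulated requests is close to a single latency rather than the sum of all latencies.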

10. The multi-platform distributed data crawling method based on asynchronous aiohttp according to claim 1, characterized in that the configuration center works by starting one service as the server, from which every service that needs configuration obtains it as a client, so that tens of thousands of clients share a uniform configuration, giving a unified platform, highly available clients, and low maintenance cost; and aiohttp, xpath, json, logging, kafka, and redis are all open-source libraries.