Summary of the invention
The main purpose of the present application is to provide a distributed crawler system architecture, a method for crawling data, and a computer device, intended to solve the problems in the prior art that distributed crawler systems are unstable and poorly generalizable once built, and that development and maintenance by developers are inefficient.
To achieve the above object, the present application proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration: different modules are isolated from one another and access one another through message queues. The architecture includes:
a task publishing module, configured to publish crawler tasks;
a crawler service module, configured to store different crawler services that exist in the form of services, where different crawler services complete different crawler tasks;
a crawler module, configured to receive a crawler task published by the task publishing module, call, according to the crawler task, the crawler service in the crawler service module that corresponds to the crawler task, and use that crawler service to crawl a target website to obtain corresponding raw crawler data;
a first data storage module, configured to store the raw crawler data;
a data cleaning module, configured to clean the raw crawler data in the first data storage module to obtain filtered first crawler data;
a second data storage module, configured to store the first crawler data;
a back-end management module, configured to provide a visual interface on which human-computer interaction is carried out; and
a log and error handling module, configured to obtain the logs generated by the other modules in the system architecture, extract the error logs from those logs, and handle the events corresponding to the error logs according to preset rules.
The present application also provides a method for crawling data with a distributed crawler, based on the above distributed crawler system architecture, the method including:
obtaining a crawler task using the task publishing module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawl requirement;
after the crawler module obtains the crawler task, calling, in the crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service; and
storing the crawled raw crawler data in a preset first storage module.
Further, the step of sending the crawler task to the crawler module includes:
sending, by the task publishing module, the crawler task to the crawler module in the form of a message queue.
Further, after the step of storing the crawled raw crawler data in the preset first storage module, the method includes:
cleaning the raw crawler data in the first storage module using the data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.
Further, the method also includes:
obtaining log data of the other modules in the distributed crawler system architecture using the log and error handling module, and extracting the error logs from the log data; and
handling the events corresponding to the error logs according to preset rules.
Further, after the step of handling the event corresponding to the error log according to preset rules, the method includes:
generating an error report for the corresponding event using the log and error handling module, and sending the error report to a preset mailbox.
Further, the step of handling the event corresponding to the error log according to preset rules includes:
determining, using the log and error handling module, whether the event is a crawl failure; and
if the event is a crawl failure, re-publishing the crawler task corresponding to the event.
Further, the method also includes:
determining whether a management command from the back-end management module has been received; and
if so, processing the management command with priority.
The present application also provides a computer device including a memory and a processor, where the memory stores a computer program and the processor implements the steps of any of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the above methods.
In the distributed crawler system architecture, the method for crawling data with a distributed crawler, the computer device, and the computer-readable storage medium of the present application, the architecture is designed around HTTP service registration: different modules are isolated from one another and access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level requirements of the whole crawler system as modular services, which reduces developers' workload, places no restriction on the development language they use, and lowers the skill required of them. The architecture improves the stability and extensibility of the crawler system and suits the development of large-scale, multi-task crawler systems. The visual back-end management module makes operation and maintenance of the whole system more reliable and efficient.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and are not intended to limit it.
Referring to Fig. 1, the present application first proposes a distributed crawler system architecture. The architecture is designed around HTTP service registration: different modules are isolated from one another and access one another through message queues. The architecture includes:
a task publishing module 10, configured to publish crawler tasks;
a crawler service module 20, configured to store different crawler services that exist in the form of services, where different crawler services complete different crawler tasks;
a crawler module 30, configured to receive a crawler task published by the task publishing module, call, according to the crawler task, the crawler service in the crawler service module that corresponds to the crawler task, and use that crawler service to crawl a target website to obtain corresponding raw crawler data;
a first data storage module 40, configured to store the raw crawler data;
a data cleaning module 50, configured to clean the raw crawler data in the first data storage module to obtain filtered first crawler data;
a second data storage module 60, configured to store the first crawler data;
a back-end management module 70, configured to provide a visual interface on which human-computer interaction is carried out; and
a log and error handling module 80, configured to obtain the logs generated by the other modules in the system architecture, extract the error logs from those logs, and handle the events corresponding to the error logs according to preset rules.
In this embodiment, the architecture is designed around HTTP service registration: different modules are isolated from one another and access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. In the architecture, Docker containerization is used to package and integrate the system environment, the module services, and the storage system, so that the system can be deployed and started with a single script. When the system needs to be deployed to a new environment, migrating the container files completes the migration, and running the startup script completes the deployment. In the architecture, crawler programs are not constrained by the development language of the underlying system and need not all be written in one unified language; the system can provide basic software support for using the modules from different development languages. A crawler developer therefore only needs to write the code that carries out the business logic, form it into a corresponding crawler service, and pass it into the crawler service module to complete the development, maintenance, and so on of an entire crawler program.
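By way of illustration only, the HTTP service registration described above could be sketched in Python as follows. The `ServiceRegistry` class, the module names, and the URLs are hypothetical and are not specified in the present application; an in-memory dictionary stands in for whatever registration service an implementation would actually use.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ServiceRegistry:
    """Hypothetical in-memory registry mapping module names to HTTP endpoints."""

    endpoints: Dict[str, str] = field(default_factory=dict)

    def register(self, module_name: str, url: str) -> None:
        # A module announces where it can be reached, so that callers
        # never hard-code another module's address.
        self.endpoints[module_name] = url

    def resolve(self, module_name: str) -> str:
        # Raises KeyError if the module has not registered itself.
        return self.endpoints[module_name]


# Assumed module names and addresses, for illustration only.
registry = ServiceRegistry()
registry.register("crawler_service", "http://crawler-service.internal:8000")
registry.register("data_cleaning", "http://data-cleaning.internal:8001")
```

Because each module is looked up by name, a module can be moved or replaced by re-registering it, without changes to the modules that call it.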
Referring to Fig. 2, an embodiment of the present application also provides a method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the above embodiment, including the steps of:
S1, obtaining a crawler task using the task publishing module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawl requirement;
S2, after the crawler module obtains the crawler task, calling, in the crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service;
S3, storing the crawled raw crawler data in a preset first storage module.
As described in step S1, the crawler task is a task that includes a target website and a crawl requirement. The target website is the data source for this crawl. The crawl requirement specifies the kind of data to be crawled, for example specified data, or the data of a specified function of the target website. A crawler task can be obtained in several ways, for example by receiving a crawler task entered directly by a user or by receiving a crawler task generated by the system. A single crawler task may include multiple crawl requirements, for example a requirement to crawl login data together with a requirement to crawl image recognition data for a verification code.
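As a minimal sketch, a crawler task of this kind could be represented as below. The `CrawlTask` class, its field names, and the JSON encoding are assumptions made for illustration; the present application does not prescribe a task format.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrawlTask:
    """Hypothetical representation of a crawler task."""

    target_site: str  # the data source for this crawl
    requirements: List[str] = field(default_factory=list)  # one or more crawl requirements

    def to_message(self) -> str:
        # Serialise to JSON so the task can be carried by a message queue.
        return json.dumps(
            {"target_site": self.target_site, "requirements": self.requirements}
        )

    @classmethod
    def from_message(cls, raw: str) -> "CrawlTask":
        # Rebuild the task on the receiving side.
        data = json.loads(raw)
        return cls(target_site=data["target_site"], requirements=list(data["requirements"]))


# A task with two crawl requirements (names are illustrative only).
task = CrawlTask("https://example.com", ["simulated_login", "captcha_recognition"])
```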
As described in step S2, a crawler service is a service capable of completing a corresponding crawl task. One or more preset crawler services are provided in the crawler service module. The services in the crawler service module are usually common services that correspond to crawl requirements, such as a simulated login service, a verification code image recognition service, and an IP proxy pool maintenance service. In a specific embodiment, the crawler service module holds a call list that stores crawl requirements and crawler services in a one-to-one mapping. After the crawl requirement of a crawler task is obtained, the call list is first searched for the identical crawl requirement, the target crawler service is then obtained through the mapping, and the target crawler service is finally called. When the crawler task includes multiple crawl requirements, the corresponding crawler services are called together. The target crawler service is then used to crawl data from the target website.
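The call-list lookup described above could be sketched as follows. The service functions, the requirement names, and the use of a thread pool to call several services together are all assumptions made for illustration; real crawler services would perform network work rather than return strings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


# Hypothetical crawler services standing in for real ones.
def simulated_login(site: str) -> str:
    return f"logged in to {site}"


def captcha_recognition(site: str) -> str:
    return f"captcha solved for {site}"


# The call list: a one-to-one mapping from crawl requirement to crawler service.
CALL_LIST: Dict[str, Callable[[str], str]] = {
    "simulated_login": simulated_login,
    "captcha_recognition": captcha_recognition,
}


def dispatch(site: str, requirements: List[str]) -> List[str]:
    # Look up every requirement first; an unknown requirement raises KeyError.
    services = [CALL_LIST[requirement] for requirement in requirements]
    # Call the matching services together and return results in request order.
    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        futures = [pool.submit(service, site) for service in services]
        return [future.result() for future in futures]
```

For example, `dispatch("example.com", ["simulated_login", "captcha_recognition"])` looks up both services in the call list and runs them against the same target website.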
As described in step S3, the crawled data is stored in the first data storage module. The first storage module is generally a file storage system, which is relatively inexpensive and keeps storage costs down.
In one embodiment, the step of sending the crawler task to the crawler module includes:
S101, sending, by the task publishing module, the crawler task to the crawler module in the form of a message queue.
As described in step S101, a message queue is a container. Sending crawler tasks in the form of a message queue allows rapid horizontal, distributed scaling when handling large-scale crawler tasks and improves the capacity for processing crawler tasks.
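The effect of step S101 can be illustrated with a small sketch in which an in-process `queue.Queue` stands in for a real message broker, which the present application does not name. The function `run_crawl`, the worker count, and the placeholder "crawled" result are hypothetical.

```python
import queue
import threading
from typing import List


def run_crawl(tasks: List[str], worker_count: int = 3) -> List[str]:
    task_queue: "queue.Queue" = queue.Queue()
    results: List[str] = []
    lock = threading.Lock()

    def crawler_worker() -> None:
        while True:
            task = task_queue.get()
            if task is None:  # sentinel: no more tasks for this worker
                break
            with lock:
                # Placeholder for the actual crawl of the target website.
                results.append(f"crawled {task}")

    # Horizontal scaling amounts to raising worker_count.
    workers = [threading.Thread(target=crawler_worker) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    # The publisher only puts tasks on the queue; it never calls a worker directly.
    for task in tasks:
        task_queue.put(task)
    for _ in workers:
        task_queue.put(None)
    for worker in workers:
        worker.join()
    return sorted(results)
```

The publisher and the workers share nothing but the queue, which is what allows more crawler workers to be added without changing the publishing side.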
In one embodiment, after step S3 of storing the crawled raw crawler data in the preset first storage module, the method includes:
S4, cleaning the raw crawler data in the first storage module using the data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.
As described in step S4, the data cleaning module may apply a variety of cleaning rules, such as removing duplicate data or removing incomplete data; it may also first filter out the data that is needed and then remove duplicates and so on. The second storage module may be a sub-database set up within the first storage module, for example a folder in the first storage module. In a specific embodiment, the second storage module is a database independent of the first storage module; it costs more than the first storage module but makes the data more convenient to manage. Because the volume of raw crawler data is large, it is kept in the low-cost first storage module; the cleaned first crawler data is comparatively small in volume, so it is kept in the second storage module, which is easier to manage but more expensive.
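Two of the cleaning rules mentioned above, removing incomplete data and removing duplicate data, could be sketched as follows. The record layout and the choice of `url` and `title` as required fields are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical definition of a "complete" record.
REQUIRED_FIELDS = ("url", "title")


def clean(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    seen = set()
    cleaned = []
    for record in records:
        # Remove incomplete data: any required field missing or empty.
        if any(not record.get(name) for name in REQUIRED_FIELDS):
            continue
        # Remove duplicate data, keyed on the required fields.
        key = tuple(record[name] for name in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

The first occurrence of each record is kept, so the cleaned first crawler data preserves the order in which the raw data was crawled.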
In one embodiment, the method for crawling data with a distributed crawler also includes:
S5, obtaining log data of the other modules in the distributed crawler system architecture using the log and error handling module, and extracting the error logs from the log data;
S6, handling the events corresponding to the error logs according to preset rules.
In this embodiment, the method for crawling data with a distributed crawler relies on the distributed crawler system architecture described above. Executing steps such as cleaning the raw crawler data or crawling data generates corresponding log data. The present application collects this log data, uses an existing log analysis method to filter out the error logs in each set of log data, then finds the event corresponding to each error log and handles it automatically, for example by automatically repeating the step that generated the error log.
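Filtering error logs out of collected log data could look like the sketch below. The log line format `<LEVEL> <module> <message>` and the function name `extract_errors` are assumed for illustration; the present application leaves the log analysis method open.

```python
import re
from typing import Dict, List

# Hypothetical log line format: "<LEVEL> <module> <message>".
LOG_PATTERN = re.compile(r"^(?P<level>[A-Z]+) (?P<module>\S+) (?P<message>.*)$")


def extract_errors(lines: List[str]) -> List[Dict[str, str]]:
    errors = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        # Lines that do not follow the assumed format are ignored.
        if match and match.group("level") == "ERROR":
            errors.append(
                {"module": match.group("module"), "message": match.group("message")}
            )
    return errors
```

Each returned entry names the module that produced the error, which is the information needed to find the corresponding event and repeat the failed step.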
In one embodiment, after step S6 of handling the event corresponding to the error log according to preset rules, the method includes:
S7, generating an error report for the corresponding event using the log and error handling module, and sending the error report to a preset mailbox.
As described in step S7, mail content is generated from the error log, the result of handling the event, and so on according to preset requirements, and the mail is then sent to the preset mailbox. The mailbox may be that of a designated developer. There may also be several different mailboxes, each corresponding to one developer, so that developers learn of errors promptly. Further, read receipts are received from the mailboxes; as soon as one receipt is received, the mail sent to the other mailboxes, which do not correspond to that receipt, is recalled, to prevent multiple developers from working on the same problem at the same time after seeing the mail.
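A rough sketch of step S7 follows, using the `email.message.EmailMessage` class from the Python standard library to build the report. The sender address, the subject format, and the function names are hypothetical, and the sketch only builds the message and computes which mailboxes a recall would apply to; how mail is actually sent or recalled depends on the mail system and is not specified here.

```python
from email.message import EmailMessage
from typing import List


def build_error_report(
    event: str, error_log: str, outcome: str, recipients: List[str]
) -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = f"[crawler] error report: {event}"
    msg["From"] = "crawler-alerts@example.com"  # hypothetical sender address
    msg["To"] = ", ".join(recipients)
    # The body carries the error log and the result of handling the event.
    msg.set_content(
        f"Event: {event}\nError log: {error_log}\nHandling result: {outcome}\n"
    )
    return msg


def mailboxes_to_recall(recipients: List[str], first_reader: str) -> List[str]:
    # Once one developer has opened the report, the copies sent to the
    # other mailboxes are the ones to recall.
    return [recipient for recipient in recipients if recipient != first_reader]
```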
In one embodiment, step S6 of handling the event corresponding to the error log according to preset rules includes:
S71, determining, using the log and error handling module, whether the event is a crawl failure;
S72, if the event is a crawl failure, re-publishing the crawler task corresponding to the event.
As described in steps S71 and S72, when a crawl fails, the developer can be notified by mail promptly, the cause of the error is recorded, and the error handling logic puts the crawler task back into the message queue so that it is crawled again. This improves the stability of the process and provides automated operation and maintenance.
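Putting a failed crawler task back into the message queue could be sketched as below. The retry cap `MAX_ATTEMPTS`, the `(site, attempt)` task tuple, and the function name are assumptions; the present application does not state a retry limit, and one is added here only so that the sketch terminates for a site that never succeeds.

```python
import queue
from typing import Callable, Dict, List

MAX_ATTEMPTS = 3  # hypothetical cap; not specified in the present application


def process_with_retry(
    task_queue: "queue.Queue", crawl: Callable[[str], bool]
) -> Dict[str, List[str]]:
    outcome: Dict[str, List[str]] = {"succeeded": [], "abandoned": []}
    while not task_queue.empty():
        site, attempt = task_queue.get()
        if crawl(site):
            outcome["succeeded"].append(site)
        elif attempt < MAX_ATTEMPTS:
            # Crawl failure: re-publish the task so that it is crawled again.
            task_queue.put((site, attempt + 1))
        else:
            # Give up after the cap and leave the task for a developer.
            outcome["abandoned"].append(site)
    return outcome
```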
In one embodiment, the method for crawling data with a distributed crawler also includes:
S8, determining whether a management command from the back-end management module has been received;
S9, if so, processing the management command with priority.
In this embodiment, the back-end management module monitors and manages the entire crawler system through web-based management. Through the back-end management module, a crawler process can be started by uploading scripts and configuration; crawler tasks that have hit a performance bottleneck can be observed and the crawler module scaled in real time; and all crawler tasks in the system can be monitored and their data analyzed.
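One possible way to give management commands priority over ordinary crawler tasks, as in steps S8 and S9, is a priority queue. The `WorkQueue` class and the two priority levels below are hypothetical; the present application does not say how priority processing is implemented.

```python
import itertools
import queue

ADMIN_PRIORITY = 0  # lower number is served first
TASK_PRIORITY = 1


class WorkQueue:
    """Hypothetical queue in which management commands overtake crawler tasks."""

    def __init__(self) -> None:
        self._queue: "queue.PriorityQueue" = queue.PriorityQueue()
        # The counter breaks ties so items of equal priority stay in FIFO order.
        self._counter = itertools.count()

    def submit_task(self, task: str) -> None:
        self._queue.put((TASK_PRIORITY, next(self._counter), task))

    def submit_admin_command(self, command: str) -> None:
        self._queue.put((ADMIN_PRIORITY, next(self._counter), command))

    def next_item(self) -> str:
        _, _, item = self._queue.get()
        return item
```

A management command submitted after several crawler tasks is still returned first by `next_item`, while the crawler tasks keep their original order among themselves.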
The method for crawling data with a distributed crawler of this embodiment is based on the distributed crawler system architecture described above. The architecture is designed around HTTP service registration: different modules are isolated from one another and access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level requirements of the whole crawler system as modular services, which reduces developers' workload, places no restriction on the development language they use, and lowers the skill required of them. The architecture improves the stability and extensibility of the crawler system and suits the development of large-scale, multi-task crawler systems. The visual back-end management module makes operation and maintenance of the whole system more reliable and efficient.
Referring to Fig. 3, an embodiment of the present application also provides a computer device, which may be the above-mentioned management server or the server corresponding to a management node, and whose internal structure may be as shown in Fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as that of the modules of the distributed crawler system architecture. The network interface of the computer device communicates with external terminals over a network connection. The computer program, when executed by the processor, implements a method for crawling data with a distributed crawler.
The processor executes the above method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the above embodiment, the method including: obtaining a crawler task using the task publishing module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawl requirement; after the crawler module obtains the crawler task, calling, in the crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service; and storing the crawled raw crawler data in a preset first storage module.
In one embodiment, the step of sending the crawler task to the crawler module includes: sending, by the task publishing module, the crawler task to the crawler module in the form of a message queue.
In one embodiment, after the step of storing the crawled raw crawler data in the preset first storage module, the method includes: cleaning the raw crawler data in the first storage module using the data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.
In one embodiment, the method for crawling data with a distributed crawler also includes: obtaining log data of the other modules in the distributed crawler system architecture using the log and error handling module, and extracting the error logs from the log data; and handling the events corresponding to the error logs according to preset rules.
In one embodiment, after the step of handling the event corresponding to the error log according to preset rules, the method includes: generating an error report for the corresponding event using the log and error handling module, and sending the error report to a preset mailbox.
In one embodiment, the step of handling the event corresponding to the error log according to preset rules includes: determining, using the log and error handling module, whether the event is a crawl failure; and if the event is a crawl failure, re-publishing the crawler task corresponding to the event.
In one embodiment, the method for crawling data with a distributed crawler also includes: determining whether a management command from the back-end management module has been received; and if so, processing the management command with priority.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied.
The computer device of this embodiment is based on the distributed crawler system architecture described above. The architecture is designed around HTTP service registration: different modules are isolated from one another and access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level requirements of the whole crawler system as modular services, which reduces developers' workload, places no restriction on the development language they use, and lowers the skill required of them. The architecture improves the stability and extensibility of the crawler system and suits the development of large-scale, multi-task crawler systems. The visual back-end management module makes operation and maintenance of the whole system more reliable and efficient.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above method for crawling data with a distributed crawler, based on the distributed crawler system architecture of the above embodiment, the method including: obtaining a crawler task using the task publishing module and sending the crawler task to the crawler module, where the crawler task includes a target website and a crawl requirement; after the crawler module obtains the crawler task, calling, in the crawler service module, the target crawler service corresponding to the crawl requirement, and using the target crawler service to crawl raw crawler data from the target website, where the crawler service module holds at least one crawler service encapsulated in the form of a service; and storing the crawled raw crawler data in a preset first storage module.
The above method for crawling data with a distributed crawler is based on the distributed crawler system architecture described above. The architecture is designed around HTTP service registration: different modules are isolated from one another and access one another through message queues. This design reduces coupling between system modules, and the asynchronous message handling of the message queue improves the system's capacity for parallel data processing, which makes it easy to scale the system horizontally when more processing capacity is needed. The crawler service module stores crawler services and encapsulates the low-level requirements of the whole crawler system as modular services, which reduces developers' workload, places no restriction on the development language they use, and lowers the skill required of them. The architecture improves the stability and extensibility of the crawler system and suits the development of large-scale, multi-task crawler systems. The visual back-end management module makes operation and maintenance of the whole system more reliable and efficient.
In one embodiment, the step of sending the crawler task to the crawler module includes: sending, by the task publishing module, the crawler task to the crawler module in the form of a message queue.
In one embodiment, after the step of storing the crawled raw crawler data in the preset first storage module, the method includes: cleaning the raw crawler data in the first storage module using the data cleaning module to obtain cleaned first crawler data, and storing the first crawler data in a preset second storage module.
In one embodiment, the method for crawling data with a distributed crawler also includes: obtaining log data of the other modules in the distributed crawler system architecture using the log and error handling module, and extracting the error logs from the log data; and handling the events corresponding to the error logs according to preset rules.
In one embodiment, after the step of handling the event corresponding to the error log according to preset rules, the method includes: generating an error report for the corresponding event using the log and error handling module, and sending the error report to a preset mailbox.
In one embodiment, the step of handling the event corresponding to the error log according to preset rules includes: determining, using the log and error handling module, whether the event is a crawl failure; and if the event is a crawl failure, re-publishing the crawler task corresponding to the event.
In one embodiment, the method for crawling data with a distributed crawler also includes: determining whether a management command from the back-end management module has been received; and if so, processing the management command with priority.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above are only preferred embodiments of the present application and do not limit the scope of its patent protection. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present application.