Specific embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprising" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing specific embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the description of the invention and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of the web data crawling method provided by an embodiment of the present invention, and Fig. 2 is a flow diagram of the web data crawling method provided by an embodiment of the present invention. The web data crawling method is applied in a first server and is executed by application software installed in the first server.
As shown in Fig. 2, the method comprises steps S111 to S116.
S111, receiving a network address distributed by a second server; the network address is a subset of the targeted website network address set that the second server received from a user terminal.
This embodiment describes the technical solution from the perspective of the first server. The first server may be a single load end or multiple load ends. A load end receives the network address distributed by the second server, crawls the web page contents according to the network address, parses them twice, and sends the twice-parsed contents to the second server for storage, so that the web page contents can be traced to their source.
After the second server has received the targeted website network address set uploaded by the user terminal, it may send one network address to a first server, or send multiple network addresses to a first server. The first server then starts the web page crawl task.
In one embodiment, before step S111 the method further includes:
Initially deploying an application container engine;
Packaging, in the application container engine, a code skeleton for crawling web page contents and parsing web page contents;
Setting the storage region in the second server corresponding to the application container engine.
In the present embodiment, the application container engine is a Docker container (a developer can package an application and its dependency packages into a portable container, and a Docker container can be regarded as a miniature system). A code skeleton for crawling web page contents and parsing web page contents can be packaged in the Docker container, and the crawling and the two parsing passes of the web page contents are carried out by the code skeleton. To distinguish the storage region in the second server corresponding to each Docker container, the second server needs to be provided with as many storage regions as there are Docker containers, and each storage region is named according to the identifier of its Docker container. In this way, the data parsed in each Docker container can be stored in the corresponding storage region of the second server.
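The naming of storage regions by container identifier can be illustrated with a minimal Python sketch. The function name `build_storage_regions`, the `region_` prefix and the sample identifiers are hypothetical and are not taken from the embodiment; the sketch only shows that one storage region is created per Docker container and that each region name is derived from the container identifier.

```python
def build_storage_regions(container_ids, prefix="region_"):
    """Map each container identifier to the name of its storage region.

    One region is created per container, so the number of regions
    equals the number of containers. The prefix is an assumption.
    """
    regions = {}
    for container_id in container_ids:
        regions[container_id] = f"{prefix}{container_id}"
    return regions


# Hypothetical container identifiers, for illustration only.
regions = build_storage_regions(["c1", "c2", "c3"])
```

With such a mapping, data parsed in a given container can be routed to the region whose name carries that container's identifier.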
S112, crawling the web page content information corresponding to the network address by the deployed code skeleton.
In the present embodiment, when the application container engine in the first server receives the network address distributed by the second server, the application container engine needs to start the code skeleton packaged in it to crawl the web page content information corresponding to the network address.
In one embodiment, before step S112 the method further includes:
Establishing, according to the network address, a connection with the target access end corresponding to the network address.
In the present embodiment, when the application container engine in the first server receives the network address distributed by the second server, the first server needs to establish, according to the network address, a connection with the target access end corresponding to the network address. Once the connection with the target access end corresponding to the network address succeeds, web page contents can be crawled from that target access end. More specifically, the network address (which can be understood as the URL of the targeted website) is written into the Docker container deployed in the first server, which means that the first server knows the URL of the target access end to which a connection is requested. After the Docker container deployed in the first server has established a connection with the target access end, web page contents can be crawled from the target access end.
S113, parsing the web page content information by the code skeleton to obtain web analysis content.
In the present embodiment, the web analysis content obtained by parsing includes at least the source code of the webpage, the URL of the targeted website, the web page crawl time, and the file directory of the webpage.
That is, the Docker container in the first server simulates a browser: it loads the URL of the targeted website at the target access end and establishes a connection with the target access end. After the source code has been acquired, it is saved as a file in the storage region of the second server corresponding to the Docker container, and the index information of the source code is simultaneously stored in the MySQL database of the second server (MySQL is an open-source relational database management system). The webpage information stored in the MySQL database of the second server also includes the URL of the targeted website, the web page crawl time, and the file directory of the webpage. The above process implements the first parsing of the web page content information, so that each piece of crawled web page content information is parsed before being sent to the second server for storage.
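The first parsing pass described above (save the source code as a file, then index it together with the URL, the crawl time and the file location) can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the MySQL database of the second server, a local directory stands in for the storage region, and the table name `page_index` and the function names are hypothetical.

```python
import hashlib
import os
import sqlite3
import tempfile
from datetime import datetime, timezone


def make_index():
    """Create the index table (SQLite used here as a stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_index (url TEXT, crawl_time TEXT, file_path TEXT)"
    )
    return conn


def store_first_parse(conn, region_dir, url, source_code):
    """Save the page source as a file in the region and index it."""
    os.makedirs(region_dir, exist_ok=True)
    crawl_time = datetime.now(timezone.utc).isoformat()
    # File name derived from the URL; the naming scheme is an assumption.
    file_name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".html"
    file_path = os.path.join(region_dir, file_name)
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write(source_code)
    conn.execute(
        "INSERT INTO page_index (url, crawl_time, file_path) VALUES (?, ?, ?)",
        (url, crawl_time, file_path),
    )
    conn.commit()
    return file_path


conn = make_index()
region_dir = tempfile.mkdtemp()  # stand-in for a storage region
saved_path = store_first_parse(
    conn, region_dir, "http://example.com/", "<html></html>"
)
```

Keeping the source in a file and only its index in the database matches the split described in the embodiment; the exact schema is left open by the text.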
S114, sending the web analysis content to the storage region in the second server corresponding to the first server for storage.
In the present embodiment, the web analysis content is stored in the storage region of the second server corresponding to the Docker container in the first server so that the crawled web page contents can later be queried in the second server, thereby enabling tracing to the source.
S115, parsing the source code in the web analysis content by the code skeleton to obtain corresponding source code parsing information.
In one embodiment, step S115 includes:
Obtaining a regular expression rule cluster, constructed in advance in the code skeleton, for identifying source code;
Obtaining, by means of the regular expression rule cluster, the segmented source code in the source code that corresponds one-to-one to each regular expression rule in the regular expression rule cluster;
Parsing each segmented source code with the parsing code corresponding to the regular expression rule that matches it, to obtain segmented source code parsing information corresponding to each segmented source code;
Combining the segmented source code parsing information to obtain the source code parsing information.
In the present embodiment, an association is first established in the code skeleton between the source code parsing codes {code1, code2, ..., codem} and the regular expression rules for the source code to be parsed; that is, a regular expression rule cluster {rule1, rule2, ..., rulem} is constructed. If a particular piece of the source code satisfies a regular expression rule, 1 is returned; otherwise 0 is returned. By identifying the source code with the regular expression rule cluster {rule1, rule2, ..., rulem}, the segmented source codes and the parsing code codei corresponding to each segmented source code are obtained. Each segmented source code is parsed with its parsing code codei to obtain the corresponding segmented source code parsing information, and the segmented source code parsing information is then combined (here, combining means concatenating the pieces of segmented source code parsing information in series, separated from one another by a separator) to obtain the source code parsing information. The regular expression rule cluster and the parsing codes together implement the secondary parsing of the source code in the web analysis content, which makes it possible to mine the attributes of the web page contents in greater depth (for example CSS files and JS files, where CSS denotes cascading style sheets and a JS file is a front-end script file of the webpage).
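The secondary parsing can be sketched in Python as a list of (rule, parsing code) pairs. The two rules below, which pick out CSS and JS file references, and the `|` separator are assumptions chosen for illustration; the embodiment does not specify the actual rules, parsing codes or separator.

```python
import re

# Rule cluster: each regular expression rule is paired with the parsing
# code applied to the segments it matches. Both rules are hypothetical.
RULE_CLUSTER = [
    (re.compile(r'<link[^>]+href="([^"]+\.css)"', re.I),
     lambda m: "css=" + m.group(1)),
    (re.compile(r'<script[^>]+src="([^"]+\.js)"', re.I),
     lambda m: "js=" + m.group(1)),
]


def rule_matches(rule, source):
    """Return 1 if the source satisfies the rule, otherwise 0."""
    return 1 if rule.search(source) else 0


def parse_source(source, cluster=RULE_CLUSTER, separator="|"):
    """Parse each matched segment and join the results in series."""
    parts = []
    for rule, parsing_code in cluster:
        if rule_matches(rule, source) == 0:
            continue
        for segment in rule.finditer(source):
            parts.append(parsing_code(segment))
    return separator.join(parts)


page = '<link rel="stylesheet" href="a.css"><script src="b.js"></script>'
info = parse_source(page)
```

Pairing each rule with its own parsing code keeps the one-to-one association between rulei and codei, and new attributes can be mined by appending further pairs to the cluster.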
S116, sending the source code parsing information to the storage region in the second server corresponding to the first server for storage.
In the present embodiment, the source code parsing information is stored in the storage region of the second server corresponding to the Docker container in the first server so that the source code parsing information of the crawled web page contents can later be queried in the second server, thereby enabling tracing to the source.
An embodiment of the present invention also provides another web data crawling method. Please refer to Fig. 1 and Fig. 3; Fig. 3 is another flow diagram of the web data crawling method provided by an embodiment of the present invention. This web data crawling method is applied in the second server and is executed by application software installed in the second server.
As shown in Fig. 3, the method comprises steps S121 to S125.
S121, receiving the targeted website network address set to be crawled that is sent by a user terminal.
This embodiment describes the technical solution from the perspective of the second server. The second server can be regarded as a primary server, which distributes the network addresses to be crawled to the first servers and divides multiple storage spaces for storing the web analysis content and the source code parsing information sent by the first servers.
In one embodiment, step S121 includes:
Receiving, through a remote dictionary service database, the targeted website network address set to be crawled that is sent by the user terminal.
The remote dictionary service database is a Redis database (the full name of Redis is Remote Dictionary Server; Redis is a key-value storage system that supports rich data types). The Redis database receives the targeted website network address set to be crawled that is sent by the user terminal, and subsets of the targeted website network address set are distributed to the first servers.
S122, distributing each network address in the targeted website network address set to a corresponding first server.
In the present embodiment, when distributing the network addresses in the targeted website network address set, the second server may distribute one network address to a given first server, or distribute multiple network addresses to the same first server.
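One possible way to distribute the network addresses is sketched below. The round-robin policy, the function name `distribute` and the sample values are assumptions; the embodiment only states that a first server may receive one network address or several.

```python
def distribute(urls, servers):
    """Assign each URL to a first server in round-robin order.

    Round-robin is an illustrative choice, not the policy of the
    embodiment. With more URLs than servers, a server receives
    several URLs; otherwise each server receives at most one.
    """
    if not servers:
        raise ValueError("at least one first server is required")
    assignment = {server: [] for server in servers}
    for index, url in enumerate(urls):
        assignment[servers[index % len(servers)]].append(url)
    return assignment


# Hypothetical URLs and server names, for illustration only.
plan = distribute(["u1", "u2", "u3"], ["s1", "s2"])
```

Each list in the resulting mapping is the subset of the targeted website network address set sent to that first server.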
S123, receiving the web analysis content sent by the first server; the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address.
In the present embodiment, when the first server has completed the parsing of the crawled web page content information and obtained web analysis content that includes the source code of the webpage, the URL of the targeted website, the web page crawl time, the file directory of the webpage and similar information, the web analysis content sent by the first server is stored in the data table corresponding to the first server in the MySQL database of the second server. In other words, the data table corresponding to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the web analysis content in the second server facilitates subsequent source-tracing retrieval.
S124, receiving the source code parsing information sent by the first server; the source code parsing information is obtained by the first server parsing the source code of the web analysis content.
In the present embodiment, when the first server has completed the parsing of the source code of the web analysis content and obtained the source code parsing information, the source code parsing information sent by the first server is stored in the data table corresponding to the first server in the MySQL database of the second server. In other words, the data table corresponding to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the source code parsing information in the second server facilitates subsequent source-tracing retrieval. The first parsing content (the web analysis content) and the second parsing content (the source code parsing information) of the web page content information crawled from the same network address are stored in the same data table of the MySQL database, so that the information obtained after the same network address is crawled and parsed by the first server is stored in the same region.
S125, receiving a search condition, and obtaining a corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In the present embodiment, even if the website at one or more of the URLs in the targeted website network address set (which can be understood as multiple targeted website URLs to be crawled) is revised, the historical data of those URLs has already been saved in the storage regions of the second server. Therefore, when a Docker container is started to crawl the revised targeted website URL and the crawl fails, the targeted website URL triggers a search instruction to the second server, and the multiple storage regions are searched with the targeted website URL as the search condition. After the source code is obtained from the storage region corresponding to the targeted website URL, secondary parsing can be performed on it, so that the original web page files can be quickly recovered, which provides a channel for tracing back.
The method saves the crawled web page contents so that the data can be traced to its source, and can also perform secondary parsing on the web page contents.
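The retrieval across storage regions with a URL as the search condition can be sketched as follows. The in-memory dictionary standing in for the storage regions, the record fields and the function name `search_regions` are hypothetical; in the embodiment the regions are data tables in the MySQL database of the second server.

```python
def search_regions(regions, url):
    """Return (region name, record) pairs whose URL equals the condition."""
    hits = []
    for region_name, records in regions.items():
        for record in records:
            if record["url"] == url:
                hits.append((region_name, record))
    return hits


# Hypothetical stored history, for illustration only.
stored = {
    "region_c1": [{"url": "http://a.example/", "source": "<html>A</html>"}],
    "region_c2": [{"url": "http://b.example/", "source": "<html>B</html>"}],
}
found = search_regions(stored, "http://b.example/")
```

The source code returned in a hit can then be fed into the secondary parsing step to recover the original web page files.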
An embodiment of the present invention also provides a web data crawling device for executing any embodiment of the aforementioned web data crawling method. An embodiment of the present invention also provides a web data crawling system, which includes a first server and a second server. Specifically, please refer to Fig. 4, which is a schematic block diagram of the web data crawling device provided by an embodiment of the present invention. The web data crawling device 100 can be configured in the first server.
As shown in Fig. 4, the web data crawling device 100 includes a network address receiving unit 111, a web page content crawling unit 112, a first resolution unit 113, a first transmission unit 114, a second resolution unit 115, and a second transmission unit 116.
The network address receiving unit 111 is configured to receive a network address distributed by the second server; the network address is a subset of the targeted website network address set that the second server received from a user terminal.
This embodiment describes the technical solution from the perspective of the first server. The first server may be a single load end or multiple load ends. A load end receives the network address distributed by the second server, crawls the web page contents according to the network address, parses them twice, and sends the twice-parsed contents to the second server for storage, so that the web page contents can be traced to their source.
After the second server has received the targeted website network address set uploaded by the user terminal, it may send one network address to a first server, or send multiple network addresses to a first server. The first server then starts the web page crawl task.
In one embodiment, the web data crawling device 100 further includes:
A container deployment unit, configured to initially deploy an application container engine;
A code skeleton deployment unit, configured to package, in the application container engine, a code skeleton for crawling web page contents and parsing web page contents;
A storage region setting unit, configured to set the storage region in the second server corresponding to the application container engine.
In the present embodiment, the application container engine is a Docker container (a developer can package an application and its dependency packages into a portable container, and a Docker container can be regarded as a miniature system). A code skeleton for crawling web page contents and parsing web page contents can be packaged in the Docker container, and the crawling and the two parsing passes of the web page contents are carried out by the code skeleton. To distinguish the storage region in the second server corresponding to each Docker container, the second server needs to be provided with as many storage regions as there are Docker containers, and each storage region is named according to the identifier of its Docker container. In this way, the data parsed in each Docker container can be stored in the corresponding storage region of the second server.
The web page content crawling unit 112 is configured to crawl the web page content information corresponding to the network address by the deployed code skeleton.
In the present embodiment, when the application container engine in the first server receives the network address distributed by the second server, the application container engine needs to start the code skeleton packaged in it to crawl the web page content information corresponding to the network address.
In one embodiment, the web data crawling device 100 further includes:
A connection establishment unit, configured to establish, according to the network address, a connection with the target access end corresponding to the network address.
In the present embodiment, when the application container engine in the first server receives the network address distributed by the second server, the first server needs to establish, according to the network address, a connection with the target access end corresponding to the network address. Once the connection with the target access end corresponding to the network address succeeds, web page contents can be crawled from that target access end. More specifically, the network address (which can be understood as the URL of the targeted website) is written into the Docker container deployed in the first server, which means that the first server knows the URL of the target access end to which a connection is requested. After the Docker container deployed in the first server has established a connection with the target access end, web page contents can be crawled from the target access end.
The first resolution unit 113 is configured to parse the web page content information by the code skeleton to obtain web analysis content.
In the present embodiment, the web analysis content obtained by parsing includes at least the source code of the webpage, the URL of the targeted website, the web page crawl time, and the file directory of the webpage.
That is, the Docker container in the first server simulates a browser: it loads the URL of the targeted website at the target access end and establishes a connection with the target access end. After the source code has been acquired, it is saved as a file in the storage region of the second server corresponding to the Docker container, and the index information of the source code is simultaneously stored in the MySQL database of the second server (MySQL is an open-source relational database management system). The webpage information stored in the MySQL database of the second server also includes the URL of the targeted website, the web page crawl time, and the file directory of the webpage. The above process implements the first parsing of the web page content information, so that each piece of crawled web page content information is parsed before being sent to the second server for storage.
The first transmission unit 114 is configured to send the web analysis content to the storage region in the second server corresponding to the first server for storage.
In the present embodiment, the web analysis content is stored in the storage region of the second server corresponding to the Docker container in the first server so that the crawled web page contents can later be queried in the second server, thereby enabling tracing to the source.
The second resolution unit 115 is configured to parse the source code in the web analysis content by the code skeleton to obtain corresponding source code parsing information.
In one embodiment, the second resolution unit 115 includes:
A rule cluster acquiring unit, configured to obtain a regular expression rule cluster, constructed in advance in the code skeleton, for identifying source code;
A segmented source code acquiring unit, configured to obtain, by means of the regular expression rule cluster, the segmented source code in the source code that corresponds one-to-one to each regular expression rule in the regular expression rule cluster;
A segmented source code resolution unit, configured to parse each segmented source code with the parsing code corresponding to the regular expression rule that matches it, to obtain segmented source code parsing information corresponding to each segmented source code;
An information combination unit, configured to combine the segmented source code parsing information to obtain the source code parsing information.
In the present embodiment, an association is first established in the code skeleton between the source code parsing codes {code1, code2, ..., codem} and the regular expression rules for the source code to be parsed; that is, a regular expression rule cluster {rule1, rule2, ..., rulem} is constructed. If a particular piece of the source code satisfies a regular expression rule, 1 is returned; otherwise 0 is returned. By identifying the source code with the regular expression rule cluster {rule1, rule2, ..., rulem}, the segmented source codes and the parsing code codei corresponding to each segmented source code are obtained. Each segmented source code is parsed with its parsing code codei to obtain the corresponding segmented source code parsing information, and the segmented source code parsing information is then combined (here, combining means concatenating the pieces of segmented source code parsing information in series, separated from one another by a separator) to obtain the source code parsing information. The regular expression rule cluster and the parsing codes together implement the secondary parsing of the source code in the web analysis content, which makes it possible to mine the attributes of the web page contents in greater depth (for example CSS files and JS files, where CSS denotes cascading style sheets and a JS file is a front-end script file of the webpage).
The second transmission unit 116 is configured to send the source code parsing information to the storage region in the second server corresponding to the first server for storage.
In the present embodiment, the source code parsing information is stored in the storage region of the second server corresponding to the Docker container in the first server so that the source code parsing information of the crawled web page contents can later be queried in the second server, thereby enabling tracing to the source.
An embodiment of the present invention also provides a web data crawling device for executing any embodiment of the aforementioned web data crawling method. Specifically, please refer to Fig. 5, which is another schematic block diagram of the web data crawling device provided by an embodiment of the present invention. The web data crawling device 100 can be configured in the second server.
As shown in Fig. 5, the web data crawling device 100 includes a network address set receiving unit 121, a network address distribution unit 122, a first storage unit 123, a second storage unit 124, and a retrieval unit 125.
The network address set receiving unit 121 is configured to receive the targeted website network address set to be crawled that is sent by a user terminal.
This embodiment describes the technical solution from the perspective of the second server. The second server can be regarded as a primary server, which distributes the network addresses to be crawled to the first servers and divides multiple storage spaces for storing the web analysis content and the source code parsing information sent by the first servers.
In one embodiment, the network address set receiving unit 121 is specifically configured to:
Receive, through a remote dictionary service database, the targeted website network address set to be crawled that is sent by the user terminal.
The remote dictionary service database is a Redis database (the full name of Redis is Remote Dictionary Server; Redis is a key-value storage system that supports rich data types). The Redis database receives the targeted website network address set to be crawled that is sent by the user terminal, and subsets of the targeted website network address set are distributed to the first servers.
The network address distribution unit 122 is configured to distribute each network address in the targeted website network address set to a corresponding first server.
In the present embodiment, when distributing the network addresses in the targeted website network address set, the second server may distribute one network address to a given first server, or distribute multiple network addresses to the same first server.
The first storage unit 123 is configured to receive the web analysis content sent by the first server; the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address.
In the present embodiment, when the first server has completed the parsing of the crawled web page content information and obtained web analysis content that includes the source code of the webpage, the URL of the targeted website, the web page crawl time, the file directory of the webpage and similar information, the web analysis content sent by the first server is stored in the data table corresponding to the first server in the MySQL database of the second server. In other words, the data table corresponding to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the web analysis content in the second server facilitates subsequent source-tracing retrieval.
The second storage unit 124 is configured to receive the source code parsing information sent by the first server; the source code parsing information is obtained by the first server parsing the source code of the web analysis content.
In the present embodiment, when the first server has completed the parsing of the source code of the web analysis content and obtained the source code parsing information, the source code parsing information sent by the first server is stored in the data table corresponding to the first server in the MySQL database of the second server. In other words, the data table corresponding to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the source code parsing information in the second server facilitates subsequent source-tracing retrieval. The first parsing content (the web analysis content) and the second parsing content (the source code parsing information) of the web page content information crawled from the same network address are stored in the same data table of the MySQL database, so that the information obtained after the same network address is crawled and parsed by the first server is stored in the same region.
The retrieval unit 125 is configured to receive a search condition and obtain a corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In the present embodiment, even if the website at one or more of the URLs in the targeted website network address set (which can be understood as multiple targeted website URLs to be crawled) is revised, the historical data of those URLs has already been saved in the storage regions of the second server. Therefore, when a Docker container is started to crawl the revised targeted website URL and the crawl fails, the targeted website URL triggers a search instruction to the second server, and the multiple storage regions are searched with the targeted website URL as the search condition. After the source code is obtained from the storage region corresponding to the targeted website URL, secondary parsing can be performed on it, so that the original web page files can be quickly recovered, which provides a channel for tracing back.
The device saves the crawled web page contents so that the data can be traced to its source, and can also perform secondary parsing on the web page contents.
The above web data crawling device can be implemented in the form of a computer program, which can run on a computer device as shown in Fig. 6.
Please refer to Fig. 6, which is a schematic block diagram of the computer device provided by an embodiment of the present invention. The computer device 500 is a server, which may be an independent server or a server cluster composed of multiple servers.
Referring to Fig. 6, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 can cause the processor 502 to execute the web data crawling method.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the web data crawling method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will understand that the structure shown in Fig. 6 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the computer device 500 to which the solution of the present invention is applied; a specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: receiving a URL distributed by the second server, where the URL belongs to a subset of the target website URL set that the second server received from a user terminal; crawling the web page content information corresponding to the URL through the deployed code framework; parsing the web page content information through the code framework to obtain parsed web page content; sending the parsed web page content to the second server, to be stored in the storage region corresponding to the first server; parsing the source code in the parsed web page content through the code framework to obtain the corresponding source code parsing information; and sending the source code parsing information to the second server, to be stored in the storage region corresponding to the first server.
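The sequence of functions above can be illustrated with the following minimal sketch. It assumes that crawling, page parsing, and source parsing are supplied as plain callables and that the second server is modeled by an in-memory store; `SecondServerStore`, `process_url`, and the record layout are hypothetical names chosen for illustration, not part of the embodiment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SecondServerStore:
    """Hypothetical stand-in for the second server: one region per first server."""
    regions: Dict[str, List[dict]] = field(default_factory=dict)

    def save(self, region: str, record: dict) -> None:
        self.regions.setdefault(region, []).append(record)


def process_url(
    url: str,
    fetch: Callable[[str], str],
    parse_page: Callable[[str], dict],
    parse_source: Callable[[str], dict],
    store: SecondServerStore,
    region: str,
) -> None:
    """Crawl one distributed URL, parse it, and store both results."""
    raw_page = fetch(url)                      # crawl the web page content
    parsed = parse_page(raw_page)              # parsed web page content
    store.save(region, {"url": url, "kind": "parsed_page", "data": parsed})

    # Assumption: the parsed content carries the page source under "source".
    source_info = parse_source(parsed["source"])
    store.save(region, {"url": url, "kind": "source_info", "data": source_info})
```

Passing the processing steps in as callables keeps the sketch independent of any particular crawling library.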
In one embodiment, before executing the step of receiving the URL distributed by the second server, the processor 502 also performs the following operations: initializing and deploying an application container engine; packaging, in the application container engine, the code framework used for crawling and parsing web page content; and setting the storage region in the second server that corresponds to the application container engine.
In one embodiment, before executing the step of crawling the web page content information corresponding to the URL through the deployed code framework, the processor 502 also performs the following operation: establishing, according to the URL, a connection with the target access end corresponding to the URL.
In one embodiment, when executing the step of parsing the source code in the parsed web page content through the code framework to obtain the corresponding source code parsing information, the processor 502 performs the following operations: obtaining the regular expression rule cluster, built in advance in the code framework, that is used to identify source code; obtaining, by means of the regular expression rule cluster, the source code segments in the source code that correspond one-to-one to the regular expression rules in the cluster; parsing each source code segment with the regular expression rule corresponding to that segment, to obtain the segment parsing information corresponding to each source code segment; and combining the segment parsing information to obtain the source code parsing information.
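The segment-then-parse procedure above can be sketched with Python's standard `re` module. The rule cluster shown here (`RULE_CLUSTER`, with `title` and `links` rules) and the function name `parse_source` are hypothetical examples; the embodiment does not specify which rules the cluster contains.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical rule cluster: each named rule identifies one kind of segment
# and captures the value of interest in group 1.
RULE_CLUSTER: Dict[str, Pattern[str]] = {
    "title": re.compile(r"<title>(.*?)</title>", re.S),
    "links": re.compile(r'<a\s+[^>]*href="([^"]*)"', re.S),
}


def parse_source(source: str, rules: Dict[str, Pattern[str]]) -> Dict[str, List[str]]:
    """Split the source into per-rule segments, parse each, and combine."""
    combined: Dict[str, List[str]] = {}
    for name, rule in rules.items():
        # Step 1: collect the source segments that this rule identifies.
        segments = [m.group(0) for m in rule.finditer(source)]
        # Step 2: parse each segment with the rule that produced it.
        parsed = []
        for segment in segments:
            match = rule.match(segment)
            if match is not None:
                parsed.append(match.group(1))
        # Step 3: combine the per-segment results under the rule's name.
        combined[name] = parsed
    return combined
```

For a page whose source contains one `<title>` element and two anchor tags, the result holds one entry under `title` and two under `links`.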
Alternatively, the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: receiving the target website URL set to be crawled that is sent by a user terminal; distributing each URL in the target website URL set to a corresponding first server; receiving the parsed web page content sent by the first server, where the parsed web page content is obtained by the first server crawling and parsing the web page content information corresponding to the URL; receiving the source code parsing information sent by the first server, where the source code parsing information is obtained by the first server parsing the source code of the parsed web page content; and receiving a search condition, and obtaining the corresponding search result from the parsed web page content and the source code parsing information according to the search condition.
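The distribution and search functions above can be sketched as follows. The embodiment does not state how URLs are assigned to first servers, so round-robin assignment is used here purely as an assumption; the function names, the record fields `url` and `content`, and the substring-based search are likewise hypothetical.

```python
from typing import Dict, List


def distribute_urls(urls: List[str], servers: List[str]) -> Dict[str, List[str]]:
    """Assign each URL to a first server (round-robin is an assumption)."""
    if not servers:
        raise ValueError("at least one first server is required")
    assignment: Dict[str, List[str]] = {server: [] for server in servers}
    for index, url in enumerate(urls):
        assignment[servers[index % len(servers)]].append(url)
    return assignment


def search_records(records: List[dict], condition: str) -> List[dict]:
    """Return stored records whose URL or content contains the search condition."""
    return [
        record
        for record in records
        if condition in record["url"] or condition in record["content"]
    ]
```

Each first server then receives only its own subset of the URL set, and search results are drawn from whatever records the first servers have sent back.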
In one embodiment, when executing the step of receiving the target website URL set to be crawled that is sent by the user terminal, the processor 502 performs the following operation: receiving, through a remote data transmission database, the target website URL set to be crawled that is sent by the user terminal.
Those skilled in the art will understand that the embodiment of the computer device shown in Fig. 6 does not limit the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor. In such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in Fig. 6 and are not repeated here.
It should be understood that, in the embodiments of the present invention, the processor 502 may be a central processing unit (Central Processing Unit, CPU). The processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Another embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps: receiving a URL distributed by the second server, where the URL belongs to a subset of the target website URL set that the second server received from a user terminal; crawling the web page content information corresponding to the URL through the deployed code framework; parsing the web page content information through the code framework to obtain parsed web page content; sending the parsed web page content to the second server, to be stored in the storage region corresponding to the first server; parsing the source code in the parsed web page content through the code framework to obtain the corresponding source code parsing information; and sending the source code parsing information to the second server, to be stored in the storage region corresponding to the first server.
In one embodiment, before receiving the URL distributed by the second server, the steps further include: initializing and deploying an application container engine; packaging, in the application container engine, the code framework used for crawling and parsing web page content; and setting the storage region in the second server that corresponds to the application container engine.
In one embodiment, before crawling the web page content information corresponding to the URL through the deployed code framework, the steps further include: establishing, according to the URL, a connection with the target access end corresponding to the URL.
In one embodiment, parsing the source code in the parsed web page content through the code framework to obtain the corresponding source code parsing information includes: obtaining the regular expression rule cluster, built in advance in the code framework, that is used to identify source code; obtaining, by means of the regular expression rule cluster, the source code segments in the source code that correspond one-to-one to the regular expression rules in the cluster; parsing each source code segment with the regular expression rule corresponding to that segment, to obtain the segment parsing information corresponding to each source code segment; and combining the segment parsing information to obtain the source code parsing information.
Another embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps: receiving the target website URL set to be crawled that is sent by a user terminal; distributing each URL in the target website URL set to a corresponding first server; receiving the parsed web page content sent by the first server, where the parsed web page content is obtained by the first server crawling and parsing the web page content information corresponding to the URL; receiving the source code parsing information sent by the first server, where the source code parsing information is obtained by the first server parsing the source code of the parsed web page content; and receiving a search condition, and obtaining the corresponding search result from the parsed web page content and the source code parsing information according to the search condition.
In one embodiment, receiving the target website URL set to be crawled that is sent by the user terminal includes: receiving, through a remote data transmission database, the target website URL set to be crawled that is sent by the user terminal.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the equipment, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed equipment, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division of the units is only a division by logical function, and other divisions are possible in actual implementation: units with the same function may be grouped into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may also be an electrical, mechanical, or other form of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, or an optical disc.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.