CN109885744A - Web data crawling method, device, system, computer equipment and storage medium - Google Patents

Web data crawling method, device, system, computer equipment and storage medium

Info

Publication number
CN109885744A
Authority
CN
China
Prior art keywords
server
source code
web
network address
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910012240.6A
Other languages
Chinese (zh)
Other versions
CN109885744B (en)
Inventor
吴壮伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910012240.6A
Publication of CN109885744A
Application granted
Publication of CN109885744B
Active
Anticipated expiration


Abstract

The invention discloses a web data crawling method, device, system, computer equipment and storage medium. The method comprises: receiving a network address distributed by a second server; crawling the web page content information corresponding to the network address through a deployed code skeleton; parsing the web page content information through the code skeleton to obtain web analysis content; sending the web analysis content to the storage region in the second server that corresponds to the first server for storage; parsing the source code in the web analysis content through the code skeleton to obtain corresponding source code parsing information; and sending the source code parsing information to the storage region in the second server that corresponds to the first server for storage. The method saves the crawled web page content so that the data can be traced back to its source, and also allows the web page content to be parsed a second time.

Description

Web data crawling method, device, system, computer equipment and storage medium
Technical field
The present invention relates to the technical field of data crawling, and in particular to a web data crawling method, device, system, computer equipment and storage medium.
Background art
At present, crawler systems perform directed crawling of specified content. When a website is revised, or when the position from which data is grabbed is wrong, the crawl has to be repeated, which makes the post-processing cost of web page content relatively high.
Summary of the invention
Embodiments of the present invention provide a web data crawling method, device, system, computer equipment and storage medium. They are intended to solve the problem that prior-art crawler systems perform directed crawling of specified content, so that when a website is revised or the data grabbing position is wrong, the content must be crawled again and cannot be traced back to its source.
In a first aspect, an embodiment of the present invention provides a web data crawling method, applied to a first server, comprising:
receiving a network address distributed by a second server, the network address being a subset of the target website network address set that the second server received from a user terminal;
crawling the web page content information corresponding to the network address through a deployed code skeleton;
parsing the web page content information through the code skeleton to obtain web analysis content;
sending the web analysis content to the storage region in the second server that corresponds to the first server for storage;
parsing the source code in the web analysis content through the code skeleton to obtain corresponding source code parsing information; and
sending the source code parsing information to the storage region in the second server that corresponds to the first server for storage.
In a second aspect, an embodiment of the present invention further provides a web data crawling method, applied to a second server, comprising:
receiving a target website network address set to be crawled, sent by a user terminal;
distributing each network address in the target website network address set to a corresponding first server;
receiving the web analysis content sent by the first server, the web analysis content being obtained by the first server crawling and parsing the web page content information corresponding to the network address;
receiving the source code parsing information sent by the first server, the source code parsing information being obtained by the first server parsing the source code of the web analysis content; and
receiving a search condition, and obtaining the corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In a third aspect, an embodiment of the present invention further provides a web data crawling device, which comprises units for executing the web data crawling method of the first aspect, or units for executing the web data crawling method of the second aspect.
In a fourth aspect, an embodiment of the present invention further provides a web data crawling system comprising a first server and a second server, the first server being configured to execute the web data crawling method of the first aspect and the second server being configured to execute the web data crawling method of the second aspect.
In a fifth aspect, an embodiment of the present invention further provides computer equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the web data crawling method of the first aspect or the web data crawling method of the second aspect.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the web data crawling method of the first aspect or the web data crawling method of the second aspect.
Embodiments of the present invention provide a web data crawling method, device, system, computer equipment and storage medium. The method comprises: receiving a network address distributed by a second server; crawling the web page content information corresponding to the network address through a deployed code skeleton; parsing the web page content information through the code skeleton to obtain web analysis content; sending the web analysis content to the storage region in the second server that corresponds to the first server for storage; parsing the source code in the web analysis content through the code skeleton to obtain corresponding source code parsing information; and sending the source code parsing information to the storage region in the second server that corresponds to the first server for storage. The method saves the crawled web page content so that the data can be traced back to its source, and also allows the web page content to be parsed a second time.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the web data crawling system provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the web data crawling method provided by an embodiment of the present invention;
Fig. 3 is another flow diagram of the web data crawling method provided by an embodiment of the present invention;
Fig. 4 is a schematic block diagram of the web data crawling device provided by an embodiment of the present invention;
Fig. 5 is another schematic block diagram of the web data crawling device provided by an embodiment of the present invention;
Fig. 6 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprises" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a schematic diagram of an application scenario of the web data crawling method provided by an embodiment of the present invention, and Fig. 2 is a flow diagram of that method. The web data crawling method is applied to a first server and is executed by application software installed on the first server.
As shown in Fig. 2, the method comprises steps S111 to S116.
S111: Receive a network address distributed by the second server; the network address is a subset of the target website network address set that the second server received from the user terminal.
In this embodiment, the technical solution is described from the perspective of the first server. The first server may be a single load end or multiple load ends. A load end receives the network addresses distributed by the second server, crawls the web page content according to each network address, parses it twice, and sends the content obtained from both parses to the second server for storage, so that the web page content can be traced back to its source.
After receiving the target website network address set uploaded by the user terminal, the second server may send one network address to a first server, or may send multiple network addresses to a first server. The first server then starts the web page crawling task.
In one embodiment, the following steps precede step S111:
initially deploying an application container engine;
packaging, in the application container engine, the code skeleton used for crawling web page content and parsing web page content;
setting the storage region in the second server that corresponds to the application container engine.
In this embodiment, the application container engine is Docker, referred to below as a Docker container (a developer can package an application and its dependencies into a portable container; a Docker container can be regarded as a miniature system). The code skeleton for crawling and parsing web page content can be packaged in the Docker container, and the code skeleton performs the crawling and the two parses of the web page content. To distinguish the storage region in the second server that corresponds to each Docker container, the second server needs to set up as many storage regions as there are Docker containers and name each storage region after the identifier of its Docker container. In this way, the data parsed in each Docker container can be stored in the corresponding storage region of the second server.
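The naming scheme for storage regions can be illustrated with a minimal Python sketch. The root path `/data/crawl`, the `region_` prefix and the container identifiers are hypothetical; the text only states that there is one storage region per Docker container, named after the container's identifier.

```python
from pathlib import PurePosixPath


def storage_regions(container_ids, root="/data/crawl"):
    """Map each container identifier to its own storage region path.

    One region is created per container, and the region name is derived
    from the container identifier (the prefix is an assumption).
    """
    return {cid: PurePosixPath(root) / f"region_{cid}" for cid in container_ids}


# Hypothetical container identifiers, for illustration only.
regions = storage_regions(["c1", "c2", "c3"])
```

Because the region name is derived from the identifier, the second server can route any incoming result to its region using only the identifier of the sending container.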
S112: Crawl the web page content information corresponding to the network address through the deployed code skeleton.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the application container engine needs to start the code skeleton packaged in it to crawl the web page content information corresponding to the network address.
In one embodiment, the following step precedes step S112:
establishing, according to the network address, a connection with the target access end corresponding to the network address.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the first server needs to establish a connection with the target access end corresponding to the network address. Once the connection with the target access end succeeds, web page content can be crawled from that target access end. More specifically, the network address (which can be understood as the URL of the target website) is written into the Docker container deployed in the first server, which means the first server knows the URL of the target access end it is requesting a connection to. After the Docker container deployed in the first server establishes a connection with the target access end, web page content can be crawled from the target access end.
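As a rough illustration of connecting to the target access end and crawling its content, the following Python sketch uses the standard library's `urllib.request`. The function name `crawl`, the `User-Agent` value and the injectable `opener` parameter are assumptions made for this example; the patent does not say which HTTP client the code skeleton uses.

```python
from urllib.request import Request, urlopen


def crawl(url, opener=urlopen, timeout=10):
    """Connect to the target access end and return the page text.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    request = Request(url, headers={"User-Agent": "example-crawler"})
    with opener(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A failed connection raises an exception from `urlopen`, which a caller could treat as the crawl failure that triggers retrieval of the stored historical data.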
S113: Parse the web page content information through the code skeleton to obtain web analysis content.
In this embodiment, the web analysis content obtained by parsing includes at least the source code of the web page, the URL of the target website, the crawl time of the web page, and the file directory of the web page.
That is, the Docker container in the first server works as a simulated browser: it loads the target website URL of the target access end and establishes a connection with the target access end. After the source code has been acquired, the source code is saved as a file in the storage region of the second server that corresponds to the Docker container, and the index information of the source code is stored at the same time in the MySQL database of the second server (MySQL is an open-source relational database management system). The web page information stored in the MySQL database of the second server also includes the target website URL, the web page crawl time and the file directory of the web page. This process is the first parse of the web page content information: each piece of crawled web page content information is parsed and then sent to the second server for storage.
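The first parse can be sketched as follows, under stated assumptions: `sqlite3` stands in for the MySQL database so that the example is self-contained, a temporary directory stands in for the storage region, and the table name `page_index`, its columns and the file name `page.html` are hypothetical.

```python
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def first_parse(url, html, region_dir, db):
    """Save the page source as a file and index it in the database."""
    crawl_time = datetime.now(timezone.utc).isoformat()
    host = urlparse(url).netloc or "unknown"
    file_dir = Path(region_dir) / host          # file directory of the page
    file_dir.mkdir(parents=True, exist_ok=True)
    source_file = file_dir / "page.html"        # source code stored as a file
    source_file.write_text(html, encoding="utf-8")
    db.execute(
        "INSERT INTO page_index (url, crawl_time, file_dir, source_file) "
        "VALUES (?, ?, ?, ?)",
        (url, crawl_time, str(file_dir), str(source_file)),
    )
    db.commit()
    return {"url": url, "crawl_time": crawl_time,
            "file_dir": str(file_dir), "source": html}


# sqlite3 is a stand-in for MySQL; the schema is an assumption.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_index "
           "(url TEXT, crawl_time TEXT, file_dir TEXT, source_file TEXT)")
region = tempfile.mkdtemp()
record = first_parse("https://example.com/a",
                     "<html><body>hi</body></html>", region, db)
```

Keeping the source in a file and only the index in the database mirrors the split described above, where the database holds the URL, crawl time and file directory.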
S114: Send the web analysis content to the storage region in the second server that corresponds to the first server for storage.
In this embodiment, the web analysis content is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the crawled web page content can later be queried in the second server and thereby traced back to its source.
S115: Parse the source code in the web analysis content through the code skeleton to obtain corresponding source code parsing information.
In one embodiment, step S115 comprises:
obtaining the regular expression rule cluster constructed in advance in the code skeleton for identifying source code;
obtaining, through the regular expression rule cluster, the segmented source code in the source code that corresponds one-to-one with each regular expression rule in the rule cluster;
parsing each segmented source code with the parsing code that corresponds to its regular expression rule, to obtain segmented source code parsing information corresponding to each segmented source code;
combining the pieces of segmented source code parsing information to obtain the source code parsing information.
In this embodiment, an association is first established in the code skeleton between the source code parsing codes {code1, code2, ..., codem} and the regular expression rules for the source code to be parsed; that is, a regular expression rule cluster {rule1, rule2, ..., rulem} is constructed. If the content of the source code satisfies a regular expression rule, 1 is returned; otherwise 0 is returned. Identifying the source code with the rule cluster {rule1, rule2, ..., rulem} yields the segmented source code and the parsing code codei that corresponds to each segment. Each segment is parsed with its codei to obtain the corresponding segmented source code parsing information, and the pieces are combined (here, concatenated in series with a separator between them) to obtain the source code parsing information. Through the regular expression rule cluster and the parsing codes, a second parse of the source code in the web analysis content is achieved, which digs deeper into the attributes of the web page content (for example CSS files and JS files, where CSS stands for Cascading Style Sheets and a JS file is a front-end script file of the web page).
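The rule cluster and its paired parsing codes can be sketched in Python as below. The two rules (one for CSS links, one for JS scripts), the `css=` and `js=` output format and the `|` separator are hypothetical choices for illustration; the patent specifies only that each rule has a matching parser, that a match returns 1 and a miss returns 0, and that the parsed pieces are joined with a separator.

```python
import re

# Hypothetical rule cluster: each rule_i is paired with one parser code_i.
RULES = [
    (re.compile(r'<link[^>]+href="([^"]+\.css)"', re.I),
     lambda m: "css=" + m.group(1)),
    (re.compile(r'<script[^>]+src="([^"]+\.js)"', re.I),
     lambda m: "js=" + m.group(1)),
]


def matches(rule, source):
    """Return 1 if the source satisfies the rule, otherwise 0."""
    return 1 if rule.search(source) else 0


def second_parse(source, separator="|"):
    """Apply each rule, parse its matched segments, and join the results."""
    parts = []
    for rule, parser in RULES:
        if matches(rule, source):
            # Each match is one segment; its paired parser extracts the info.
            parts.extend(parser(m) for m in rule.finditer(source))
    return separator.join(parts)


html = '<link rel="stylesheet" href="/s/main.css"><script src="/j/app.js"></script>'
info = second_parse(html)  # "css=/s/main.css|js=/j/app.js"
```

Adding support for a new attribute then amounts to appending one more (rule, parser) pair to the cluster, without touching the crawl step.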
S116: Send the source code parsing information to the storage region in the second server that corresponds to the first server for storage.
In this embodiment, the source code parsing information is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the source code parsing information of the crawled web page content can later be queried in the second server and thereby traced back to its source.
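Steps S112 to S116 can be tied together in a simplified Python sketch. The `fetch` and `store` callables are hypothetical stand-ins for the crawl and for the transfer to the second server, the record omits the crawl time and file directory for brevity, and the single regular expression is only a placeholder for the rule cluster.

```python
import re


def run_first_server(url, fetch, store):
    """Run steps S112 to S116 for one network address."""
    html = fetch(url)                                    # S112: crawl
    page = {"url": url, "source": html}                  # S113: first parse
    store("web_analysis_content", page)                  # S114: send for storage
    info = re.findall(r'(?:href|src)="([^"]+)"', html)   # S115: second parse
    store("source_code_parsing_info",
          {"url": url, "info": info})                    # S116: send for storage
    return page, info


stored = []
page, info = run_first_server(
    "https://example.com",
    fetch=lambda u: '<a href="/x">x</a><img src="/y.png">',
    store=lambda kind, payload: stored.append((kind, payload)),
)
```

The point of the ordering is that the raw source is stored before the second parse runs, so a later change to the parsing rules can be applied to the stored source without crawling again.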
An embodiment of the present invention also provides another web data crawling method. Referring to Fig. 1 and Fig. 3, Fig. 3 is another flow diagram of the web data crawling method provided by an embodiment of the present invention. This web data crawling method is applied to a second server and is executed by application software installed on the second server.
As shown in Fig. 3, the method comprises steps S121 to S125.
S121: Receive the target website network address set to be crawled, sent by the user terminal.
In this embodiment, the technical solution is described from the perspective of the second server. The second server can be regarded as the master server: it distributes the network addresses to be crawled to the first servers, and it is divided into multiple storage spaces that store the web analysis content and the source code parsing information sent by the first servers.
In one embodiment, step S121 comprises:
receiving, through a remote data transmission database, the target website network address set to be crawled sent by the user terminal.
The remote data transmission database is a Redis database (Redis stands for Remote Dictionary Server; it is a key-value storage system that supports rich data types). The Redis database receives the target website address set to be crawled sent by the user terminal, and subsets of the target website address set are distributed to the first servers.
S122: Distribute each network address in the target website network address set to a corresponding first server.
In this embodiment, when distributing the network addresses in the target website network address set, the second server may distribute one network address to a given first server, or distribute multiple network addresses to the same first server.
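One possible distribution policy is sketched below. Round-robin assignment is an assumption made for this example; the text only requires that each first server receives one or more network addresses from the set. The server identifiers and URLs are placeholders.

```python
def distribute(urls, server_ids):
    """Assign each URL to a first server in round-robin order."""
    assignment = {sid: [] for sid in server_ids}
    for i, url in enumerate(urls):
        assignment[server_ids[i % len(server_ids)]].append(url)
    return assignment


# Five placeholder URLs shared between two hypothetical first servers.
plan = distribute(["u1", "u2", "u3", "u4", "u5"], ["s1", "s2"])
```

With as many first servers as URLs, this policy reduces to the one-address-per-server case; with fewer servers, each receives a subset of several addresses.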
S123: Receive the web analysis content sent by the first server; the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address.
In this embodiment, when the first server has finished parsing the crawled web page content information and has obtained web analysis content that includes the source code of the web page, the target website URL, the web page crawl time and the file directory of the web page, the web analysis content sent by the first server is stored in the data table of the second server's MySQL database that corresponds to the first server. In other words, the data table corresponding to a first server in the MySQL database of the second server is regarded as the storage region corresponding to that first server. Storing the web analysis content in the second server makes later source-tracing retrieval easier.
S124: Receive the source code parsing information sent by the first server; the source code parsing information is obtained by the first server parsing the source code of the web analysis content.
In this embodiment, when the first server has finished parsing the source code of the web analysis content and has obtained the source code parsing information, the source code parsing information sent by the first server is stored in the data table of the second server's MySQL database that corresponds to the first server. In other words, the data table corresponding to a first server in the MySQL database of the second server is regarded as the storage region corresponding to that first server. Storing the source code parsing information in the second server makes later source-tracing retrieval easier. For the web page content information crawled from one network address, the first parse result (the web analysis content) and the second parse result (the source code parsing information) are stored in the same data table of the same MySQL database, so that everything obtained when a first server crawls and parses a given network address is stored in the same region.
S125: Receive a search condition, and obtain the corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In this embodiment, if one or more of the target websites in the target website network address set (which can be understood as multiple target website URLs to be crawled) is revised, the historical data of the target website URL has already been saved in the storage region of the second server. Therefore, when starting the Docker container to crawl the revised target website URL fails, a search instruction to the second server is triggered with that target website URL, and the multiple storage regions are searched with the target website URL as the search condition. After the source code is obtained from the storage region corresponding to the target website URL, the source code can be parsed a second time, which quickly recovers the original web page file and provides a channel for tracing back.
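Retrieval across storage regions can be sketched in Python as follows. As an assumption for a self-contained example, `sqlite3` stands in for the MySQL database, each storage region is modeled as one table per first server, and the table names, columns and sample row are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
TABLES = ["server_1", "server_2"]  # hypothetical: one table per first server
for t in TABLES:
    db.execute(f"CREATE TABLE {t} "
               "(url TEXT, crawl_time TEXT, source TEXT, parse_info TEXT)")

# Sample historical record, saved before the site was revised.
db.execute("INSERT INTO server_2 VALUES (?, ?, ?, ?)",
           ("https://example.com/a", "2019-01-01T00:00:00Z",
            "<html>old</html>", "css=/m.css"))


def search(url):
    """Look up a URL in every storage region and return the stored rows."""
    hits = []
    for t in TABLES:
        rows = db.execute(
            f"SELECT crawl_time, source, parse_info FROM {t} WHERE url = ?",
            (url,),
        ).fetchall()
        hits.extend((t, *row) for row in rows)
    return hits


results = search("https://example.com/a")
```

The stored `source` column in a hit is what would be handed back to the second parse when a fresh crawl of the revised site fails.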
The method saves the crawled web page content so that the data can be traced back to its source, and also allows the web page content to be parsed a second time.
An embodiment of the present invention also provides a web data crawling device, which is used to execute any embodiment of the foregoing web data crawling method. An embodiment of the present invention also provides a web data crawling system comprising a first server and a second server. Specifically, referring to Fig. 4, Fig. 4 is a schematic block diagram of the web data crawling device provided by an embodiment of the present invention. The web data crawling device 100 can be configured in the first server.
As shown in Fig. 4, the web data crawling device 100 comprises a network address receiving unit 111, a web page content crawling unit 112, a first parsing unit 113, a first sending unit 114, a second parsing unit 115 and a second sending unit 116.
The network address receiving unit 111 is configured to receive a network address distributed by the second server; the network address is a subset of the target website network address set that the second server received from the user terminal.
In this embodiment, the technical solution is described from the perspective of the first server. The first server may be a single load end or multiple load ends. A load end receives the network addresses distributed by the second server, crawls the web page content according to each network address, parses it twice, and sends the content obtained from both parses to the second server for storage, so that the web page content can be traced back to its source.
After receiving the target website network address set uploaded by the user terminal, the second server may send one network address to a first server, or may send multiple network addresses to a first server. The first server then starts the web page crawling task.
In one embodiment, the web data crawling device 100 further comprises:
a container deployment unit, configured to initially deploy an application container engine;
a code skeleton deployment unit, configured to package, in the application container engine, the code skeleton used for crawling web page content and parsing web page content;
a storage region setting unit, configured to set the storage region in the second server that corresponds to the application container engine.
In this embodiment, the application container engine is Docker, referred to below as a Docker container (a developer can package an application and its dependencies into a portable container; a Docker container can be regarded as a miniature system). The code skeleton for crawling and parsing web page content can be packaged in the Docker container, and the code skeleton performs the crawling and the two parses of the web page content. To distinguish the storage region in the second server that corresponds to each Docker container, the second server needs to set up as many storage regions as there are Docker containers and name each storage region after the identifier of its Docker container. In this way, the data parsed in each Docker container can be stored in the corresponding storage region of the second server.
The web page content crawling unit 112 is configured to crawl the web page content information corresponding to the network address through the deployed code skeleton.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the application container engine needs to start the code skeleton packaged in it to crawl the web page content information corresponding to the network address.
In one embodiment, the web data crawling device 100 further comprises:
a connection establishing unit, configured to establish, according to the network address, a connection with the target access end corresponding to the network address.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the first server needs to establish a connection with the target access end corresponding to the network address. Once the connection with the target access end succeeds, web page content can be crawled from that target access end. More specifically, the network address (which can be understood as the URL of the target website) is written into the Docker container deployed in the first server, which means the first server knows the URL of the target access end it is requesting a connection to. After the Docker container deployed in the first server establishes a connection with the target access end, web page content can be crawled from the target access end.
The first parsing unit 113 is configured to parse the web page content information through the code skeleton to obtain web analysis content.
In this embodiment, the web analysis content obtained by parsing includes at least the source code of the web page, the URL of the target website, the crawl time of the web page, and the file directory of the web page.
That is, the Docker container in the first server works as a simulated browser: it loads the target website URL of the target access end and establishes a connection with the target access end. After the source code has been acquired, the source code is saved as a file in the storage region of the second server that corresponds to the Docker container, and the index information of the source code is stored at the same time in the MySQL database of the second server (MySQL is an open-source relational database management system). The web page information stored in the MySQL database of the second server also includes the target website URL, the web page crawl time and the file directory of the web page. This process is the first parse of the web page content information: each piece of crawled web page content information is parsed and then sent to the second server for storage.
The first sending unit 114 is configured to send the web analysis content to the storage region in the second server that corresponds to the first server for storage.
In this embodiment, the web analysis content is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the crawled web page content can later be queried in the second server and thereby traced back to its source.
Second resolution unit 115, for solving the source code in the web analysis content by the code skeletonAnalysis obtains corresponding source code parsing information.
In one embodiment, the second resolution unit 115 includes:
a rule cluster acquiring unit, configured to obtain the regular expression rule cluster that is constructed in advance in the code skeleton for identifying source code;
a segmented source code acquiring unit, configured to obtain, by means of the regular expression rule cluster, the segmented source codes in the source code that correspond one-to-one to the regular expression rules in the regular expression rule cluster;
a segmented source code parsing unit, configured to parse each segmented source code with the segment-parsing code that corresponds to the regular expression rule of that segmented source code, to obtain the segmented source code parsing information corresponding to each segmented source code;
an information combining unit, configured to combine the pieces of segmented source code parsing information, to obtain the source code parsing information.
In this embodiment, an association is first established in the code skeleton between the source code parsing codes {code1, code2, ..., codem} and the regular expression rules for the source code to be parsed; that is, the regular expression rule cluster {rule1, rule2, ..., rulem} is constructed. If the content of the source code satisfies a regular expression rule, 1 is returned; otherwise 0 is returned. Applying the regular expression rule cluster {rule1, rule2, ..., rulem} to the source code identifies the segmented source codes and the segment-parsing code codei corresponding to each of them. Each segmented source code is then parsed with its codei to obtain the corresponding segmented source code parsing information, and the pieces of segmented source code parsing information are combined to obtain the source code parsing information (here, combining means concatenating the pieces in series, separated from one another by a separator). The regular expression rule cluster and the segment-parsing codes thus implement a second parsing of the source code in the web analysis content, which allows the attributes of the web page contents to be mined in greater depth (for example CSS files and JS files, where CSS stands for Cascading Style Sheets and a JS file is a front-end script file of the web page).
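The rule-cluster parsing described above can be illustrated with a short Python sketch. It is only one possible reading of the text: the two regular expressions, the parser functions paired with them, the `|` separator and all names (`RULE_CLUSTER`, `rule_matches`, `parse_source`) are assumptions chosen for illustration, since the patent does not disclose concrete rules or parsing codes.

```python
import re
from typing import Callable, List, Tuple

# Each entry pairs a rule (rule_i) with its segment-parsing code (code_i).
# Both the patterns and the parsers are illustrative assumptions.
RULE_CLUSTER: List[Tuple[re.Pattern, Callable[[re.Match], str]]] = [
    (re.compile(r'<link[^>]*href="([^"]+\.css)"', re.IGNORECASE),
     lambda m: "css=" + m.group(1)),
    (re.compile(r'<script[^>]*src="([^"]+\.js)"', re.IGNORECASE),
     lambda m: "js=" + m.group(1)),
]

# Assumed separator placed between pieces of segment parsing information.
SEPARATOR = "|"


def rule_matches(rule: re.Pattern, source: str) -> int:
    """Return 1 if the source satisfies the rule, otherwise 0."""
    return 1 if rule.search(source) else 0


def parse_source(source: str) -> str:
    """Apply every rule, parse each matching segment, and join the results."""
    parts = []
    for rule, parser in RULE_CLUSTER:
        if rule_matches(rule, source) == 0:
            continue  # this rule identifies no segment in the source
        for match in rule.finditer(source):
            parts.append(parser(match))  # segment parsing information
    return SEPARATOR.join(parts)
```

Pairing each rule with its own parser keeps the cluster extensible: supporting a new attribute only requires appending one more (rule, parser) entry.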
The second transmission unit 116 is configured to send the source code parsing information to the storage region of the second server that corresponds to the first server, for storage.
In this embodiment, the source code parsing information is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the source code parsing information of the crawled web page contents can later be queried in the second server and thereby traced back to its source.
The embodiment of the present invention also provides a web data crawling device, which is used to execute any embodiment of the foregoing web data crawling method. Specifically, referring to Fig. 5, Fig. 5 is another schematic block diagram of the web data crawling device provided by the embodiment of the present invention. The web data crawling device 100 can be configured in the second server.
As shown in Fig. 5, the web data crawling device 100 includes an address set receiving unit 121, a network address distribution unit 122, a first storage unit 123, a second storage unit 124 and a retrieval unit 125.
The address set receiving unit 121 is configured to receive the target website network address set to be crawled that is sent by the user terminal.
In this embodiment, the technical solution is described from the perspective of the second server. The second server can be regarded as the primary server: it distributes the network addresses to be crawled to the first servers, and it is divided into multiple storage spaces that store the web analysis content and the source code parsing information sent by the first servers.
In one embodiment, the address set receiving unit 121 is specifically configured to:
receive, through a remote dictionary service database (Redis), the target website network address set to be crawled that is sent by the user terminal.
Here, the remote dictionary service database is the Redis database (Redis stands for Remote Dictionary Server; it is a key-value storage system that supports a rich set of data types). The Redis database receives the target website network address set to be crawled that is sent by the user terminal, and subsets of the target website network address set are distributed to the first servers.
The network address distribution unit 122 is configured to distribute each network address in the target website network address set to the corresponding first server.
In this embodiment, when the second server distributes the network addresses in the target website network address set, it may distribute a single network address to a given first server, or it may distribute multiple network addresses to the same first server.
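As a rough illustration of this distribution step, the following Python sketch splits a network address set across first servers. The round-robin policy, the function name `distribute` and the server labels are assumptions; the text only states that a first server may receive one network address or several, and a plain list is used here in place of the Redis database.

```python
from typing import Dict, List


def distribute(urls: List[str], servers: List[str]) -> Dict[str, List[str]]:
    """Assign each network address to a first server in round-robin order."""
    if not servers:
        raise ValueError("at least one first server is required")
    assignment: Dict[str, List[str]] = {server: [] for server in servers}
    for index, url in enumerate(urls):
        # Assumed policy: cycle through the servers so the subsets stay balanced.
        assignment[servers[index % len(servers)]].append(url)
    return assignment
```

With as many servers as addresses, each first server receives exactly one network address; with fewer servers, some receive several, which covers both cases mentioned in the text.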
The first storage unit 123 is configured to receive the web analysis content sent by the first server; the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address.
In this embodiment, once the first server has parsed the crawled web page content information and obtained web analysis content that includes information such as the source code of the web page, the URL of the target website, the crawl time of the web page and the file directory of the web page, the web analysis content sent by the first server is stored in the data table that corresponds to the first server in the MySQL database of the second server. In other words, the data table that corresponds to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the web analysis content in the second server facilitates later source-tracing retrieval.
The second storage unit 124 is configured to receive the source code parsing information sent by the first server; the source code parsing information is obtained by the first server parsing the source code of the web analysis content.
In this embodiment, once the first server has parsed the source code of the web analysis content and obtained the source code parsing information, the source code parsing information sent by the first server is stored in the data table that corresponds to the first server in the MySQL database of the second server. In other words, the data table that corresponds to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the source code parsing information in the second server facilitates later source-tracing retrieval. For the web page content information crawled from a given network address, the first parsing content (the web analysis content) and the second parsing content (the source code parsing information) are stored in the same data table of the same MySQL database, so that the information obtained after the first server crawls and parses a given network address is kept in the same region.
The retrieval unit 125 is configured to receive a search condition and to obtain the corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In this embodiment, suppose that one or more of the websites in the target website network address set to be crawled (which can be understood as multiple target website URLs) are redesigned. Because the storage regions of the second server already hold the historical data of those target website URLs, when the Docker container is started and crawling a redesigned target website URL fails, that target website URL triggers a search instruction to the second server, and retrieval is performed in the multiple storage regions with the target website URL as the search condition. After the source code is obtained from the storage region corresponding to the target website URL, a second parsing can be performed on that source code and the original web page files can be recovered quickly, which provides a channel for tracing back.
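The storage and retrieval behaviour described above can be sketched as follows. This is a hedged illustration only: Python's built-in `sqlite3` module is swapped in for the MySQL database so the example is self-contained, one table per first server stands in for a storage region, and the table names, column names and function names are all hypothetical.

```python
import sqlite3
from typing import Iterable, Optional


def _check(table: str) -> str:
    # Table names cannot be bound as SQL parameters, so only plain
    # identifiers are accepted before they are placed into a statement.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    return table


def create_region(conn: sqlite3.Connection, table: str) -> None:
    """Create the table that acts as the storage region of one first server."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {_check(table)} ("
        "url TEXT, crawl_time TEXT, source TEXT, source_parse_info TEXT)"
    )


def store(conn: sqlite3.Connection, table: str, url: str, crawl_time: str,
          source: str, source_parse_info: str) -> None:
    """Keep the first and second parsing results of a URL in the same row."""
    conn.execute(
        f"INSERT INTO {_check(table)} VALUES (?, ?, ?, ?)",
        (url, crawl_time, source, source_parse_info),
    )


def find_source(conn: sqlite3.Connection, tables: Iterable[str],
                url: str) -> Optional[str]:
    """Search every storage region for the most recently saved source of a URL."""
    for table in tables:
        row = conn.execute(
            f"SELECT source FROM {_check(table)} WHERE url = ? "
            "ORDER BY crawl_time DESC LIMIT 1",
            (url,),
        ).fetchone()
        if row is not None:
            return row[0]
    return None
```

A caller would invoke `find_source` with the failing URL when a crawl of a redesigned site fails, then run the second parsing on the returned source code; crawl times are assumed to be ISO 8601 strings so that text ordering matches time ordering.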
The device saves the crawled web page contents so that the data can be traced back to its source, and it can also perform a second parsing of the web page contents.
The above web data crawling device can be implemented in the form of a computer program, and the computer program can run on computer equipment as shown in Fig. 6.
Referring to Fig. 6, Fig. 6 is a schematic block diagram of the computer equipment provided by the embodiment of the present invention. The computer equipment 500 is a server, which may be an independent server or a server cluster composed of multiple servers.
Referring to Fig. 6, the computer equipment 500 includes a processor 502, a memory and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 can cause the processor 502 to execute the web data crawling method.
The processor 502 provides computing and control capability and supports the operation of the entire computer equipment 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the web data crawling method.
The network interface 505 is used for network communication, such as the transmission of data information. Those skilled in the art can understand that the structure shown in Fig. 6 is only a block diagram of the part of the structure that is relevant to the solution of the present invention and does not limit the computer equipment 500 to which the solution is applied; a specific computer equipment 500 may include more or fewer components than shown in the figure, may combine certain components, or may have a different component arrangement.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: receiving the network address distributed by the second server, the network address being a subset of the target website network address set that the second server receives from the user terminal; crawling the web page content information corresponding to the network address by means of the deployed code skeleton; parsing the web page content information by means of the code skeleton to obtain web analysis content; sending the web analysis content to the storage region of the second server that corresponds to the first server, for storage; parsing the source code in the web analysis content by means of the code skeleton to obtain the corresponding source code parsing information; and sending the source code parsing information to the storage region of the second server that corresponds to the first server, for storage.
In one embodiment, before executing the step of receiving the network address distributed by the second server, the processor 502 also performs the following operations: initially deploying an application container engine; packaging, in the application container engine, the code skeleton for crawling web page contents and parsing web page contents; and setting the storage region in the second server that corresponds to the application container engine.
In one embodiment, before executing the step of crawling the web page content information corresponding to the network address by means of the deployed code skeleton, the processor 502 also performs the following operation: establishing, according to the network address, a connection with the target access end corresponding to the network address.
In one embodiment, when executing the step of parsing the source code in the web analysis content by means of the code skeleton to obtain the corresponding source code parsing information, the processor 502 performs the following operations: obtaining the regular expression rule cluster that is constructed in advance in the code skeleton for identifying source code; obtaining, by means of the regular expression rule cluster, the segmented source codes in the source code that correspond one-to-one to the regular expression rules in the regular expression rule cluster; parsing each segmented source code with the segment-parsing code that corresponds to the regular expression rule of that segmented source code, to obtain the segmented source code parsing information corresponding to each segmented source code; and combining the pieces of segmented source code parsing information to obtain the source code parsing information.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: receiving the target website network address set to be crawled that is sent by the user terminal; distributing each network address in the target website network address set to the corresponding first server; receiving the web analysis content sent by the first server, the web analysis content being obtained by the first server crawling and parsing the web page content information corresponding to the network address; receiving the source code parsing information sent by the first server, the source code parsing information being obtained by the first server parsing the source code of the web analysis content; and receiving a search condition and obtaining the corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In one embodiment, when executing the step of receiving the target website network address set to be crawled that is sent by the user terminal, the processor 502 performs the following operation: receiving, through a remote dictionary service database (Redis), the target website network address set to be crawled that is sent by the user terminal.
Those skilled in the art will understand that the embodiment of the computer equipment shown in Fig. 6 does not limit the specific composition of the computer equipment. In other embodiments, the computer equipment may include more or fewer components than shown, may combine certain components, or may have a different component arrangement. For example, in some embodiments the computer equipment may include only a memory and a processor; in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in Fig. 6 and are not described again here.
It should be understood that, in the embodiment of the present invention, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and it may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Another embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps: receiving the network address distributed by the second server, the network address being a subset of the target website network address set that the second server receives from the user terminal; crawling the web page content information corresponding to the network address by means of the deployed code skeleton; parsing the web page content information by means of the code skeleton to obtain web analysis content; sending the web analysis content to the storage region of the second server that corresponds to the first server, for storage; parsing the source code in the web analysis content by means of the code skeleton to obtain the corresponding source code parsing information; and sending the source code parsing information to the storage region of the second server that corresponds to the first server, for storage.
In one embodiment, before the receiving of the network address distributed by the second server, the steps further include: initially deploying an application container engine; packaging, in the application container engine, the code skeleton for crawling web page contents and parsing web page contents; and setting the storage region in the second server that corresponds to the application container engine.
In one embodiment, before the crawling of the web page content information corresponding to the network address by means of the deployed code skeleton, the steps further include: establishing, according to the network address, a connection with the target access end corresponding to the network address.
In one embodiment, the parsing of the source code in the web analysis content by means of the code skeleton to obtain the corresponding source code parsing information includes: obtaining the regular expression rule cluster that is constructed in advance in the code skeleton for identifying source code; obtaining, by means of the regular expression rule cluster, the segmented source codes in the source code that correspond one-to-one to the regular expression rules in the regular expression rule cluster; parsing each segmented source code with the segment-parsing code that corresponds to the regular expression rule of that segmented source code, to obtain the segmented source code parsing information corresponding to each segmented source code; and combining the pieces of segmented source code parsing information to obtain the source code parsing information.
Another embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps: receiving the target website network address set to be crawled that is sent by the user terminal; distributing each network address in the target website network address set to the corresponding first server; receiving the web analysis content sent by the first server, the web analysis content being obtained by the first server crawling and parsing the web page content information corresponding to the network address; receiving the source code parsing information sent by the first server, the source code parsing information being obtained by the first server parsing the source code of the web analysis content; and receiving a search condition and obtaining the corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In one embodiment, the receiving of the target website network address set to be crawled that is sent by the user terminal includes: receiving, through a remote dictionary service database (Redis), the target website network address set to be crawled that is sent by the user terminal.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the equipment, devices and units described above can be found in the corresponding processes of the foregoing method embodiments and are not described again here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented with electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are implemented in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed equipment, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; units with the same function may be gathered into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause computer equipment (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, or an optical disc.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.


Application number: CN201910012240.6A
Priority date: 2019-01-07
Filing date: 2019-01-07
Title: Webpage data crawling method, device, system, computer equipment and storage medium
Status: Active
Granted publication: CN109885744B (en)

Publications (2)
CN109885744A (en), published 2019-06-14
CN109885744B (en), published 2024-05-10

Family ID: 66925622

Country Status (1)
CN: CN109885744B (en)



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
