CN104657355B - Method and system for concurrent web page crawling - Google Patents

Method and system for concurrent web page crawling

Info

Publication number
CN104657355B
Authority
CN
China
Prior art keywords
crawl
tps
parameter
concurrent
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310575226.XA
Other languages
Chinese (zh)
Other versions
CN104657355A (en
Inventor
金伟
孟凡光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201310575226.XA
Publication of CN104657355A
Application granted
Publication of CN104657355B
Legal status: Active
Anticipated expiration

Abstract

This application provides a method and system for concurrent web page crawling. The method comprises: processing pending crawl requests concurrently and listening for the processing event messages corresponding to the processed crawl requests; analyzing the processing event messages to obtain current crawl index parameters; and, when a current crawl index parameter exceeds a preset safe range, lowering the concurrency of the concurrent web page crawl. The application can improve the website's response speed during concurrent web page crawling.

Description

Method and system for concurrent web page crawling
Technical field
This application relates to the field of network technology, and in particular to a method and system for concurrent web page crawling.
Background
A search engine is a system that collects information from the Internet according to a certain strategy using specific computer programs, organizes and processes that information, provides a retrieval service to users, and presents the information relevant to a user's query. To collect information from the Internet, the search engine relies on web crawlers that crawl the information on the relevant websites.
A web crawler is a program that fetches web page content automatically, and it is an important component of a search engine.
In the prior art, for a general-purpose search engine, a traditional crawler starts from the URLs (Uniform Resource Locators) of one or more seed pages, obtains the URLs on those pages, and, while crawling, continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
Web crawlers in the prior art analyze page content poorly and can only crawl site information mechanically and continuously, often issuing tens or hundreds of concurrent requests in repeated cycles. Because most websites have limited processing capacity, a large number of concurrent requests easily slows the website's response or even crashes it.
Summary of the invention
The technical problem to be solved by this application is to provide a method and system for concurrent web page crawling that can improve the website's response speed during concurrent web page crawling.
To solve the above problem, this application discloses a method for concurrent web page crawling, comprising:
processing pending crawl requests concurrently, and listening for the processing event messages corresponding to the processed crawl requests;
analyzing the processing event messages to obtain current crawl index parameters;
when a current crawl index parameter exceeds a preset safe range, lowering the concurrency of the concurrent web page crawl.
Preferably, the method further comprises:
when a current crawl index parameter is below the preset safe range, raising the concurrency of the concurrent web page crawl.
Preferably, the step of analyzing the processing event messages to obtain current crawl index parameters comprises:
obtaining the processing event messages corresponding to the crawl requests processed in each time period;
analyzing, on their own, the processing event messages corresponding to the crawl requests processed in the current time period, and/or comparing the processing event messages corresponding to the crawl requests processed in adjacent time periods, to obtain the current crawl index parameters.
Preferably, before the step of processing the pending crawl requests concurrently, the method further comprises:
when the current actual number of transactions per second (TPS) does not exceed the highest safe upper-limit TPS, admitting the pending crawl requests for processing;
the step of processing the pending crawl requests concurrently is then, specifically, processing the admitted pending crawl requests concurrently.
Preferably, the crawl index parameters comprise one or more of a website response index parameter and a network condition parameter.
Preferably, the website response index parameter comprises one or more of a response time parameter and a response ratio parameter;
wherein the response time parameter indicates the website's response time to a processed crawl request, and the response ratio parameter indicates, for each time period, the proportion of processed crawl requests whose response time is within the preset safe range among all crawl requests processed in that period.
Preferably, the network condition parameter between the website and the crawler comprises one or more of an error count parameter, an error rate parameter, a crawl speed parameter, and a crawl speed ratio parameter;
wherein the error count parameter indicates the number of processed crawl requests that ended in an exception or error, the error rate parameter indicates the ratio by which the error count parameter in the current time period has increased relative to the error count parameter in the previous time period, and the crawl speed ratio parameter indicates a ratio less than or equal to the current crawl speed parameter.
Preferably, the step of lowering the concurrency of the concurrent web page crawl when a current crawl index parameter exceeds the preset safe range comprises:
lowering the number of concurrent threads of the concurrent crawl processing, and/or lowering the TPS of the concurrent crawl processing.
Preferably, the step of lowering the TPS of the concurrent crawl processing comprises:
lowering the TPS of the concurrent crawl processing according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS indicates the historical highest TPS at which the crawl index parameters did not exceed the preset safe range.
Preferably, before the step of processing the pending crawl requests concurrently, the method further comprises the following steps for obtaining the highest safe upper-limit TPS:
setting the initial value of the safe upper-limit TPS to a preset large value;
increasing the number of concurrent threads of the concurrent crawl processing step by step until it reaches the maximum number of concurrent threads;
processing pending crawl requests concurrently according to the current safe upper-limit TPS;
when the current crawl index parameters do not exceed the preset safe range, raising the current safe upper-limit TPS;
when a current crawl index parameter exceeds the preset safe range, lowering the current safe upper-limit TPS;
recording the safe upper-limit TPS after it has been raised or lowered;
selecting the highest of the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
In another aspect, this application also discloses a system for concurrent web page crawling, comprising:
a request processing module, configured to process pending crawl requests concurrently;
a message listening module, configured to listen for the processing event messages corresponding to the processed crawl requests;
a message analysis module, configured to analyze the processing event messages to obtain current crawl index parameters; and
a concurrency lowering module, configured to lower the concurrency of the concurrent web page crawl when a current crawl index parameter exceeds a preset safe range.
Preferably, the system further comprises:
a concurrency raising module, configured to raise the concurrency of the concurrent web page crawl when a current crawl index parameter is below the preset safe range.
Preferably, the message analysis module comprises:
a message acquisition submodule, configured to obtain the processing event messages corresponding to the crawl requests processed in each time period;
a message analysis submodule, configured to analyze, on their own, the processing event messages corresponding to the crawl requests processed in the current time period, and/or to compare the processing event messages corresponding to the crawl requests processed in adjacent time periods, to obtain the current crawl index parameters.
Preferably, the system further comprises:
an admission module, configured to admit the pending crawl requests for processing, before the operation of processing the pending crawl requests concurrently, when the current actual number of transactions per second (TPS) does not exceed the highest safe upper-limit TPS;
the request processing module is then specifically configured to process the admitted pending crawl requests concurrently.
Preferably, the crawl index parameters comprise one or more of a website response index parameter and a network condition parameter.
Preferably, the website response index parameter comprises one or more of a response time parameter and a response ratio parameter;
wherein the response time parameter indicates the website's response time to a processed crawl request, and the response ratio parameter indicates, for each time period, the proportion of processed crawl requests whose response time is within the preset safe range among all crawl requests processed in that period.
Preferably, the network condition parameter between the website and the crawler comprises one or more of an error count parameter, an error rate parameter, a crawl speed parameter, and a crawl speed ratio parameter;
wherein the error count parameter indicates the number of processed crawl requests that ended in an exception or error, the error rate parameter indicates the ratio by which the error count parameter in the current time period has increased relative to the error count parameter in the previous time period, and the crawl speed ratio parameter indicates a ratio less than or equal to the current crawl speed parameter.
Preferably, the concurrency lowering module comprises:
a first lowering submodule, configured to lower the number of concurrent threads of the concurrent crawl processing; and/or
a second lowering submodule, configured to lower the TPS of the concurrent crawl processing.
Preferably, the second lowering submodule is specifically configured to lower the TPS of the concurrent crawl processing according to the difference between the highest safe upper-limit TPS and the current TPS; wherein the highest safe upper-limit TPS indicates the historical highest TPS at which the crawl index parameters did not exceed the preset safe range.
Preferably, the system further comprises an upper-limit TPS acquisition module, configured to obtain the highest safe upper-limit TPS before the operation of processing the pending crawl requests concurrently;
the upper-limit TPS acquisition module comprises:
a setting submodule, configured to set the initial value of the safe upper-limit TPS to a preset large value;
a step-up submodule, configured to increase the number of concurrent threads of the concurrent crawl processing step by step until it reaches the maximum number of concurrent threads;
a concurrent processing submodule, configured to process pending crawl requests concurrently according to the current safe upper-limit TPS;
a raising submodule, configured to raise the current safe upper-limit TPS when the current crawl index parameters do not exceed the preset safe range;
a lowering submodule, configured to lower the current safe upper-limit TPS when a current crawl index parameter exceeds the preset safe range;
a recording submodule, configured to record the safe upper-limit TPS after it has been raised or lowered; and
a selection submodule, configured to select the highest of the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
Compared with the prior art, this application has the following advantages:
First, when a current crawl index parameter exceeds the preset safe range, this application lowers the concurrency of the concurrent web page crawl, where the crawl index parameters measure the website's load state during the crawl. Lowering the concurrency means lowering the number of crawl requests, that is, the frequency at which the website is requested, which reduces the load on the website's server during the crawl. By lowering the concurrency, this application can therefore keep the crawl index parameters, and hence the website load, within the preset safe range. This avoids the situation in which a large number of concurrent requests slows the website's response or even crashes it, and so improves the website's response speed during concurrent web page crawling;
Second, this application can also raise the concurrency of the concurrent web page crawl when a current crawl index parameter is below the preset safe range. Raising the concurrency means raising the number of crawl requests, that is, the frequency at which the website is requested. By raising the concurrency, this application can therefore bring the crawl index parameters close to, but not beyond, the preset safe range; for example, the crawl speed can be brought close to an ideal value. This application can thus ensure both the website's response speed and the crawler's crawl speed;
Further, this application can admit pending crawl requests for processing only when the current actual number of transactions per second (TPS) does not exceed the highest safe upper-limit TPS. This admission mechanism refuses to process the pending crawl requests that would exceed the highest safe upper-limit TPS, so the current actual TPS is kept strictly within the highest safe upper-limit TPS. The website load is thereby controlled further, which further improves the website's response speed during concurrent web page crawling.
Brief description of the drawings
Fig. 1 is a flowchart of Embodiment 1 of a method for concurrent web page crawling of this application;
Fig. 2 is a flowchart of Embodiment 2 of a method for concurrent web page crawling of this application;
Fig. 3 is a flowchart of Embodiment 3 of a method for concurrent web page crawling of this application;
Fig. 4 is a structure diagram of an embodiment of a system for concurrent web page crawling of this application;
Fig. 5 is a schematic structural diagram of an embodiment of a system for concurrent web page crawling of this application.
Detailed description
To make the above objects, features, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a flowchart of Embodiment 1 of a method for concurrent web page crawling of this application is shown, which may specifically include:
Step 101: process pending crawl requests concurrently, and listen for the processing event messages corresponding to the processed crawl requests;
In this embodiment, a pending crawl request denotes a crawl request that has not yet been processed. During concurrent web page crawling, pending crawl requests can be generated from the new URLs extracted from the current page and placed in a request queue. Before processing, a pending crawl request is taken from the request queue and checked to see whether it has already been processed; if so, it is discarded, otherwise it is processed concurrently.
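The queueing and duplicate check described above could be sketched as follows. This is only a minimal illustration: `PendingQueue`, `enqueue`, and `next_unprocessed` are hypothetical names that do not appear in the application, and tracking processed URLs in an in-memory set is an assumption.

```python
from collections import deque


class PendingQueue:
    """Holds pending crawl requests (URLs) and skips ones already processed."""

    def __init__(self):
        self._queue = deque()
        self._processed = set()  # assumption: processed URLs fit in memory

    def enqueue(self, url):
        # New URLs extracted from the current page become pending crawl requests.
        self._queue.append(url)

    def next_unprocessed(self):
        # Take requests from the queue until one is found that has not been
        # processed before; requests seen earlier are discarded.
        while self._queue:
            url = self._queue.popleft()
            if url not in self._processed:
                self._processed.add(url)
                return url
        return None  # queue is empty
```

A caller would hand each URL returned by `next_unprocessed` to the concurrent workers and stop when it returns `None`.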
In practice, processing a pending crawl request concurrently can mean that the web crawler sends the pending crawl request to the corresponding website and fetches the corresponding content. It will be understood that this application does not limit the specific processing method.
This embodiment listens to the communication between the web crawler and the website, in which the website returns processing event messages to the web crawler. The processing event messages may specifically include a response time message, a crawl success message, a crawl exception message, and so on, each corresponding to a processed crawl request. This application does not limit the specific processing event messages.
Step 102: analyze the processing event messages to obtain current crawl index parameters;
In this embodiment, the crawl index parameters measure the website's load state during concurrent web page crawling. If a parameter is within the preset safe range, the website's load state is good, the website can handle crawl requests normally, and the concurrency of the concurrent web page crawl need not be lowered. If it exceeds the preset safe range, the website is fully loaded and cannot handle more concurrent requests, and more concurrent requests could easily crash the website, so the concurrency needs to be lowered.
In a preferred embodiment of this application, the crawl index parameters may specifically include one or more of a website response index parameter and a network condition parameter. The website response index parameter can be used to assess whether the website's responsiveness is normal; keeping it within the preset safe range avoids the situation in which a large number of concurrent requests slows the website's response or even crashes it, and therefore improves the website's response speed. The network condition parameter can be used to assess whether the network condition between the website and the crawler is normal; keeping it within the preset safe range avoids the situation in which a large number of crawl requests cannot be processed while the network condition is abnormal, and therefore improves the crawl speed.
In a preferred embodiment of this application, the website response index parameter may specifically include one or more of a response time parameter and a response ratio parameter. The response time parameter can indicate the website's response time to a processed crawl request, and the response ratio parameter can indicate, for each time period, the proportion of processed crawl requests whose response time is within the preset safe range among all crawl requests processed in that period.
In another preferred embodiment of this application, the network condition parameter between the website and the crawler may specifically include one or more of an error count parameter, an error rate parameter, a crawl speed parameter, and a crawl speed ratio parameter. The error count parameter can indicate the number of processed crawl requests that ended in an exception or error, the error rate parameter can indicate the ratio by which the error count parameter in the current time period has increased relative to the error count parameter in the previous time period, and the crawl speed ratio parameter indicates a ratio less than or equal to the current crawl speed parameter.
In one application example of this application, assume that the time period is 1 minute long, that the concurrency of the concurrent web page crawl in the previous period was 100, and that the concurrency in the current period is 120:
1) Assume the preset safe range for the response time parameter is 200 ms; then the website's responsiveness is normal when the response time parameter of a processed crawl request is within 200 ms;
2) Assume the preset safe range for the response ratio parameter is 80%; then the website's responsiveness is considered normal only when, in a given period, the processed crawl requests with a response time within 200 ms make up at least 80% of all crawl requests processed in that period;
3) Assume the preset safe range for the error count parameter is 20; then the network condition between the website and the crawler is considered normal when, in a given period, no more than 20 processed crawl requests return an exception for reasons such as a timeout or a server error;
4) Assume the preset safe range for the error rate parameter is 10% and the number of errors among the crawl requests processed in the first period is 20; then the network condition between the website and the crawler is considered abnormal when the number of errors in the second period exceeds 22.
It will be understood that the response time parameter and response ratio parameter above are only preferred examples of crawl index parameters and should not be understood as limiting this application; those skilled in the art can use various crawl index parameters according to actual needs.
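The four checks in the application example above could be combined roughly as follows. This is a sketch under stated assumptions: the thresholds (200 ms, 80%, 20 errors, 10% growth) come from the example, while `PeriodStats`, `exceeds_safe_range`, and the choice to treat any single violation as "out of range" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    response_times_ms: list  # response time of each processed crawl request
    error_count: int         # requests that returned an exception (timeout, server error)


def exceeds_safe_range(current, previous,
                       max_response_ms=200, min_response_ratio=0.80,
                       max_errors=20, max_error_growth=0.10):
    """Return True if any crawl index parameter is outside its preset safe range."""
    total = len(current.response_times_ms)
    within = sum(1 for t in current.response_times_ms if t <= max_response_ms)
    # Response ratio: share of requests answered within the response time limit.
    response_ratio = within / total if total else 1.0
    if response_ratio < min_response_ratio:
        return True
    # Error count: absolute number of failed requests in the current period.
    if current.error_count > max_errors:
        return True
    # Error rate: growth of the error count relative to the previous period.
    if previous.error_count > 0:
        growth = (current.error_count - previous.error_count) / previous.error_count
        if growth > max_error_growth:
            return True
    return False
```

With 20 errors in the previous period, 22 errors in the current period is exactly 10% growth and still passes the error rate check, while 23 errors does not, which matches item 4 of the example.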
In a preferred embodiment of this application, step 102 of analyzing the processing event messages to obtain current crawl index parameters may specifically include:
Sub-step S101: obtain the processing event messages corresponding to the crawl requests processed in each time period;
Sub-step S102: analyze, on their own, the processing event messages corresponding to the crawl requests processed in the current time period, and/or compare the processing event messages corresponding to the crawl requests processed in adjacent time periods, to obtain the current crawl index parameters.
Those skilled in the art can determine the length of the time period according to the actual situation, for example half a minute, 1 minute, or 2 minutes; this application does not limit the specific period length.
In one application example of this application, the current crawl speed can be obtained by independent analysis from the concurrency in a given period and the length of that period.
In another application example of this application, the current error rate parameter can be obtained by comparative analysis from the numbers of errors among the crawl requests processed in the current period and in the previous period.
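The split into per-period messages, independent analysis, and comparative analysis could look like the sketch below. The event shape `(timestamp_s, is_error)` and the function names are assumptions made for illustration; the application does not specify a message format.

```python
from collections import defaultdict


def bucket_by_period(events, period_s=60):
    """Group (timestamp_s, is_error) events by time-period index."""
    buckets = defaultdict(list)
    for timestamp_s, is_error in events:
        buckets[int(timestamp_s // period_s)].append(is_error)
    return buckets


def crawl_speed(bucket, period_s=60):
    # Independent analysis: requests processed per second within one period.
    return len(bucket) / period_s


def error_increase_ratio(current_bucket, previous_bucket):
    # Comparative analysis: growth in error count versus the previous period.
    prev_errors = sum(previous_bucket)
    cur_errors = sum(current_bucket)
    if prev_errors == 0:
        return None  # assumption: the ratio is undefined without a baseline
    return (cur_errors - prev_errors) / prev_errors
```

Each bucket holds one boolean per processed request, so `len` gives the request count for the period and `sum` gives its error count.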
Step 103: when a current crawl index parameter exceeds the preset safe range, lower the concurrency of the concurrent web page crawl.
The concurrency of the concurrent web page crawl denotes the number of crawl requests that the web crawler sends to the website. Lowering the concurrency means lowering the number of crawl requests, that is, the frequency at which the website is requested, which reduces the load on the website's server during the crawl. By lowering the concurrency, this application can therefore keep the crawl index parameters, and hence the website load, within the preset safe range.
In practice, each crawl index parameter can have its own preset safe range. Those skilled in the art can use one or more crawl index parameters during concurrent web page crawling according to actual needs; clearly, the more crawl index parameters are used, the stricter the condition for lowering becomes.
This application can provide the following technical solutions for lowering the concurrency of the concurrent web page crawl:
Technical solution 1:
Lower the number of concurrent threads of the concurrent crawl processing.
Lowering the number of concurrent threads reduces the number of crawl requests being processed, and therefore reduces the concurrency of the concurrent web page crawl.
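One way to lower the effective number of concurrent threads at runtime is to cap the number of in-flight requests with an adjustable limit, as in the sketch below. `AdjustableLimiter` and its methods are hypothetical names, and the cap-based design is an assumption rather than the mechanism described in the application.

```python
import threading


class AdjustableLimiter:
    """Caps the number of in-flight crawl requests; the cap can change at runtime."""

    def __init__(self, limit):
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking: returns False when the current cap is already reached.
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self):
        # Called by a worker when its crawl request has finished.
        with self._lock:
            self._in_flight -= 1

    def set_limit(self, limit):
        # Lowering the cap does not interrupt running requests; it only keeps
        # new ones from starting until enough running ones have finished.
        with self._lock:
            self._limit = max(1, limit)
```

Worker threads would call `try_acquire` before sending a request and `release` afterwards, while the monitoring code calls `set_limit` when a crawl index parameter leaves its safe range.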
Technical solution 2:
Lower the TPS of the concurrent crawl processing.
TPS (Transactions Per Second) is a unit of measurement for software test results. In this embodiment, one transaction denotes the process in which the web crawler sends a crawl request to the website's server and the server responds. Specifically, timing can start when the crawl request is sent and stop when the server's response is received; the response time and the number of transactions completed in a period are calculated from this.
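The timing scheme described above could be sketched like this. `TpsMeter` is a hypothetical helper, and the injectable `clock` parameter is an assumption added so that the measurement can be exercised without real network calls.

```python
import time


class TpsMeter:
    """Times each transaction (request sent until response received) and reports TPS."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._durations = []           # response time of each completed transaction
        self._window_start = clock()   # start of the measurement period

    def timed(self, send_request):
        # Start timing when the crawl request is sent, stop when the response arrives.
        start = self._clock()
        result = send_request()
        self._durations.append(self._clock() - start)
        return result

    def tps(self):
        # Completed transactions divided by the time elapsed in this period.
        elapsed = self._clock() - self._window_start
        return len(self._durations) / elapsed if elapsed > 0 else 0.0
```

Here `send_request` stands for any callable that performs one crawl request and returns once the server has responded.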
In a preferred embodiment of this application, the step of lowering the TPS of the concurrent crawl processing may specifically include:
Sub-step S201: lower the TPS of the concurrent crawl processing according to the difference between the highest safe upper-limit TPS and the current TPS; where the highest safe upper-limit TPS can indicate the historical highest TPS at which the crawl index parameters did not exceed the preset safe range.
For example, in one application example of this application, the lowering can be expressed as:
TPS after lowering = (highest safe upper-limit TPS - current TPS) / 2    (1)
In a preferred embodiment of this application, before step 101 of processing the pending crawl requests concurrently, the method may further include the following steps for obtaining the highest safe upper-limit TPS:
Step S301: set the initial value of the safe upper-limit TPS to a preset large value;
Step S302: increase the number of concurrent threads of the concurrent crawl processing step by step until it reaches the maximum number of concurrent threads;
Step S303: process pending crawl requests concurrently according to the current safe upper-limit TPS;
Step S304: when the current crawl index parameters do not exceed the preset safe range, raise the current safe upper-limit TPS;
Step S305: when a current crawl index parameter exceeds the preset safe range, lower the current safe upper-limit TPS;
Step S306: record the safe upper-limit TPS after it has been raised or lowered;
Step S307: select the highest of the recorded safe upper-limit TPS values as the highest safe upper-limit TPS.
Steps S304 and S305 above adjust the limit according to the network environment while the number of concurrent threads is fixed at the maximum number of concurrent threads. Those skilled in the art can determine how long the adjustment takes according to the actual situation; for example, N minutes can be spent on the adjustment before step 101 of processing the pending crawl requests concurrently, in order to obtain the highest safe upper-limit TPS, where N is a natural number.
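The calibration loop in steps S301 to S307 could be sketched as below. Several things here are assumptions rather than details from the application: the multiplicative step sizes, the fixed number of rounds, and the callback `run_period`, which is expected to crawl for one period at the given TPS limit (with the thread count already at its maximum, as in S302) and report whether every crawl index parameter stayed within its safe range. Note also that, where the steps say to record the limit after each adjustment, this sketch records a limit only once a period has actually run safely at it, an interpretation chosen so that the result matches the definition of the highest safe upper-limit TPS as the highest TPS observed without leaving the safe range.

```python
def calibrate_highest_safe_tps(run_period, initial_limit=1000.0, rounds=5,
                               step_up=1.1, step_down=0.5):
    """Probe for the highest TPS limit at which the site stays within the safe range."""
    limit = initial_limit                  # S301: start from a preset large value
    verified_safe = []
    for _ in range(rounds):
        if run_period(limit):              # S303: crawl one period at this limit
            verified_safe.append(limit)    # S306 (interpreted): record a safe limit
            limit *= step_up               # S304: within range, so raise the limit
        else:
            limit *= step_down             # S305: out of range, so lower the limit
    # S307: the highest recorded value becomes the highest safe upper-limit TPS.
    return max(verified_safe) if verified_safe else None
```

Returning `None` when no period ran safely leaves it to the caller to decide whether to keep probing with more rounds or a smaller starting value.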
In short, when a current crawl index parameter exceeds the preset safe range, this application lowers the concurrency of the concurrent web page crawl, where the crawl index parameters measure the website's load state during the crawl. Lowering the concurrency means lowering the number of crawl requests, that is, the frequency at which the website is requested, which reduces the load on the website's server during the crawl. By lowering the concurrency, this application can therefore keep the crawl index parameters, and hence the website load, within the preset safe range. This avoids the situation in which a large number of concurrent requests slows the website's response or even crashes it, and so improves the website's response speed during concurrent web page crawling.
Referring to Fig. 2, a flowchart of Embodiment 2 of a method for concurrent web page crawling of this application is shown, which may specifically include:
Step 201: process pending crawl requests concurrently, and listen for the processing event messages corresponding to the processed crawl requests;
Step 202: analyze the processing event messages to obtain current crawl index parameters;
Step 203: when a current crawl index parameter exceeds the preset safe range, lower the concurrency of the concurrent web page crawl;
Step 204: when a current crawl index parameter is below the preset safe range, raise the concurrency of the concurrent web page crawl.
Compared with Embodiment 1, Embodiment 2 can raise the concurrency of the concurrent web page crawl when a current crawl index parameter is below the preset safe range. Raising the concurrency means raising the number of crawl requests, that is, the frequency at which the website is requested. By raising the concurrency, this application can therefore bring the crawl index parameters close to, but not beyond, the preset safe range; for example, the crawl speed can be brought close to an ideal value. This application can thus ensure both the website's response speed and the crawler's crawl speed.
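The two-way adjustment of Embodiment 2 could be sketched as a single decision function. The choice of average response time as the crawl index parameter, the two-sided range of 100 to 200 ms (the application's example only gives 200 ms), the fixed step, and the name `next_concurrency` are all assumptions made for illustration.

```python
def next_concurrency(current, avg_response_ms, safe_range_ms=(100, 200),
                     step=10, floor=1, ceiling=200):
    """Pick the concurrency for the next period from the current crawl index parameter."""
    low, high = safe_range_ms
    if avg_response_ms > high:
        # Step 203: beyond the safe range, so lower the concurrency.
        return max(floor, current - step)
    if avg_response_ms < low:
        # Step 204: below the safe range, so there is headroom to raise it.
        return min(ceiling, current + step)
    return current  # within the safe range: leave the concurrency unchanged
```

Called once per time period, this keeps the parameter near the top of its safe range without crossing it, which is the balance between website response speed and crawl speed that the embodiment aims for.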
Referring to Fig. 3, a flowchart of Embodiment 3 of the concurrent web page crawling method of the present application is shown, which may specifically include:
Step 301: when the current actual number of transactions processed per second (TPS) does not exceed the highest safe upper limit TPS, admit pending crawl requests for processing;
Step 302: concurrently process the admitted pending crawl requests, and monitor the processing event messages corresponding to the processed crawl requests;
Step 303: analyze the processing event messages to obtain the current crawl index parameter;
Step 304: when the current crawl index parameter exceeds the preset safe range, lower the concurrency level of concurrent web page crawling.
Compared with Embodiment 1, Embodiment 3 admits pending crawl requests for processing only when the current actual TPS does not exceed the highest safe upper limit TPS. This admission mechanism refuses to process pending crawl requests that would exceed the highest safe upper limit TPS, so the current actual TPS can be kept strictly within the highest safe upper limit TPS. Compared with Embodiment 1, the website load can be controlled further, which further improves the response speed of the website during concurrent web page crawling.
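One way to picture the admission mechanism of step 301 is a sliding-window counter. The sketch below is an assumption-laden illustration: the class name `TpsGate`, the one-second window, and the choice to pass the clock value in explicitly (so the behaviour is deterministic) are all introduced here and are not taken from the patent text.

```python
from collections import deque


class TpsGate:
    """Hypothetical admission gate that keeps the actual TPS within a safe limit."""

    def __init__(self, max_safe_tps: float, window: float = 1.0):
        self.max_safe_tps = max_safe_tps
        self.window = window      # length of the measuring window in seconds
        self._stamps = deque()    # admission times still inside the window

    def current_tps(self, now: float) -> float:
        """Drop admissions older than the window and return the actual TPS."""
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        return len(self._stamps) / self.window

    def try_admit(self, now: float) -> bool:
        """Admit a request only if doing so keeps the TPS within the limit."""
        self.current_tps(now)     # prune expired admissions first
        if (len(self._stamps) + 1) / self.window > self.max_safe_tps:
            return False          # refused: the request would exceed the limit
        self._stamps.append(now)
        return True
```

The gate checks the TPS that would result from admitting the request rather than the TPS before it, which is one reading of keeping the actual TPS "strictly within" the limit. A refused request would simply stay in the pending queue and be retried later.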
In practical applications, the highest safe upper limit TPS may be obtained before step 301 according to the flow of steps S301 to S307 described above.
It will be appreciated that, as a preferred embodiment, combining Embodiment 2 with Embodiment 3 is also feasible; that is, the method flow of Embodiment 2 may further include step 301 and step 302. The present application does not limit the specific combination of embodiments.
Corresponding to the foregoing method embodiments, the present application also discloses a concurrent web page crawling system. Referring to the structural diagram shown in Fig. 4, the system may specifically include:
a request processing module 401, configured to concurrently process pending crawl requests;
a message monitoring module 402, configured to monitor the processing event messages corresponding to the processed crawl requests;
a message analysis module 403, configured to analyze the processing event messages to obtain the current crawl index parameter; and
a concurrency lowering module 404, configured to lower the concurrency level of concurrent web page crawling when the current crawl index parameter exceeds the preset safe range.
In a preferred embodiment of the present application, the system may further include: a concurrency raising module, configured to raise the concurrency level of concurrent web page crawling when the current crawl index parameter is below the preset safe range.
In a preferred embodiment of the present application, the message analysis module 403 may specifically include:
a message acquisition submodule, configured to obtain the processing event messages corresponding to the crawl requests processed in each time period; and
a message analysis submodule, configured to analyze independently the processing event messages corresponding to the crawl requests processed in the current time period, and/or to compare the processing event messages corresponding to the crawl requests processed in adjacent time periods, so as to obtain the current crawl index parameter.
In another preferred embodiment of the present application, the system may further include:
an admission module, configured to admit pending crawl requests for processing when the current actual number of transactions processed per second (TPS) does not exceed the highest safe upper limit TPS, before the operation of concurrently processing the pending crawl requests;
in which case the request processing module 401 may be specifically configured to concurrently process the admitted pending crawl requests.
In the embodiments of the present application, preferably, the crawl index parameter may specifically include one or more of a website response index parameter and a network status parameter.
In a preferred embodiment of the present application, the website response index parameter may specifically include one or more of a response time parameter and a response ratio parameter;
wherein the response time parameter may indicate the website's response time to the processed crawl requests, and the response ratio parameter may indicate, for each time period, the proportion of processed crawl requests whose response time falls within the preset safe range among all crawl requests processed in that time period.
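The response ratio parameter can be computed per time period as shown below. This is a small sketch under stated assumptions: the function name `response_ratio`, the representation of the safe range as a single upper bound `safe_max` in seconds, and the convention that an empty period counts as fully within range are choices made here for illustration.

```python
from typing import Sequence


def response_ratio(response_times: Sequence[float], safe_max: float) -> float:
    """Share of requests in one period whose response time is within the safe range."""
    if not response_times:
        # Assumption: a period with no processed requests is treated as healthy.
        return 1.0
    within = sum(1 for t in response_times if t <= safe_max)
    return within / len(response_times)
```

For example, with response times of 0.2 s, 0.4 s, 1.5 s and 0.3 s in one period and a safe upper bound of 1.0 s, three of the four requests qualify.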
In another preferred embodiment of the present application, the network status parameter between the website and the crawler may specifically include one or more of an error count parameter, an error rate parameter, a crawl speed parameter, and a crawl speed ratio parameter;
wherein the error count parameter may indicate the number of processed crawl requests that ended in an abnormal error, the error rate parameter may indicate the rate of increase of the error count parameter in the current time period relative to the error count parameter in the previous time period, and the crawl speed ratio parameter may indicate the proportion that is less than or equal to the current crawl speed parameter.
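The error rate parameter is a comparison between adjacent time periods, which can be illustrated as follows. The function name `error_growth` and the handling of a previous period with zero errors are assumptions made for this sketch; the text above does not say how that edge case is treated.

```python
def error_growth(prev_errors: int, curr_errors: int) -> float:
    """Relative increase of the error count versus the previous time period."""
    if prev_errors == 0:
        # Assumption: any error after an error-free period counts as unbounded growth.
        return float("inf") if curr_errors > 0 else 0.0
    return (curr_errors - prev_errors) / prev_errors
```

With 4 errors in the previous period and 6 in the current one, the increase is 50 percent; a negative value means the error count fell.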
In yet another preferred embodiment of the present application, the concurrency lowering module 404 may specifically include:
a first lowering submodule, configured to lower the number of concurrent threads used for concurrent crawl processing; and/or
a second lowering submodule, configured to lower the TPS of concurrent crawl processing.
In the embodiments of the present application, preferably, the second lowering submodule may be specifically configured to lower the TPS of concurrent crawl processing according to the difference between the highest safe upper limit TPS and the current TPS; wherein the highest safe upper limit TPS indicates the historical highest TPS at which the crawl index parameter did not exceed the preset safe range.
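The text does not give a formula for lowering the TPS from this difference, so the following is only one hypothetical rule. It assumes that the further the current TPS sits from the historical safe maximum while the safe range is still being violated, the larger the cut should be. The name `lowered_tps` and the parameters `factor`, `min_step` and `floor` are all invented for this sketch.

```python
def lowered_tps(current_tps: float, max_safe_tps: float,
                factor: float = 0.5, min_step: float = 1.0,
                floor: float = 1.0) -> float:
    """Hypothetical rule: cut the TPS by a share of its gap to the safe maximum."""
    gap = abs(max_safe_tps - current_tps)
    # Always reduce by at least min_step so the TPS moves even when the gap is zero.
    step = max(min_step, factor * gap)
    # Never go below the floor, so crawling does not stall completely.
    return max(floor, current_tps - step)
```

Other rules, such as a fixed multiplicative back-off, would satisfy the same description equally well.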
In a preferred embodiment of the present application, the system may further include: an upper limit TPS acquisition module, configured to obtain the highest safe upper limit TPS before the operation of concurrently processing the pending crawl requests;
The upper limit TPS acquisition module may specifically include:
a setting submodule, configured to set the initial value of the safe upper limit TPS to a preset large value;
a step-up submodule, configured to increase step by step the number of concurrent threads used for concurrent crawl processing until it reaches the maximum number of concurrent threads;
a concurrent processing submodule, configured to concurrently process pending crawl requests according to the current safe upper limit TPS;
a raising submodule, configured to raise the current safe upper limit TPS when the current crawl index parameter does not exceed the preset safe range;
a lowering submodule, configured to lower the current safe upper limit TPS when the current crawl index parameter exceeds the preset safe range;
a recording submodule, configured to record the safe upper limit TPS after it is raised or lowered; and
a selection submodule, configured to select the highest TPS among the recorded safe upper limit TPS values as the highest safe upper limit TPS.
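The calibration performed by these submodules can be sketched as a probe loop. Several simplifications are assumed here: the thread ramp-up is omitted, the website is stood in for by a hypothetical `probe` callback that reports whether the crawl index parameter stayed within the safe range at a given limit, and a limit is recorded only after a probe at that limit succeeded. The names `calibrate_max_safe_tps`, `raise_factor` and `lower_factor` are illustrative.

```python
from typing import Callable, List


def calibrate_max_safe_tps(probe: Callable[[float], bool],
                           initial_limit: float = 1000.0,
                           raise_factor: float = 1.1,
                           lower_factor: float = 0.5,
                           rounds: int = 20) -> float:
    """Search for the highest TPS limit at which the probe stays within the safe range."""
    limit = initial_limit                 # setting submodule: start from a large value
    safe_limits: List[float] = []
    for _ in range(rounds):
        if probe(limit):                  # parameter stayed within the safe range
            safe_limits.append(limit)     # recording submodule (safe limits only)
            limit *= raise_factor         # raising submodule
        else:
            limit *= lower_factor         # lowering submodule
    # Selection submodule: highest recorded safe limit, or the last limit if none was safe.
    return max(safe_limits) if safe_limits else limit
```

With a simulated site that stays healthy up to 120 TPS, the loop first halves the limit from 1000 down to 62.5 and then climbs in 10 percent steps until it overshoots, so the result lands between 100 and 120.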
To help those skilled in the art better understand the present application, reference is made to Fig. 5, which shows a structural schematic diagram of a concurrent web page crawling system of the present application. The system may specifically include an initialization module 501, a request analysis module 502, a request processing module 503, a crawler module 504, a message monitoring module 505, a message analysis module 506, an admission judgment module 507, a concurrency lowering module 508, and a termination module 509. The corresponding crawl flow may specifically include:
Step S1: the initialization module 501 reads the crawl configuration information from a configuration file. The configuration information may specifically include the crawl entry point of the website (such as a URL), the preset safe range information corresponding to each crawl index parameter, and so on. The initialization module 501 submits the crawl entry point to the request analysis module 502 as a pending crawl request;
Step S2: the request analysis module 502 receives the pending crawl request submitted by the initialization module 501, or obtains a pending crawl request from the pending request queue;
Step S3: the request analysis module 502 analyzes whether the current pending crawl request has already been processed; if so, it discards the request; otherwise, it submits the pending crawl request to the request processing module 503;
Step S4: upon obtaining an unprocessed crawl request, the request processing module 503 sends a processing permission request to the admission judgment module 507;
Step S5: upon receiving the processing permission request, the admission judgment module 507 judges whether the current actual number of transactions processed per second (TPS) has exceeded the highest safe upper limit TPS; if not, it returns admission information to the request processing module 503;
Step S6: according to the admission information returned by the admission judgment module 507, the request processing module 503 submits the pending crawl request to the crawler module 504;
Step S7: the crawler module 504 concurrently processes the pending crawl requests, generates new pending crawl requests during concurrent processing and saves them to the pending request queue, and sends processing result event messages to the message monitoring module 505;
Step S8: the message monitoring module 505 monitors the processing event messages, sent by the crawler module 504, that correspond to the processed crawl requests;
Step S9: the message analysis module 506 analyzes the monitored processing event messages to obtain the current crawl index parameter;
Step S10: the concurrency lowering module 508 lowers the concurrency level of concurrent web page crawling when the current crawl index parameter exceeds the preset safe range;
Step S11: the request analysis module 502 judges whether the pending request queue is empty; if so, it sends a crawl-end event message to the message monitoring module 505;
Step S12: the message monitoring module 505 sends the monitored crawl-end event message to the termination module 509;
Step S13: the termination module 509 terminates the crawl flow according to the crawl-end event message.
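The queue handling in steps S1 to S7 and S11 can be sketched as a sequential loop. This illustration leaves out the concurrency, the event messages of steps S8 to S10, and the termination messages of steps S12 and S13. The function `crawl`, the `fetch` callback that returns the links found on a page, and the `admit` callback standing in for the admission judgment module are hypothetical names introduced here.

```python
from collections import deque
from typing import Callable, Iterable, List, Set


def crawl(entry_url: str,
          fetch: Callable[[str], Iterable[str]],
          admit: Callable[[], bool] = lambda: True) -> List[str]:
    """Process URLs from a pending queue, skipping those already handled."""
    pending = deque([entry_url])      # S1: the entry point is the first pending request
    processed: Set[str] = set()
    order: List[str] = []
    while pending:                    # S11: the crawl ends when the queue is empty
        url = pending.popleft()       # S2: take the next pending request
        if url in processed:          # S3: discard requests that were already processed
            continue
        if not admit():               # S4-S5: ask for permission before processing
            pending.append(url)       # refused: retry later (loops forever if never admitted)
            continue
        processed.add(url)
        order.append(url)
        for link in fetch(url):       # S6-S7: processing yields new pending requests
            pending.append(link)
    return order
```

A real crawler would replace `fetch` with an HTTP client and run several workers against the same queue; the deduplication and the empty-queue termination condition stay the same.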
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the system embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
The embodiments of the present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The embodiments of the present invention are preferably applied in embedded systems.
The embodiments of the present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The embodiments of the present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices. In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The concurrent web page crawling method and system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementations and the scope of application may change according to the idea of the present application. In conclusion, the content of this specification should not be construed as limiting the present application.

Claims (18)

CN201310575226.XA | 2013-11-15 | 2013-11-15 | A kind of concurrent grasping means of webpage and system | Active | CN104657355B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201310575226.XA, CN104657355B (en) | 2013-11-15 | 2013-11-15 | A kind of concurrent grasping means of webpage and system

Publications (2)

Publication Number | Publication Date
CN104657355A (en) | 2015-05-27
CN104657355B (en) | 2018-10-23

Family

ID=53248504

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201310575226.XA, Active, CN104657355B (en) | A kind of concurrent grasping means of webpage and system | 2013-11-15 | 2013-11-15

Country Status (1)

Country | Link
CN (1) | CN104657355B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112866354B (en) * | 2015-12-24 | 2023-08-11 | 创新先进技术有限公司 | Resource packaging method and device and asset packaging method
CN106059849B (en) * | 2016-05-09 | 2019-10-22 | 上海斐讯数据通信技术有限公司 | A kind of automatic trigger packet snapping system and method
CN108632325A (en) * | 2017-03-24 | 2018-10-09 | 中国移动通信集团浙江有限公司 | A kind of call method and device of application

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US6961341B1 (en) * | 1996-07-02 | 2005-11-01 | Microsoft Corporation | Adaptive bandwidth throttling for network services
CN101719377A (en) * | 2009-11-24 | 2010-06-02 | 成都市华为赛门铁克科技有限公司 | Method and device for controlling power consumption
CN102811258A (en) * | 2012-07-27 | 2012-12-05 | 北京星网锐捷网络技术有限公司 | Data parallel-downloading method, apparatus and network device
CN102868573A (en) * | 2012-09-12 | 2013-01-09 | 北京航空航天大学 | Method and device for Web service load cloud test

Also Published As

Publication number | Publication date
CN104657355A (en) | 2015-05-27

Similar Documents

Publication | Title
CN104410671B (en) | A kind of snapshot grasping means and data supervising device
CN104657355B (en) | A kind of concurrent grasping means of webpage and system
CN108376112A (en) | Method for testing pressure, device and readable medium
CN104765689B (en) | A kind of interface capability data supervise method and apparatus in real time
JP2010517198A (en) | Distributed task system and distributed task management method
CN108134830A (en) | Load balancing method, system, device and storage medium based on message queue
CN109873853A (en) | Equipment key parameter early warning system and its implementation, electronic device
CN111124830A (en) | Monitoring method and device for micro-service
CN114283007B (en) | Method and device for solving payment hotspot account problem and electronic equipment
CN104270391B (en) | A kind of processing method and processing device of access request
CN109933501B (en) | Capacity evaluation method and device of application system
CN106385341A (en) | Thread monitoring method and system of client
CN104699529B (en) | A kind of information acquisition method and device
CN110175278A (en) | The detection method and device of web crawlers
CN107154968A (en) | A kind of data processing method and equipment
CN106649342A (en) | Data processing method and apparatus in data acquisition platform
CN107579864A (en) | Ask monitoring method, device and server
JP2015503787A (en) | Scenario-based patrol method, system, and computer program
US9183042B2 (en) | Input/output traffic backpressure prediction
CN105760284A (en) | Website performance monitoring method and device
CN116842298B (en) | Data read and write management method, device, storage medium and electronic device
CN106612261A (en) | Website data obtaining method, devices and system
CN105468636B (en) | A kind of picture loading method of dynamic web page, device and system
CN109509560A (en) | A kind of right management method, device, server and medium
CN112612707B (en) | Method and device for running test script, equipment and computer readable storage medium

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
GR01 | Patent grant
