Disclosure of Invention
In view of the above, it is necessary to provide a web crawler processing method, a web crawler processing device, a server, and a storage medium.
A web crawler handling method, the method comprising:
analyzing the weblog data to obtain preset fields in the weblog data;
determining the category of the web crawler to which the weblog data belongs through the preset field;
and updating an address list of the web crawler according to the category of the web crawler to which the web log data belongs, wherein the address list of the web crawler is used for handling the network access request.
In one embodiment, the analyzing the weblog data to determine the preset field in the weblog data includes:
preprocessing the network access request data to obtain the weblog data;
and analyzing the weblog data according to a function in a time sequence database, and determining the preset field.
In one embodiment, the preprocessing the network access request data to obtain the weblog data includes:
Acquiring the network access request data from a memory; the network access request data comprises an Nginx variable;
and screening relevant variables of the web crawler to be treated from the Nginx variables according to the web crawler demand, and determining the relevant variables as the weblog data.
In one embodiment, the determining, through the preset field, a web crawler category to which the weblog data belongs includes:
determining a user agent of the initial search engine crawler according to the preset field;
and determining the class of the web crawler to which the weblog data belongs according to the user agent.
In one embodiment, the determining, according to the user agent, a web crawler category to which the weblog data belongs includes:
acquiring an IP address list corresponding to the user agent from the weblog data; the IP address list comprises a plurality of first IP addresses;
and determining the category of the web crawler to which the weblog data belongs according to the first IP address and the second IP address corresponding to the initial search engine crawler on the website.
In one embodiment, the determining, according to the second IP address and the first address corresponding to the initial search engine crawler at the website, a web crawler category to which the weblog data belongs includes:
comparing the first IP address with the corresponding second IP address;
if the first IP address is the same as the corresponding second IP address, determining that the weblog data is a target search engine crawler;
and if the first IP address is different from the corresponding second IP address, determining that the weblog data is a malicious web crawler.
In one embodiment, the method further comprises:
And if the first IP address is different from the corresponding second IP address, deleting the first IP address from the IP address list.
In one embodiment, the determining, according to the user agent, a web crawler category to which the weblog data belongs includes:
And if the user agent contains programming language content, determining the weblog data as a malicious webcrawler.
In one embodiment, the determining, according to the user agent, a web crawler category to which the weblog data belongs includes:
According to the function, acquiring a web crawler request characteristic from the preset field, wherein the web crawler request characteristic comprises access frequency and/or access abnormality information;
And determining the category of the web crawler to which the web log data belongs according to the web crawler request characteristics.
In one embodiment, if the web crawler type to which the weblog data belongs is a malicious web crawler, updating the address list of the web crawler includes:
If the third IP address in the malicious web crawler gray list is an updated IP address in a preset time and the access frequency of the weblog data corresponding to the updated IP address is greater than a preset frequency threshold, updating the address list according to a user instruction.
In one embodiment, the address list includes a malicious web crawler gray list and a malicious web crawler black list, and updating the address list according to a user instruction includes:
If the user instruction indicates that the weblog data are normal access request data, deleting the updated IP address in the malicious web crawler gray list;
And if the user instruction indicates that the network access request data is a malicious web crawler, adding the updated IP address to the malicious web crawler blacklist.
A web crawler handling device, the device comprising:
the analysis module is used for analyzing the weblog data and acquiring preset fields in the weblog data;
The crawler type determining module is used for determining the type of the web crawler to which the web log data belong through the preset field;
and the address list updating module is used for updating the address list of the web crawler according to the category of the web crawler to which the web log data belongs, wherein the address list of the web crawler is used for handling the network access request.
A server comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
analyzing the weblog data to obtain preset fields in the weblog data;
determining the category of the web crawler to which the weblog data belongs through the preset field;
and updating an address list of the web crawler according to the category of the web crawler to which the web log data belongs, wherein the address list of the web crawler is used for handling the network access request.
A storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
analyzing the weblog data to obtain preset fields in the weblog data;
determining the category of the web crawler to which the weblog data belongs through the preset field;
and updating an address list of the web crawler according to the category of the web crawler to which the web log data belongs, wherein the address list of the web crawler is used for handling the network access request.
According to the web crawler handling method, device, server, and storage medium described above, the server analyzes the weblog data to obtain the preset field in the weblog data, determines through the preset field the web crawler category to which the weblog data belongs, and updates the address list of the web crawler according to that category. Because the web crawler category is determined first and the address list is updated according to it, web crawlers are not handled uniformly without classification, which improves the accuracy of address list updates and, in turn, the handling effect.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The web crawler handling method provided by the application can be applied to the server shown in fig. 1. The server may be a Web server such as Apache, Zeus, or IIS. Since Nginx is an open source Web server known for high performance and high concurrency, and enterprises and individuals commonly use it for reverse proxying, load balancing, HTTP caching, Web development, and the like, in this embodiment the server may be an Nginx server. As shown in fig. 1, the server includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the server is configured to provide computing and control capabilities. The memory of the server includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The database of the server is used to store the weblog data. The network interface of the server is used to communicate with an external endpoint via a network connection. The computer program, when executed by the processor, implements a web crawler handling method.
It will be appreciated by those skilled in the art that the architecture shown in fig. 1 is merely a block diagram of some of the architecture associated with the inventive arrangements and is not limiting as to the servers to which the inventive arrangements are applied, and that a particular server may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, as shown in fig. 2, a web crawler processing method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
S100, analyzing the weblog data to obtain preset fields in the weblog data.
Specifically, the server may perform comparison analysis, classification analysis, and/or logic tree analysis on the obtained weblog data, so as to obtain a preset field in the weblog data. The weblog data may be history log data or current log data. Optionally, the weblog data may be data related to a web crawler to be handled, obtained by screening the network access request. The preset field may be represented by any combination of characters, which is not limited here. The weblog data may include at least one preset field.
In addition, after the server obtains the preset fields in the weblog data, the total number of the preset fields can be counted. If multiple preset fields are included, the total number of different preset fields may be the same or different.
S200, determining the category of the web crawler to which the weblog data belongs through a preset field.
Specifically, the server may determine, through a preset field and/or a total number of corresponding preset fields, a web crawler category to which the weblog data belongs. In this embodiment, the weblog data may be a search engine crawler or a malicious webcrawler.
And S300, updating an address list of the web crawler according to the category of the web crawler to which the web log data belongs, wherein the address list of the web crawler is used for handling the network access request.
Specifically, the address list of the web crawler may include an IP address of the search engine crawler and/or an IP address of a malicious web crawler. The server can update the IP addresses in the address list of the webcrawler according to the webcrawler category to which the weblog data belongs, and can process the historical network access requests and/or the current network access requests in real time through the updated address list. The above-described handling may be understood as a process of updating an address list according to the web crawler category to which the web access request corresponds.
It is understood that the above network access request may include a normal access request, a crawler access request, and an invalid access request. The crawler access request may be a network access request initiated at irregular intervals to crawl the web page information of major websites; how the pages are crawled may affect the ranking of those websites in search engines, and the websites may be major e-commerce websites such as Taobao, JD.com, Pinduoduo, and Amazon. The invalid access request may be a network access request that comes from neither a user nor a crawler; it is often bursty and has no value to the websites.
In this embodiment, the acquisition, storage, use, and processing of the weblog data all comply with relevant national laws and regulations.
According to the method for disposing the web crawler, the server can analyze the web log data to obtain the preset field in the web log data, the type of the web crawler to which the web log data belongs is determined through the preset field, and the address list of the web crawler is updated according to the type of the web crawler to which the web log data belongs. Meanwhile, the method can directly process the data related to the web crawlers to be treated, which are obtained after screening the network access request, so as to achieve the purpose of updating the address list of the web crawlers, greatly improve the accuracy of updating the address list and further improve the treatment effect.
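Steps S100 to S300 can be pictured with a minimal Python sketch. It is illustrative only: the record layout, the category labels, and the names `handle_weblog` and `toy_classifier` are assumptions introduced here and do not come from the application, and the classification rule is deliberately simplistic.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical category labels; the description names only these two kinds.
SEARCH_ENGINE = "search_engine_crawler"
MALICIOUS = "malicious_web_crawler"


def handle_weblog(
    records: Iterable[Dict[str, str]],
    classify: Callable[[Dict[str, str]], str],
    address_list: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Read preset fields (S100), classify (S200), update the address list (S300)."""
    for record in records:
        # S100: the preset fields are assumed to be plain dictionary keys.
        fields = {"ip": record["ip"], "user_agent": record["user_agent"]}
        # S200: classification is delegated to a pluggable rule.
        category = classify(fields)
        # S300: the IP address is filed under its category.
        address_list.setdefault(category, set()).add(fields["ip"])
    return address_list


def toy_classifier(fields: Dict[str, str]) -> str:
    """Placeholder rule used only to exercise the pipeline."""
    if "curl" in fields["user_agent"].lower():
        return MALICIOUS
    return SEARCH_ENGINE
```

Passing the classifier as a parameter mirrors the way later embodiments swap in different identification modes without changing the update step.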
As an embodiment, as shown in fig. 3, the step of analyzing the weblog data in S100 to determine the preset field in the weblog data may be implemented specifically by the following steps:
S110, preprocessing the network access request data to obtain the weblog data.
Specifically, the preprocessing may be normalization, denoising, standardization, and/or comparison. The preprocessing may be understood as a process of formatting the network access request data, or as a process of screening the weblog data from the network access request data. In this embodiment, the preprocessing may configure a log format for the network access request data to obtain the weblog data. The server can preprocess the network access request data using the third-party module lua-nginx-module to obtain the weblog data. In this embodiment, the weblog data may be Nginx access log data, and it uses Nginx variables whose names are defined by Nginx.
In addition, the server can transmit the weblog data directly to a database or a memory for storage by means of lua-resty-logger-socket, so the weblog data does not need to be stored locally on the server. Other intermediate processing links (such as reading the weblog data and performing string operations) can therefore be omitted, which saves storage time and improves data storage performance. Moreover, even if storing the weblog data fails, the response of the server to the network access request is not affected.
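The property that a failed log write must not affect the response can be sketched in Python as follows. This does not reproduce the Lua module named above; `ship_log` and the injected `send` callable are hypothetical names used only to show the best-effort pattern.

```python
import json
from typing import Callable, Dict


def ship_log(record: Dict[str, str], send: Callable[[bytes], None]) -> bool:
    """Try to forward one log record; never raise to the caller."""
    payload = json.dumps(record).encode("utf-8")
    try:
        send(payload)
        return True
    except OSError:
        # A failed write is dropped so that request handling is unaffected.
        return False
```

In a real deployment `send` might wrap a socket write; here it is left abstract so that the failure path can be exercised without a network.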
As shown in fig. 4, the step of preprocessing the network access request data to obtain the weblog data in S110 may specifically include the following steps:
S111, acquiring network access request data from a memory; the network access request data includes an Nginx variable.
It can be understood that the memory may be the Nginx server memory, which is the memory in use after the Nginx process is started. The server may obtain the Nginx variable from the Nginx server memory.
S112, screening relevant variables of the web crawlers to be handled from Nginx variables according to the requirements of the web crawlers, and determining the relevant variables as web log data.
It will also be appreciated that the web crawler requirements may be the logical requirements of web crawler handling, i.e., a method of identifying web crawler categories. The server can process the Nginx variables according to the web crawler requirements and screen out the variables relevant to analyzing and handling web crawlers, namely the relevant variables of the web crawler to be handled. Before the Nginx variables are processed, they may be given a unified format. The relevant variables of the web crawler to be handled may be the user agent header information, the IP address of the requesting client, the request URL, and the like.
According to the embodiment, the network access request data can be obtained from the memory, the relevant variables of the web crawlers to be treated are screened from the Nginx variables according to the requirements of the web crawlers, and the relevant variables are determined to be the weblog data, so that the variables relevant to the web crawlers can be directly processed, the problems existing in processing the mixed variables relevant to the non-web crawlers and the web crawlers are avoided, the effect of the treatment of the web crawlers is improved, and the complexity of the treatment of the web crawlers can be reduced.
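The screening in S112 can be sketched in Python as a simple projection over a dictionary of variables. The chosen variable names (`http_user_agent`, `remote_addr`, `request_uri`) are an assumed mapping of the user agent, client IP address, and request URL mentioned above, and `screen_variables` is a hypothetical helper.

```python
from typing import Dict, Iterable

# Assumed variable names for the user agent, client IP address, and request URL.
CRAWLER_VARIABLES = ("http_user_agent", "remote_addr", "request_uri")


def screen_variables(
    nginx_vars: Dict[str, str],
    wanted: Iterable[str] = CRAWLER_VARIABLES,
) -> Dict[str, str]:
    """Keep only the variables relevant to web crawler handling."""
    return {name: nginx_vars[name] for name in wanted if name in nginx_vars}
```

Variables unrelated to crawler identification are dropped, so later steps operate on a smaller, uniform record.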
S120, analyzing the weblog data according to the function in the time sequence database, and determining a preset field.
Specifically, since the analysis of weblog data is mostly based on the time dimension, the weblog data may be analyzed using functions in a time series database. The time series database may be a TimescaleDB database, a KairosDB database, a date database, a Kudu database, etc. In this embodiment, the time series database may be InfluxDB, an open source distributed time series, event, and metrics database developed in the Go language, whose functions can considerably simplify the processing of weblog data. The functions in InfluxDB may include the DISTINCT function, the MEAN function, the MEDIAN function, the COUNT function, etc., where DISTINCT returns the unique values of a field, MEAN returns the arithmetic mean of the values in a field, MEDIAN returns the middle value of the sorted values in a field, and COUNT returns the number of non-null values in a field.
According to the web crawler processing method, web access request data can be preprocessed to obtain web log data, the web log data are analyzed according to the function in the time sequence database, the preset field is determined, and further web crawler processing is achieved by updating the address list of the web crawler through the preset field.
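The behavior of the four functions named in S120 can be imitated in plain Python to show what each one returns. The sketch below does not call any database API; the helper names and the record layout are assumptions made for illustration.

```python
from statistics import mean, median
from typing import Any, Dict, List


def field_values(records: List[Dict[str, Any]], field: str) -> List[Any]:
    """Non-null values of one field, in record order."""
    return [r[field] for r in records if r.get(field) is not None]


def distinct(records: List[Dict[str, Any]], field: str) -> List[Any]:
    """Unique values of a field, keeping first-seen order."""
    seen: List[Any] = []
    for value in field_values(records, field):
        if value not in seen:
            seen.append(value)
    return seen


def count(records: List[Dict[str, Any]], field: str) -> int:
    """Number of non-null values in a field."""
    return len(field_values(records, field))


def field_mean(records: List[Dict[str, Any]], field: str) -> float:
    """Arithmetic mean of the non-null values in a field."""
    return mean(field_values(records, field))


def field_median(records: List[Dict[str, Any]], field: str) -> float:
    """Middle value of the sorted non-null values in a field."""
    return median(field_values(records, field))
```

For crawler handling, `distinct` over the client IP field and `count` per IP address are the natural building blocks for a request amount per address.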
As an embodiment, as shown in fig. 5, the step of determining, in the above step S200, the category of the web crawler to which the web log data belongs through the preset field may be implemented by the following steps:
S210, determining the user agent of the initial search engine crawler according to the preset field.
Specifically, each major legitimate search engine website lists its user agents; for example, the user agents corresponding to Sogou include Sogou web spider/4.0 and Sogou INST SPIDER/4.0. The server can therefore determine the user agent corresponding to the initial search engine crawler according to a preset field. Different preset fields and different totals of preset fields may correspond to different user agents. Each user agent corresponds to a search engine crawler or a malicious web crawler; in this embodiment, the user agents of the custom search engine crawlers may be determined first. The custom search engine crawler may be referred to as an initial search engine crawler. In addition, the request amount of an IP address may be determined according to the preset field and the total number of the preset fields.
S220, determining the category of the web crawler to which the weblog data belongs according to the user agent.
It will be appreciated that the user agent may be counterfeited or tampered with in performing the web crawler handling method by the server. Thus, the server may verify the user agent determined to be the initial search engine crawler to determine the web crawler category to which the web log data corresponding to the user agent pertains.
If the user agent is not forged or tampered, the weblog data corresponding to the user agent can be a search engine crawler; if the user agent is forged or tampered with, the weblog data corresponding to the user agent may be a malicious web crawler.
According to the web crawler disposal method, the category of the web crawler to which the web log data belongs can be determined through the preset field, the address list of the web crawler is further updated according to the category of the web crawler to which the web log data belongs, the problem that the web crawler is uniformly disposed under the condition that the web crawler is not classified is avoided, the accuracy of updating the address list is improved, and the disposal effect is further improved.
As an embodiment, as shown in fig. 6, the step of determining, according to the user agent, the web crawler class to which the weblog data belongs in S220 may include the following steps:
S221, acquiring an IP address list corresponding to the user agent from the weblog data; the IP address list includes a plurality of first IP addresses.
Specifically, the server can automatically acquire the IP addresses corresponding to different user agents from the weblog data by calling a functional interface, and form an IP address list from all the acquired IP addresses. The IP address list may store the first IP addresses corresponding to a plurality of initial search engine crawlers, together with the initial search engine crawler corresponding to each first IP address. The first IP addresses may also be forged or tampered with while the server executes the web crawler handling method.
S222, determining the category of the web crawler to which the weblog data belongs according to a second IP address and a first address corresponding to the website of the initial search engine crawler.
Specifically, each major legitimate search engine website lists the IP addresses from which its search engine crawler may access websites, such as Sogou, and the corresponding IP addresses are found at http:// hellp. Thus, the server may use a Linux network command to query the IP address that the initial search engine crawler has published on the legitimate website, i.e., the second IP address. In this embodiment, the Linux network command may be the nslookup command. Further, to confirm a second time whether the IP address of the initial search engine crawler in the IP address list is an IP address of the legitimate search engine crawler, the server may compare, analyze, superimpose, and/or otherwise operate on the first IP address and the second IP address corresponding to the initial search engine crawler on the website, to determine the web crawler category to which the weblog data belongs.
For example, in the execution code corresponding to the web crawler handling method, analyzing the weblog data may yield the user agent Sogou web spider/4.0 and the IP addresses of two corresponding clients, 220.181.125.106 and 24.125.7.34. After both IP addresses are queried with the nslookup command, 24.125.7.34 may be determined not to be a Sogou crawler, and that IP address then needs to be deleted from the IP address list.
According to the web crawler disposal method, the IP address list corresponding to the user agent can be obtained from the web log data, the web crawler category to which the web log data belongs is determined according to the second IP address and the first address corresponding to the web site by the initial search engine crawler, the address list of the web crawler is further updated according to the web crawler category to which the web log data belongs, the problem that the web crawler is uniformly disposed under the condition of no classification is avoided, the accuracy of updating the address list is improved, and the disposal effect is further improved.
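One possible way to obtain a second IP address for comparison is sketched below in Python. It swaps in a forward-confirmed reverse DNS check (reverse lookup of the first IP address, then forward lookup of the returned host name) in place of the nslookup command; this substitution, the function name `lookup_second_ip`, and the injectable `reverse` and `forward` parameters are all assumptions of the sketch.

```python
import socket
from typing import Callable, Optional


def _reverse(ip: str) -> str:
    """Reverse lookup: IP address to host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> str:
    """Forward lookup: host name to IP address."""
    return socket.gethostbyname(hostname)


def lookup_second_ip(
    first_ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], str] = _forward,
) -> Optional[str]:
    """Return the address the reverse-resolved host name maps back to, or None."""
    try:
        hostname = reverse(first_ip)
        return forward(hostname)
    except OSError:
        # Lookup failures (no PTR record, unknown host) yield no second address.
        return None
```

Injecting the two lookups keeps the function testable offline; a production check would typically also verify that the host name belongs to the search engine's domain.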
As an embodiment, as shown in fig. 7, in S222, the step of determining, according to the second IP address and the first address corresponding to the web site by the initial search engine crawler, the category of the web crawler to which the web log data belongs may be specifically implemented by the following steps:
S2221, comparing the first IP address with the corresponding second IP address.
Specifically, an IP address may consist of a network number and a host number; that is, each IP address includes two identifiers (IDs), a network ID and a host ID, and is represented as a character string. The server may first compare one part (the network ID or the host ID) of the first IP address with the same part of the corresponding second IP address. If the two are equal, the server then compares the remaining part; if they are not equal, the remaining part need not be compared and the comparison ends, yielding the comparison result. Alternatively, the server may compare the characters of the first IP address and the corresponding second IP address position by position in order, ending the comparison at the first unequal character, to obtain the comparison result.
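The staged comparison can be sketched in Python with the standard `ipaddress` module. The fixed prefix length of 24 bits used to separate the network ID from the host ID is an assumption of this sketch, as are the names `split_ip` and `same_ip`.

```python
import ipaddress
from typing import Tuple


def split_ip(ip: str, prefix_len: int = 24) -> Tuple[int, int]:
    """Split an IPv4 address into (network ID, host ID) for an assumed prefix length."""
    value = int(ipaddress.IPv4Address(ip))
    host_bits = 32 - prefix_len
    return value >> host_bits, value & ((1 << host_bits) - 1)


def same_ip(first_ip: str, second_ip: str, prefix_len: int = 24) -> bool:
    """Compare network IDs first and host IDs only when the network IDs match."""
    first_net, first_host = split_ip(first_ip, prefix_len)
    second_net, second_host = split_ip(second_ip, prefix_len)
    if first_net != second_net:
        return False  # the comparison ends early; host IDs are not examined
    return first_host == second_host
```

The early return corresponds to ending the comparison as soon as the first compared part differs.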
S2222, if the first IP address is the same as the corresponding second IP address, determining that the weblog data is the target search engine crawler.
Specifically, if the server determines that the first IP address is the same as the corresponding second IP address, it may determine that the first IP address obtained before is consistent with the IP address marked on each large regular search engine website, and at this time, determine that the weblog data is the target search engine crawler. In this case, it is shown that the first IP address is not falsified or tampered with. That is, the initial search engine crawler described above may be a target search engine crawler.
S2223, if the first IP address is different from the corresponding second IP address, determining that the weblog data is a malicious webcrawler.
Specifically, if the server determines that the first IP address is different from the corresponding second IP address, it may determine that the first IP address obtained before is inconsistent with the IP address marked on each large regular search engine website, and determine that the weblog data is a malicious web crawler at this time. In this case, it is shown that the first IP address is falsified or tampered with.
In addition, if the first IP address is not the same as the corresponding second IP address, the first IP address is deleted from the IP address list.
It can be understood that if the server determines that the first IP address is different from the corresponding second IP address, the first IP address in the IP address list corresponding to the user agent may be deleted, so as to achieve the purpose of real-time processing on the web crawler.
According to the method for disposing the web crawlers, the category of the web crawlers to which the web log data belongs can be determined according to the second IP address and the first address corresponding to the web site by the initial search engine crawler, the address list of the web crawlers is further updated according to the category of the web crawlers to which the web log data belongs, the problem that the web crawlers are uniformly disposed under the condition that the web crawlers are not classified is avoided, the accuracy of updating the address list is improved, and the disposal effect is further improved.
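Steps S2222 and S2223, together with the deletion of mismatched addresses, can be sketched in Python as follows. The mapping `second_ip_of` stands in for the result of whatever lookup produced the second IP addresses; it, the category labels, and the function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

TARGET = "target_search_engine_crawler"
MALICIOUS = "malicious_web_crawler"


def classify_and_prune(
    ip_list: List[str],
    second_ip_of: Dict[str, Optional[str]],
) -> Tuple[List[str], Dict[str, str]]:
    """Label each first IP address and keep only those matching their second IP address."""
    kept: List[str] = []
    categories: Dict[str, str] = {}
    for first_ip in ip_list:
        if second_ip_of.get(first_ip) == first_ip:
            categories[first_ip] = TARGET
            kept.append(first_ip)
        else:
            # A mismatch (or a missing second address) is treated as malicious
            # and the address is left out of the returned list.
            categories[first_ip] = MALICIOUS
    return kept, categories
```

Applied to the two addresses from the earlier example, the sketch keeps 220.181.125.106 and drops 24.125.7.34, provided the lookup confirmed only the former.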
As one embodiment, the step of determining, according to the user agent, the web crawler category to which the weblog data belongs in S220 may include the following steps: if the user agent contains programming language content, the weblog data is determined to be a malicious webcrawler.
Specifically, in addition to determining that the weblog data is a malicious web crawler through steps S2221 and S2223, the server may determine whether the user agent contains programming language content and, if so, determine that the weblog data is a malicious web crawler. The programming language content may relate to Python-urllib, trasin, curl, libcurl, and the like.
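A minimal Python sketch of this check is a case-insensitive substring test on the user agent. The token list below covers only the recognizable examples from the text and is an assumption; a deployment would maintain its own list.

```python
# Assumed tokens indicating a programmatic client; not an exhaustive list.
# "libcurl" is already matched by "curl" and is listed only for readability.
PROGRAMMATIC_TOKENS = ("python-urllib", "curl", "libcurl")


def has_programming_content(user_agent: str) -> bool:
    """True when the user agent contains a known programmatic-client token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in PROGRAMMATIC_TOKENS)
```

A user agent that returns True here would be classified as a malicious web crawler under this embodiment.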
In some scenarios, the identifying manner of determining the web crawler class to which the web log data belongs may be determined by the identifying manner in this embodiment in addition to the identifying manner described in the foregoing embodiment, as shown in fig. 8, and the step of determining, according to the user agent in the foregoing S220, the web crawler class to which the web log data belongs may include the following steps:
S223, acquiring a web crawler request characteristic from a preset field according to the function, wherein the web crawler request characteristic comprises access frequency and/or access abnormality information.
Specifically, the server may obtain the web crawler request feature from the preset field according to the function in the time series database. The web crawler request feature may include the access IP address, access frequency, access time, specific access information, etc. of the web crawler request. During access, web crawler requests may visit URLs sequentially and persistently. In this embodiment, the web crawler request feature may include an access frequency and/or access anomaly information.
S224, determining the category of the web crawler to which the web log data belongs according to the web crawler request characteristics.
It can be understood that the server may compare the access frequency with a preset frequency threshold, determine whether the access anomaly information includes Cookie anomaly information, and determine from the comparison result and the determination result whether the weblog data is a malicious web crawler. The access frequency may be greater than, less than, or equal to the preset frequency threshold, and the access anomaly information may or may not include Cookie anomaly information. In this embodiment, if the access frequency exceeds the preset frequency threshold and the access anomaly information includes Cookie anomaly information, the weblog data may be determined to be a malicious web crawler; otherwise, the weblog data is determined to be a search engine crawler. The Cookie anomaly information may indicate that no Cookie is carried in the access request or that the Cookie never changes during the access process.
According to the web crawler processing method, the type of the web crawler to which the web log data belongs can be determined through any one of a plurality of identification modes, so that the flexibility of the identification processing method is high, and the limitation is avoided.
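The rule in S223 and S224 can be sketched in Python as follows. The `RequestFeatures` layout, the per-second frequency unit, and the way a Cookie that never changes is detected are assumptions of this sketch rather than details given in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestFeatures:
    access_count: int             # requests observed in the window
    window_seconds: float         # length of the observation window (must be > 0)
    cookies: List[Optional[str]]  # Cookie header per request, None when absent


def access_frequency(features: RequestFeatures) -> float:
    """Requests per second over the observation window."""
    return features.access_count / features.window_seconds


def cookie_anomaly(features: RequestFeatures) -> bool:
    """True when a Cookie is missing or never changes across several requests."""
    if any(cookie is None for cookie in features.cookies):
        return True
    return len(features.cookies) > 1 and len(set(features.cookies)) == 1


def classify(features: RequestFeatures, frequency_threshold: float) -> str:
    """Malicious only when both the frequency and the Cookie conditions hold."""
    if access_frequency(features) > frequency_threshold and cookie_anomaly(features):
        return "malicious_web_crawler"
    return "search_engine_crawler"
```

Requiring both conditions follows the embodiment above; a stricter policy could flag either condition alone.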
As one embodiment, if the web crawler type to which the weblog data belongs is a malicious web crawler, the step of updating the address list of the web crawler in S300 may include: if the third IP address in the malicious web crawler gray list is an updated IP address in the preset time and the access frequency of the weblog data corresponding to the updated IP address is greater than a preset frequency threshold, the address list is updated according to the user instruction.
Specifically, the malicious web crawler gray list may include pre-stored IP addresses of malicious web crawlers, i.e., third IP addresses. The malicious web crawlers corresponding to the third IP addresses may include malicious web crawlers determined by the three identification modes above, malicious web crawlers determined according to a set gray list threshold, and custom malicious web crawlers added manually or automatically by the server. Determining a malicious web crawler according to the set gray list threshold may be understood as obtaining a web crawler request feature and comparing it with the gray list threshold; if the web crawler request feature exceeds the gray list threshold, the weblog data corresponding to that feature is determined to be a malicious web crawler. If the web crawler request feature is an access frequency, the gray list threshold may be the preset frequency threshold. A third IP address may be added to the malicious web crawler gray list at any time.
It can be understood that if the server determines that a third IP address in the malicious web crawler gray list is an IP address added within a preset time, and that the access frequency of the weblog data corresponding to the added IP address is greater than the preset frequency threshold, the server may send alarm information to the client to remind the user that the third IP address in the malicious web crawler gray list needs to be handled. The server may then receive a user instruction sent by the client and update the address list according to the user instruction. The address list may include a malicious web crawler gray list and a malicious web crawler black list in addition to the above-described IP address list. The preset time may be a preset time period, for example, within 2 hours, within 8 hours, within 16 hours, or within 24 hours; with a preset time of 24 hours, for example, the IP address added within the preset time may be understood as a third IP address newly added to the malicious web crawler gray list within the last 24 hours. Optionally, the user instruction may carry the category of the weblog data. In this embodiment, the category may be normal access request data or a malicious web crawler, and may be determined by the user based on past handling experience.
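As a minimal sketch of the alarm condition described above, the check may be written as follows. The function name needs_user_review, the default threshold of 100.0, and the representation of the gray list entry as a timestamp plus an access frequency are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def needs_user_review(added_at: datetime,
                      access_frequency: float,
                      now: datetime,
                      preset_time: timedelta = timedelta(hours=24),
                      frequency_threshold: float = 100.0) -> bool:
    """Decide whether alarm information should be sent for a gray list entry.

    added_at is when the third IP address entered the gray list (hypothetical
    bookkeeping; the text does not specify how this is stored).
    """
    age = now - added_at
    # "Added within the preset time": the entry is no older than the window.
    recently_added = timedelta(0) <= age <= preset_time
    return recently_added and access_frequency > frequency_threshold
```

For example, an entry added 3 hours ago with an access frequency of 250.0 would trigger the alarm under the default values, whereas the same entry added 30 hours ago would not.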
The address list includes a malicious web crawler gray list and a malicious web crawler black list, and the step of updating the address list according to the user instruction may specifically include: if the user instruction indicates that the weblog data are normal access request data, deleting the updated IP address in the malicious web crawler gray list; if the user instruction indicates that the network access request data is a malicious web crawler, the updated IP address is added to the malicious web crawler blacklist.
It may be appreciated that the malicious web crawler blacklist may include the IP addresses of malicious web crawlers, so that a network access request from an IP address in the malicious web crawler blacklist can be rejected directly. In addition, if the user cannot identify the category of the network access request, the updated IP address may be kept in the malicious web crawler gray list, that is, the updated IP address is not deleted from the malicious web crawler gray list. Meanwhile, the updated address list may be stored in the Lua shared memory in real time, so that the server can use it directly in subsequent processing.
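The three outcomes of a user instruction described above may be sketched as follows. This is an illustrative sketch only: the function name apply_user_instruction, the instruction strings "normal" and "malicious", and the use of in-memory sets in place of the Lua shared memory are all assumptions. Since the text does not state whether a blacklisted address is also removed from the gray list, this sketch leaves it there.

```python
def apply_user_instruction(ip: str, instruction: str,
                           gray_list: set, black_list: set) -> None:
    """Update the gray list and blacklist in place for one updated IP address."""
    if instruction == "normal":
        # Normal access request data: delete the address from the gray list.
        gray_list.discard(ip)
    elif instruction == "malicious":
        # Malicious web crawler: add the address to the blacklist.
        black_list.add(ip)
    # Any other value means the user could not identify the category,
    # so the address simply stays in the gray list.
```

A brief usage example with documentation-range addresses:

```python
gray = {"203.0.113.7", "203.0.113.8", "203.0.113.9"}
black = set()
apply_user_instruction("203.0.113.7", "normal", gray, black)
apply_user_instruction("203.0.113.8", "malicious", gray, black)
apply_user_instruction("203.0.113.9", "unknown", gray, black)
```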
According to the embodiment, the weblog data can be treated in real time according to the actual access request to update the address list, so that the accuracy of updating the address list can be improved, and the treatment effect can be further improved.
The web crawler processing method can update the address list according to the user instruction, so that when a subsequent network access request arrives, it can be responded to or rejected in real time according to the updated address list. This improves network access security while reducing the probability of 'false killing', that is, of mistakenly blocking a normal access request.
To facilitate understanding by those skilled in the art, the web crawler processing method provided by the present application is described below, taking a server as the execution body as an example. The method specifically includes:
(1) Acquiring network access request data from a memory; the network access request data includes an Nginx variable.
(2) Screening relevant variables of the web crawler to be treated from the Nginx variables according to the web crawler requirements, and determining the relevant variables as the weblog data.
(3) Analyzing the weblog data according to the function in the time sequence database, and determining a preset field.
(4) Determining the user agent of the initial search engine crawler according to the preset field.
(5) Acquiring an IP address list corresponding to the user agent from the weblog data; the IP address list includes a plurality of first IP addresses.
(6) Comparing the first IP address with the corresponding second IP address.
(7) If the first IP address is the same as the corresponding second IP address, determining that the weblog data is the target search engine crawler.
(8) If the first IP address is different from the corresponding second IP address, determining that the weblog data is a malicious web crawler.
(9) If the first IP address is different from the corresponding second IP address, deleting the first IP address from the IP address list.
(10) Or, if the user agent contains programming language content, determining that the weblog data is a malicious web crawler.
(11) Or, according to the function, acquiring the web crawler request characteristics from the preset field, wherein the web crawler request characteristics comprise access frequency and/or access abnormality information.
(12) Determining the category of the web crawler to which the weblog data belongs according to the web crawler request characteristics.
(13) If the category of the web crawler to which the weblog data belongs is a malicious web crawler, a third IP address in the malicious web crawler gray list is an updated IP address within a preset time, and the access frequency of the weblog data corresponding to the updated IP address is greater than a preset frequency threshold, updating the address list according to a user instruction.
(14) The address list comprises a malicious web crawler gray list and a malicious web crawler black list; if the user instruction indicates that the weblog data is normal access request data, deleting the updated IP address in the malicious web crawler gray list.
(15) If the user instruction indicates that the network access request data is a malicious web crawler, adding the updated IP address to the malicious web crawler blacklist.
The implementation process of the above (1) to (15) may be specifically referred to the description of the above embodiment, and its implementation principle and technical effects are similar, and will not be described herein again.
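For illustration only, steps (6) to (10) may be sketched as follows. The mapping second_ip_for, which supplies the second IP address corresponding to each first IP address, is a hypothetical input, since how that correspondence is obtained is described in the earlier embodiments rather than here. The marker strings used to detect programming language content in a user agent are likewise example values chosen for this sketch.

```python
from typing import Dict, List, Tuple

# Example substrings that suggest a scripted client (assumed values).
SCRIPT_CLIENT_MARKERS = ("python-requests", "python-urllib",
                         "go-http-client", "java/")


def user_agent_has_programming_language(user_agent: str) -> bool:
    """Step (10): flag user agents that contain programming language content."""
    ua = user_agent.lower()
    return any(marker in ua for marker in SCRIPT_CLIENT_MARKERS)


def split_ip_list(first_ips: List[str],
                  second_ip_for: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Steps (6)-(9): compare each first IP address with its second IP address.

    Returns (kept, malicious): addresses that match stay in the IP address
    list; addresses that differ are deleted from it and treated as malicious.
    """
    kept, malicious = [], []
    for first_ip in first_ips:
        if second_ip_for.get(first_ip) == first_ip:
            kept.append(first_ip)       # step (7): target search engine crawler
        else:
            malicious.append(first_ip)  # steps (8) and (9)
    return kept, malicious
```

In this sketch, a first IP address with no corresponding second IP address in the mapping is treated as different, which is a design choice of the example rather than a requirement stated in the text.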
It should be understood that, although the steps in the flowcharts of fig. 2-8 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2-8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a web crawler handling device including: an analysis module 11, a crawler category determination module 12 and an address list update module 13, wherein:
The analysis module 11 is configured to analyze the weblog data to obtain a preset field in the weblog data;
the crawler category determining module 12 is configured to determine, through a preset field, a web crawler category to which the weblog data belongs;
the address list updating module 13 is configured to update an address list of a web crawler according to a web crawler category to which the web log data belongs, where the address list of the web crawler is used for handling the network access request.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the analysis module 11 comprises: a preprocessing unit and an analysis unit, wherein:
The preprocessing unit is used for preprocessing the network access request data to obtain weblog data;
and the analysis unit is used for analyzing the weblog data according to the function in the time sequence database and determining a preset field.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the preprocessing unit includes: a request data acquisition subunit and a log data determination subunit, wherein:
a request data acquisition subunit, configured to acquire network access request data from the memory; the network access request data includes an Nginx variable;
And the log data determining subunit is used for screening relevant variables of the web crawler to be treated from the Nginx variables according to the requirements of the web crawler and determining the relevant variables as the web log data.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the crawler category determination module 12 includes: a user agent determination unit and a crawler category determination unit, wherein:
the user agent determining unit is used for determining the user agent of the initial search engine crawler according to the preset field;
And the crawler type determining unit is used for determining the type of the web crawler to which the weblog data belong according to the user agent.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the crawler category determination unit includes: an address list acquisition subunit and a first type determination subunit, wherein:
An address list obtaining subunit, configured to obtain an IP address list corresponding to the user agent from the weblog data; the IP address list comprises a plurality of first IP addresses;
and the first type determining subunit is used for determining the category of the web crawler to which the weblog data belongs according to the second IP address and the first IP address corresponding to the website of the initial search engine crawler.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the first type determining subunit includes: a comparing subunit, a first determining subunit, and a second determining subunit, wherein:
a comparing subunit, configured to compare the first IP address with the corresponding second IP address;
The first determining subunit is configured to determine that the weblog data is a target search engine crawler when the first IP address is the same as the corresponding second IP address;
And the second determining subunit is used for determining that the weblog data is a malicious web crawler when the first IP address is different from the corresponding second IP address.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the first type determining subunit further includes: an address deletion subunit, wherein:
And the address deleting subunit is used for deleting the first IP address from the IP address list when the first IP address is different from the corresponding second IP address.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the crawler category determination unit includes: a second class determination subunit, wherein:
A second class determination subunit, configured to determine that the weblog data is a malicious web crawler when the user agent contains programming language content.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the crawler category determination unit includes: a request feature acquisition subunit and a third determination subunit, wherein:
The request feature acquisition subunit is used for acquiring web crawler request features from a preset field according to the function, wherein the web crawler request features comprise access frequency and/or access abnormality information;
and the third determining subunit is used for determining the category of the web crawler to which the web log data belongs according to the web crawler request characteristics.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, if the web crawler type to which the weblog data belongs is a malicious web crawler, the address list updating module 13 includes: an address list updating unit, wherein:
And the address list updating unit is used for updating the address list according to the user instruction when the third IP address in the malicious web crawler gray list is an updated IP address in a preset time and the access frequency of the weblog data corresponding to the updated IP address is greater than a preset frequency threshold.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
In one embodiment, the address list includes a malicious web crawler gray list and a malicious web crawler black list, and the address list updating unit includes: a delete subunit and an add subunit, wherein:
A deletion subunit, configured to delete the updated IP address in the malicious web crawler gray list when the user instruction indicates that the weblog data is normal access request data;
And the adding subunit is used for adding the updated IP address to the malicious web crawler blacklist when the user instruction indicates that the network access request data is the malicious web crawler.
The web crawler processing device provided in this embodiment may execute the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein.
For specific limitations of the web crawler handling device, reference may be made to the above limitations of the web crawler handling method, which are not repeated here. The modules in the above-described web crawler handling device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the server in hardware form, or may be stored in a memory in the server in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a server is provided that includes a memory and a processor, the memory having a computer program stored therein, the processor when executing the computer program performing the steps of:
Analyzing the weblog data to obtain preset fields in the weblog data;
Determining the category of the web crawler to which the weblog data belongs through a preset field;
And updating an address list of the web crawler according to the category of the web crawler to which the web log data belongs, wherein the address list of the web crawler is used for handling the network access request.
In one embodiment, a storage medium is provided, having a computer program stored thereon, the computer program when executed by a processor performing the steps of:
Analyzing the weblog data to obtain preset fields in the weblog data;
Determining the category of the web crawler to which the weblog data belongs through a preset field;
And updating an address list of the web crawler according to the category of the web crawler to which the web log data belongs, wherein the address list of the web crawler is used for handling the network access request.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above examples illustrate only a few embodiments of the application. They are described specifically and in detail, but are not thereby to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.