Disclosure of Invention
An object of the embodiments of the present application is to provide a page proxy method, an apparatus, an electronic device, and a storage medium, which are used to solve the problem that the crawling rate of resource links in a page is not high.
An embodiment of the present application provides a page proxy method, applied to a proxy server, comprising: receiving a page request sent by a crawler terminal, and acquiring a response page corresponding to an access link in the page request; determining whether the response page includes script code; and if so, loading and rendering the response page using a browser, modifying the script code in the rendered response page, injecting the modified script code into the response page, and sending the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and acquires the resource links in the response page. In this implementation, the proxy server modifies the script code in the response page that is to be returned to the crawler terminal, and then returns the modified response page to the crawler terminal. When the browser of the crawler terminal renders the modified response page, it automatically executes the modified script code, so that the script code is triggered automatically and the resource links requested asynchronously by all trigger events in the response page are obtained, which effectively improves the crawling rate of resource links in the page.
Optionally, in this embodiment of the present application, after determining whether the response page includes script code, the method further includes: if the response page does not include script code, sending the response page to the crawler terminal. In this implementation, when the response page does not include script code, the static response page is sent directly to the crawler terminal, so that the crawler terminal obtains the resource links in the static response page directly.
Optionally, in an embodiment of the present application, the script code includes an add-event-listener function; and modifying the script code in the rendered response page and injecting the modified script code into the response page includes: overwriting the add-event-listener function with a predefined event listener function to obtain rewritten script code, where the predefined event listener function is used for adding a trigger event to a tag element in the response page when the response page is rendered, so that the crawler terminal obtains the resource link generated when the trigger event of the tag element is triggered; and injecting the rewritten script code into the response page. In this implementation, because the add-event-listener function is overwritten with the predefined event listener function, a trigger event can be added to the tag elements in the response page when the response page is rendered, and the crawler terminal acquires the resource links generated when the trigger events of the tag elements are triggered, which effectively improves the crawling rate of resource links in the page.
Optionally, in this embodiment of the present application, before injecting the rewritten script code into the response page, the method further includes: acquiring all tag elements in the response page; and removing, from all the tag elements in the response page, the tag elements that cannot trigger an event. In this implementation, removing the tag elements that cannot trigger an event allows all the tag elements in the response page to be managed in a unified manner, which effectively reduces the possibility that resource links are missed.
Optionally, in this embodiment of the present application, removing the tag elements that cannot trigger an event from all the tag elements in the response page includes: if a tag element has no bound event, or the tag element is invisible, removing the tag element. In this implementation, removing the tag elements that have no bound event or are invisible allows all the tag elements in the response page to be managed in a unified manner, which effectively improves the crawling rate of resource links in the page.
Optionally, in this embodiment of the present application, obtaining a response page corresponding to an access link in a page request includes: sending a page request to a website server corresponding to the access link so that the website server returns a response page corresponding to the page request; and receiving a response page sent by the website server. In the implementation process, the proxy server can send the page request in a proxy mode and receive the response page corresponding to the page request by sending the page request to the website server corresponding to the access link and receiving the response page sent by the website server, so that the flexibility of controlling the response page is improved.
Optionally, in this embodiment of the present application, sending a page request to the website server corresponding to the access link includes: resolving a plurality of Internet Protocol addresses according to the domain name in the access link; and sending the page request to the plurality of Internet Protocol addresses in a load-balanced manner. In this implementation, sending page requests to the plurality of Internet Protocol addresses in a load-balanced manner reduces the pressure on the website server and avoids being blocked by the website server because of frequent access.
An embodiment of the present application further provides a page proxy apparatus, including: the response page acquisition module is used for receiving a page request sent by the crawler terminal and acquiring a response page corresponding to the access link in the page request; the script code judging module is used for judging whether the response page comprises script codes or not; and the page injection sending module is used for loading and rendering the response page by using a browser if the response page comprises the script codes, modifying the script codes in the rendered response page, injecting the modified script codes into the response page, and sending the injected response page to the crawler terminal so that the crawler terminal triggers the script codes and acquires the resource links in the response page.
Optionally, in this embodiment of the present application, the page proxy apparatus further includes: and the response page sending module is used for sending the response page to the crawler terminal if the response page does not comprise the script code.
Optionally, in this embodiment of the present application, the response page obtaining module includes: the page request sending module is used for sending a page request to the website server corresponding to the access link so that the website server returns a response page corresponding to the page request; and the response page receiving module is used for receiving a response page sent by the website server.
Optionally, in an embodiment of the present application, the page request sending module includes: the access link analysis module is used for analyzing a plurality of internet protocol addresses according to the domain name in the access link; and the request load balancing module is used for sending the page requests to the plurality of internet protocol addresses in a load balancing mode.
Optionally, in an embodiment of the present application, the script code includes an add-event-listener function, and the page injection sending module includes: a script overwriting module, configured to overwrite the add-event-listener function with a predefined event listener function to obtain rewritten script code, where the predefined event listener function is used for adding a trigger event to a tag element in the response page when the response page is rendered, so that the crawler terminal obtains the resource link generated when the trigger event of the tag element is triggered; and a response page injection module, configured to inject the rewritten script code into the response page.
Optionally, in this embodiment of the present application, the page injection sending module further includes: the tag element acquisition module is used for acquiring all tag elements in the response page; and the label element removing module is used for removing the label elements which cannot trigger the event in all the label elements in the response page.
Optionally, in this embodiment of the present application, the tag element removing module is specifically configured to: and if the tag element does not have the binding event or is invisible, rejecting the tag element.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.
Embodiments of the present application also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method as described above.
Detailed Description
The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Before introducing the page proxy method provided in the embodiment of the present application, some concepts related in the embodiment of the present application are introduced:
A proxy server is a server that obtains network information on behalf of network users. The proxy server is a transfer station for network information, that is, an intermediary between a source host and a destination host, for example between a host in a personal network and a server of an Internet service provider; it is responsible for forwarding legitimate network information and for controlling and logging that forwarding.
A reverse proxy refers to a reverse proxy service provided by a proxy server in a computer network, that is, the proxy server can obtain resources from one or more sets of backend servers (e.g., Web servers) related to the client according to a request of the client, and then return the resources to the client, and the client only knows an Internet Protocol (IP) address of the reverse proxy and does not know existence of a server cluster behind the proxy server.
A forward proxy refers to a proxy service, provided by a proxy server in a computer network, that forwards request messages on behalf of a client. In the forward proxy process, the server does not know the actual IP address of the client that initiated the request; in the reverse proxy process, by contrast, the client does not know the actual IP address of the server that actually provides the service.
Nginx is HTTP service software designed for performance. It can provide high-performance HTTP service and reverse proxy service and, compared with Apache HTTP Server and Lighttpd, has advantages such as lower memory usage and high stability. Nginx is also a web server with an asynchronous architecture, and can be used as a reverse proxy, a load balancer, and an HTTP cache.
Headless browsers refer to browsers without graphical user interfaces; headless browsers provide automatic control of web pages in an environment similar to popular web browsers, but do so through a command line interface or using web communications.
The WebDriver tool is a piece of open source software, the WebDriver can control different browsers (such as Firefox, Chrome, Safari, IE) in a mode of defining a driving engine, and the WebDriver can open a URL to interact with a page which is rendered.
jQuery is a cross-browser JavaScript library that simplifies operations between Hyper Text Markup Language (HTML) and JavaScript.
The Document Object Model (DOM) is a tree-structured internal data model that describes the result of parsing an XML document; an XML document includes a root node, internal nodes, leaf nodes, comment nodes, and the like.
It should be noted that the page proxy method provided in the embodiment of the present application may be executed by an electronic device, where the electronic device refers to a device terminal or a server capable of executing a computer program. The device terminal includes, for example: smart phones, Personal Computers (PCs), tablet computers, and the like. A server refers to a device that provides computing services over a network, for example: x86 servers and non-x86 servers, where non-x86 servers include mainframes, minicomputers, and UNIX servers.
Before introducing the page proxy method provided by the embodiment of the present application, an application scenario to which the page proxy method is applicable is introduced. The application scenario includes, but is not limited to: using the page proxy method to enhance the functions or performance of crawler software or crawler hardware, so that the resource links crawled by the crawler software or crawler terminal device are more complete and fewer resource links are omitted during crawling.
Please refer to FIG. 1, which is a schematic flowchart of the page proxy method provided in the embodiment of the present application. The page proxy method can be applied to a proxy server, that is, the method can be executed by the proxy server. The main idea of the page proxy method is that the proxy server modifies the script code in the response page that is to be returned to the crawler terminal, and then returns the modified response page to the crawler terminal; when the browser of the crawler terminal renders the modified response page, it automatically executes the modified script code, so that the script code is triggered automatically and the resource links requested asynchronously by all trigger events in the response page are acquired, which effectively solves the problem that the crawler cannot acquire the resource links requested asynchronously by all trigger events in the page. The page proxy method may include:
Step S110: the proxy server receives a page request sent by the crawler terminal and acquires a response page corresponding to the access link in the page request.
An embodiment in which the proxy server receives the page request sent by the crawler terminal in step S110 is, for example: the proxy server receives the page request sent by the crawler terminal through the Hypertext Transfer Protocol (HTTP) or the Hypertext Transfer Protocol Secure (HTTPS); the page request includes an access link.
The above embodiment of obtaining the response page corresponding to the access link in the page request in step S110 may include:
Step S111: the proxy server sends a page request to the website server corresponding to the access link, so that the website server returns a response page corresponding to the page request.
Step S112: the proxy server receives the response page sent by the website server.
Since the embodiment of step S111 to step S112 is relatively closely related, the two steps will be described together; there are many embodiments of the foregoing steps S111 to S112, including but not limited to the following:
In the first implementation mode, a page request is sent to the website server corresponding to the access link in a reverse proxy manner, and the response page sent by the website server is received. This embodiment specifically includes, for example: the proxy server resolves a plurality of Internet Protocol addresses according to the domain name in the access link, uses reverse proxy software to send page requests to the plurality of Internet Protocol addresses in a load-balanced manner, then uses the reverse proxy software to receive the response page sent by the website server, and sends the response page to the crawler terminal. Reverse proxy software that can be used includes: Nginx, Tengine, Apache HTTP Server, HAProxy, Hiawatha HTTP Server, and the like.
In the second implementation mode, a page request is sent to the website server corresponding to the access link in a forward proxy manner, and the response page sent by the website server is received. This embodiment specifically includes, for example: using forward proxy software to receive the response page sent by the website server and to send the response page to the crawler terminal. Forward proxy software that may be used includes: CERN HTTPd Server, Cherokee HTTP Server, Nginx HTTP Server, Apache HTTP Server, Lighttpd HTTP Server, and the like.
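The load-balanced sending described in the first implementation mode can be illustrated with a minimal Python sketch. It assumes a simple round-robin policy, which the application does not prescribe; the function names `resolve_ips`, `round_robin`, and `pick_targets` are hypothetical and chosen only for illustration, and the addresses in the usage note are documentation placeholders.

```python
import itertools
import socket
from typing import Iterator, List


def resolve_ips(domain: str) -> List[str]:
    """Resolve a domain name to its distinct IP addresses via the system resolver."""
    infos = socket.getaddrinfo(domain, 80, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def round_robin(ips: List[str]) -> Iterator[str]:
    """Return an endless iterator that hands out the addresses in turn."""
    if not ips:
        raise ValueError("no IP addresses were resolved")
    return itertools.cycle(ips)


def pick_targets(ips: List[str], request_count: int) -> List[str]:
    """Assign one target address to each of request_count page requests."""
    targets = round_robin(ips)
    return [next(targets) for _ in range(request_count)]
```

For example, `pick_targets(["192.0.2.10", "192.0.2.11", "192.0.2.12"], 4)` assigns the fourth request back to the first address. In practice a reverse proxy such as Nginx would normally perform this distribution itself; the sketch only shows the idea.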
After step S110, step S120 is performed: the proxy server determines whether the response page includes script code.
The script code refers to code written in an interpreted scripting language that runs in a browser, and the script code may include an add-event-listener function; the interpreted scripting language here includes, but is not limited to, the JavaScript language.
An embodiment of step S120 is, for example: the proxy server parses the source code of the response page using a source code parser, and then uses a regular expression to search the source code for a preset tag; if the preset tag exists, the response page is determined to include script code; otherwise, the response page is determined not to include script code. The preset tag here is the script tag. If a script tag has a src attribute, the source code of the script can be obtained by accessing the value of the src attribute; if the script tag has no src attribute, the source code inside the script tag can be obtained using a regular expression.
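The regular-expression check in step S120 might look like the following Python sketch. The patterns and the names `SCRIPT_TAG`, `SRC_ATTR`, `find_scripts`, and `has_script_code` are illustrative assumptions rather than anything specified in the application, and a regular expression is only an approximation of real HTML parsing.

```python
import re
from typing import List, Tuple

# Matches <script ...>...</script>, capturing the attribute text and the inline body.
SCRIPT_TAG = re.compile(r"<script\b([^>]*)>(.*?)</script\s*>", re.IGNORECASE | re.DOTALL)
# Matches src="..." or src='...' in the attribute text (but not data-src).
SRC_ATTR = re.compile(r"""(?<![\w-])src\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_scripts(html: str) -> Tuple[List[str], List[str]]:
    """Return (external script URLs, inline script sources) found in the page source."""
    external: List[str] = []
    inline: List[str] = []
    for attrs, body in SCRIPT_TAG.findall(html):
        src = SRC_ATTR.search(attrs)
        if src:
            # The script source must be fetched from the src attribute value.
            external.append(src.group(1))
        elif body.strip():
            # No src attribute: the source code is the content of the tag itself.
            inline.append(body.strip())
    return external, inline


def has_script_code(html: str) -> bool:
    """Step S120: the page is treated as including script code if a script tag exists."""
    return SCRIPT_TAG.search(html) is not None
```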
After step S120, step S130 is performed: and if the response page comprises the script code, the proxy server loads and renders the response page by using the browser, modifies the script code in the rendered response page, injects the modified script code into the response page, and sends the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and acquires the resource link in the response page.
There are many embodiments of the proxy server in step S130 loading and rendering the response page using the browser, including but not limited to the following:
In a first embodiment, the proxy server uses a program to control the browser to load and render the response page; this embodiment specifically includes: controlling the browser to load and render the response page using a Selenium program, a jQuery program, or the WebDriver tool.
In a second embodiment, the proxy server uses a program or tool to control a headless browser to load and render the response page. Headless browsers that may be used include, but are not limited to: the PhantomJS browser, the Chrome browser in headless mode (Headless Chrome), the Firefox browser in headless mode, and the like. While loading the response page, the headless browser may execute JavaScript scripts and may load style files such as CSS, picture files, and the like.
There are many embodiments for modifying the script code in the rendered response page and injecting the modified script code into the response page in step S130, including but not limited to the following:
In the first implementation mode, the script code is modified by overwriting the add-event-listener function, and the modified script code is injected into the response page. Specifically, a predefined event listener function is used to overwrite the add-event-listener (addEventListener) function to obtain the rewritten script code, and the rewritten script code is then injected into the response page. Here, addEventListener is the function that registers a listener for an event together with the function that handles the event. The proxy server can overwrite the window event listener by overwriting the addEventListener function, where window refers to a browser built-in object; the proxy server can likewise overwrite the document event listener, where document refers to the document object of the HTML response page; and the proxy server can also overwrite the event listener of a document object model node (DOM node) by overwriting the addEventListener function. The predefined event listener function is used for adding a trigger event to the tag elements in the response page when the response page is rendered, so that the crawler terminal acquires the resource links generated when the trigger event of a tag element is triggered.
In the second implementation mode, the add-event-listener function is overwritten, the tag elements in the response page are then screened to obtain the modified script code, and the modified script code is injected into the response page. This embodiment specifically includes, for example: first, overwriting the add-event-listener function, and then acquiring all tag elements in the response page using JavaScript code; second, removing from all the tag elements in the response page the tag elements that cannot trigger an event, the tag elements that are not allowed by the configuration file, and the invisible tag elements; then, passing the events of ancestor tag elements down to their descendant tag elements in turn, so that the descendant tag elements inherit the events passed down from their ancestors, and merging all the events bound to each descendant tag element, that is, removing duplicate events and invalid events; and finally, injecting the rewritten script code into the response page.
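The overwrite-and-inject idea in the two implementation modes above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the application's actual script: the JavaScript hook only records each listener registration in a hypothetical global `window.__boundEvents` and does not itself add trigger events; it overrides `EventTarget.prototype.addEventListener`, which window, document, and DOM nodes inherit in current browsers. The names `HOOK_JS` and `inject_hook` are also hypothetical.

```python
import re

# Hypothetical hook: wraps addEventListener so every registration is recorded
# before being passed on to the browser's original implementation.
HOOK_JS = """
(function () {
  var original = EventTarget.prototype.addEventListener;
  window.__boundEvents = [];
  EventTarget.prototype.addEventListener = function (type, listener, options) {
    window.__boundEvents.push({ target: this, type: type });
    return original.call(this, type, listener, options);
  };
})();
"""

HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_hook(html: str, hook_js: str = HOOK_JS) -> str:
    """Insert the hook as the first script of the page so it runs before page scripts."""
    tag = "<script>" + hook_js + "</script>"
    head = HEAD_OPEN.search(html)
    if head:
        # Place the hook immediately after the opening <head> tag.
        return html[:head.end()] + tag + html[head.end():]
    # No <head> element: prepend so the hook still precedes every other script.
    return tag + html
```

Placing the hook first matters because a listener registered before the override is installed would not be recorded.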
A specific implementation of removing the tag elements that cannot trigger an event from all the tag elements in the response page is, for example: if a tag element has no bound event or is invisible, the tag element is removed; if a tag element has no bound event but its parent tag element (or an ancestor tag element) has a bound event, the tag element inherits the bound event of the parent tag element (or ancestor tag element); if a tag element has no bound event and its parent tag element (or ancestor tag element) has no bound event either, the tag element is considered to have no bound event and needs to be removed; if a tag element has a bound event but the bound event is an invalid event that cannot be triggered, the tag element is considered to have no bound event and needs to be removed, where an invalid event may also be an event bound to the tag element without a name; and if a tag element is bound with a plurality of events that include both valid events and invalid events, the tag element is not removed.
Alternatively, after step S120, step S140 is performed: if the response page does not include script code, the proxy server sends the response page to the crawler terminal.
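The removal rules above can be expressed as a short Python sketch over a simplified element model. The `TagElement` class, the `TRIGGERABLE` set of event names treated as valid, and the function names are assumptions made for illustration; in the application this screening is performed by JavaScript code on the rendered page.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Assumption: the event names treated as valid (able to be triggered).
TRIGGERABLE = {"click", "submit", "change", "keydown", "mouseover"}


@dataclass
class TagElement:
    name: str
    events: Set[str] = field(default_factory=set)
    visible: bool = True
    parent: Optional["TagElement"] = None


def effective_events(element: TagElement) -> Set[str]:
    """Merge the element's own events with those inherited from its ancestors."""
    merged: Set[str] = set()  # a set removes duplicate events automatically
    node: Optional[TagElement] = element
    while node is not None:
        merged |= node.events
        node = node.parent
    # Drop invalid events: unnamed ones and ones that cannot be triggered.
    return {event for event in merged if event and event in TRIGGERABLE}


def keep_triggerable(elements: List[TagElement]) -> List[TagElement]:
    """Keep only visible elements that have at least one valid event."""
    return [el for el in elements if el.visible and effective_events(el)]
```

Under these assumptions, a button with no events of its own inside a form that has a submit event is kept, while a hidden link with a click event and a paragraph with no events are removed.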
The embodiment of the step S140 is, for example: if the response page does not include the script code, the response page can be a static page, where the static page refers to a page that can acquire all information in the page without interaction with the website server corresponding to the page again, and then the proxy server can directly send the static page to the crawler terminal, that is, the proxy server directly sends the response page to the crawler terminal.
In the above implementation, a page request sent by a crawler terminal is first received, and a response page corresponding to the access link in the page request is obtained; then, when the response page includes script code, the response page is loaded and rendered using a browser, the script code in the rendered response page is modified, the modified script code is injected into the response page, and the injected response page is sent to the crawler terminal, so that the crawler terminal triggers the script code and acquires the resource links in the response page. That is to say, the proxy server modifies the script code in the response page that is to be returned to the crawler terminal and then returns the modified response page to the crawler terminal; when the browser of the crawler terminal renders the modified response page, it automatically executes the modified script code, so that the script code is triggered automatically and the resource links requested asynchronously by all trigger events in the response page are acquired, which effectively solves the problem that the crawler cannot acquire the resource links requested asynchronously by all trigger events in the page.
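The overall flow of steps S110 to S140 can be summarized in a small Python sketch. The function `handle_page_request` and its two callable parameters `fetch` and `render_and_inject` are hypothetical stand-ins for the proxy's fetching step and its browser-based rendering and injection step; the script check is reduced to a single regular expression for brevity.

```python
import re
from typing import Callable

SCRIPT_TAG = re.compile(r"<script\b", re.IGNORECASE)


def handle_page_request(
    url: str,
    fetch: Callable[[str], str],
    render_and_inject: Callable[[str], str],
) -> str:
    """Return the page that the proxy server sends back to the crawler terminal."""
    page = fetch(url)                    # S110: obtain the response page
    if SCRIPT_TAG.search(page):          # S120: does the page include script code?
        return render_and_inject(page)   # S130: render, modify, inject, then send
    return page                          # S140: static page is sent unchanged
```

Passing the two steps in as callables keeps the sketch independent of any particular HTTP client or headless browser.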
Please refer to fig. 2, which is a schematic flow diagram illustrating a process performed by a crawler terminal on a received response page according to an embodiment of the present application; the implementation manner of processing the received response page by the crawler terminal may include:
Step S210: the crawler terminal receives the response page sent by the proxy server and parses the modified script code in the response page.
An embodiment of step S210 is, for example: the crawler terminal receives the response page sent by the proxy server through the HTTP protocol or the HTTPS protocol, and loads and renders the response page using a headless browser; after the response page is loaded and rendered, a script tag is found in the response page using a JavaScript program or a jQuery program, the source code in the script tag is obtained using a regular expression, and the source code in the script tag is taken as the modified script code. Although the script code has been modified by the proxy server, the crawler terminal does not know whether the script code has been modified.
After step S210, step S220 is performed: and the crawler terminal executes the modified script code to acquire all events bound by all the tag elements in the response page.
There are many embodiments of the above step S220, including but not limited to the following:
In the first implementation mode, the crawler terminal executes the modified script code using the Selenium tool and the WebDriver tool, so as to acquire all events bound to all tag elements in the response page. This embodiment is, for example: acquiring all tag elements in the response page using a regular expression, XPath, or the Beautiful Soup package in a Python program, and then acquiring all events bound to all the tag elements using a JavaScript program or a jQuery program.
In the second implementation mode, the modified script code is a JavaScript program and a jQuery program, and the crawler terminal executes the modified JavaScript program and jQuery program so as to obtain all events bound to all tag elements in the response page. This embodiment is, for example: after the loading and rendering of the response page are completed, all DOM nodes in the response page (a DOM node is another name for a tag element in the context of DOM operations) can be selected using a jQuery selector, and it is then determined whether each DOM node is bound with an event; if a DOM node is bound with an event, the event of the DOM node is extracted using a JavaScript program. The events here include, but are not limited to: hyperlink click events, form click events, mouse click events, keyboard key events, and the like in the webpage to be processed.
After step S220, step S230 is performed: the crawler terminal controls the thread of the headless browser to simulate and trigger all events, intercepts a page request generated in the triggering process of the events, and acquires resource links generated in the page request.
The embodiment of the step S230 is, for example: the crawler terminal starts a plurality of threads of a headless browser by using a Selenium tool, simulates and triggers all events bound by all label elements by using the plurality of threads of the headless browser, then intercepts a page request generated by the event in the triggering process by using a Python program, and acquires resource links generated in the page request by using programs such as a JavaScript script, jQuery and Python, or acquires the resource links generated in the page request by using a regular expression, XPath and Beautiful Soup program suite in the Python program, or acquires the resource links generated in the page request by using tools such as node. Among them, headless browsers that may be used include, but are not limited to: a PhantomJS browser, a Chrome browser in headless mode, a Firefox browser in headless mode, etc.
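Once the page requests generated by the triggered events have been intercepted, collecting the resource links might look like the following Python sketch. It assumes the intercepted request targets are already available as strings (how they are captured depends on the browser tooling and is not shown), and the function name `collect_resource_links` is hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def collect_resource_links(page_url: str, requested: Iterable[str]) -> List[str]:
    """Resolve intercepted request targets against the page URL and deduplicate them."""
    seen = set()
    links: List[str] = []
    for target in requested:
        # Make relative targets absolute, then drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, target))
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Dropping the fragment is a design choice made here so that links differing only in their in-page anchor count once; a crawler that treats fragments as distinct routes would keep them.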
Please refer to FIG. 3, which illustrates a schematic structural diagram of a page proxy apparatus according to an embodiment of the present application; the embodiment of the present application provides a page proxy apparatus 300, including:
a response page obtaining module 310, configured to receive a page request sent by the crawler terminal, and obtain a response page corresponding to an access link in the page request;
a script code judging module 320, configured to judge whether the response page includes script code; and
a page injection sending module 330, configured to, if the response page includes the script code, load and render the response page using the browser, modify the script code in the rendered response page, inject the modified script code into the response page, and send the injected response page to the crawler terminal, so that the crawler terminal triggers the script code and obtains the resource link in the response page.
Optionally, in this embodiment of the present application, the page proxy apparatus further includes:
and the response page sending module is used for sending the response page to the crawler terminal if the response page does not comprise the script code.
Optionally, in this embodiment of the present application, the response page obtaining module includes:
and the page request sending module is used for sending a page request to the website server corresponding to the access link so that the website server returns a response page corresponding to the page request.
And the response page receiving module is used for receiving a response page sent by the website server.
Optionally, in this embodiment of the present application, the page request sending module includes:
and the access link analyzing module is used for analyzing a plurality of internet protocol addresses according to the domain name in the access link.
And the request load balancing module is used for sending the page requests to the plurality of internet protocol addresses in a load balancing mode.
Optionally, in an embodiment of the present application, the script code includes an add-event-listener function, and the page injection sending module includes:
A script override rewriting module, configured to override the add-event-listener function with a predefined event listener function to obtain rewritten script code, where the predefined event listener function is used to add a trigger event to a tag element in the response page when the response page is rendered, so that the crawler terminal obtains the resource links generated when the trigger event of the tag element is triggered.
A response page injection module, configured to inject the rewritten script code into the response page.
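One way to picture the override and injection is the Python sketch below, which inserts a small hook script ahead of the first `<script>` tag so that it runs before the page's own scripts register listeners. The hook body, the `window.__boundEvents` list, and the name `inject_hook` are assumptions made for illustration; the application does not disclose the actual rewritten script code. The hook wraps the standard browser method `EventTarget.prototype.addEventListener` and records each registration.

```python
import re

# Hypothetical hook: wraps addEventListener and records every (element, event type)
# registration so that a crawler could later trigger the recorded events.
HOOK_SCRIPT = """<script>
(function () {
  var original = EventTarget.prototype.addEventListener;
  window.__boundEvents = [];
  EventTarget.prototype.addEventListener = function (type, listener, options) {
    window.__boundEvents.push({ target: this, type: type });
    return original.call(this, type, listener, options);
  };
})();
</script>"""


def inject_hook(html: str, hook: str = HOOK_SCRIPT) -> str:
    """Insert the hook immediately before the first <script> tag of the page."""
    match = re.search(r"<script\b", html, flags=re.IGNORECASE)
    if match is None:
        return html  # no script tag found: the page is returned unchanged
    return html[: match.start()] + hook + html[match.start():]
```

Placing the hook before the first script tag is a design choice of this sketch: a listener registered by an earlier script would bypass the override and go unrecorded.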
Optionally, in this embodiment of the present application, the page injection sending module further includes:
A tag element acquisition module, configured to acquire all tag elements in the response page.
A tag element removing module, configured to remove, from all the tag elements in the response page, the tag elements that cannot trigger an event.
Optionally, in this embodiment of the present application, the tag element removing module is specifically configured to remove a tag element if the tag element has no bound event or is invisible.
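The removal rule can be sketched in Python as follows. The `TagElement` record and the `keep_triggerable` function are hypothetical; in practice the bound events and visibility of each element would be read from the rendered page rather than supplied by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TagElement:
    """Hypothetical record of a tag element collected from the rendered page."""
    tag: str
    bound_events: List[str] = field(default_factory=list)
    visible: bool = True


def keep_triggerable(elements: List[TagElement]) -> List[TagElement]:
    """Drop elements that have no bound event or that are invisible."""
    return [e for e in elements if e.bound_events and e.visible]
```

For example, given a visible button bound to `click`, a visible `div` with no bound event, and a hidden link bound to `click`, only the button is kept, which reduces the number of events the crawler terminal has to trigger.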
It should be understood that the apparatus corresponds to the above page proxy method embodiment and can perform the steps of that method embodiment. For the specific functions of the apparatus, refer to the description above; a detailed description is omitted here to avoid redundancy. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or built into the operating system (OS) of the apparatus.
Please refer to fig. 4, which is a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 400 provided in an embodiment of the present application includes: a processor 410 and a memory 420, the memory 420 storing machine-readable instructions executable by the processor 410, where the machine-readable instructions, when executed by the processor 410, perform the method described above.
The embodiment of the present application also provides a storage medium 430, where the storage medium 430 stores a computer program, and the computer program, when executed by the processor 410, performs the method described above.
The storage medium 430 may be implemented by any type of volatile or nonvolatile storage device or a combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatuses, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description covers only optional implementations of the embodiments of the present application, and the scope of protection of the embodiments of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the embodiments of the present application shall fall within the scope of protection of the embodiments of the present application.