Method and device for auditing web pages
Technical field
The present invention relates to data communication technology, and in particular to a method and device for web page audit.
Background technology
Now that network technology is widespread, enterprise employees access the Internet through the enterprise network to obtain all kinds of information. Employees' online behavior, however, can create security risks: visiting a website that carries a Trojan horse, for example, may lead to the leakage of confidential information. Web browsing audit is used to audit the web pages a user visits, to record the user's online behavior, and to apply control to illegal websites the user visits. Web browsing audit uses the gzip decompression algorithm (GNU zip, a free-software compression algorithm). gzip is a popular compression algorithm and has become a standard compression algorithm in HTTP (Hypertext Transfer Protocol).
In the existing web browsing audit scheme, a message undergoes deep inspection by the application recognition engine of the network device, and once the web browsing protocol is identified the message enters the web browsing audit framework. The framework parses the message to extract the HOST and URL of the visited page, analyzes the URL to filter out pictures, animations, Flash, scripts and similar information, and obtains the web page title either by gzip decompression or by modifying the request header of the user's page request. The HOST and URL are assembled into a complete web address and sent to the device for the administrator to view. Two methods are used to obtain the title. In the gzip decompression method, the content of the page accessed in each session is gzip-decompressed, and the title is searched for in the decompressed content. In the request header modification method, the web browsing audit framework obtains the request header when the user visits a page and modifies a parameter in it so that the server transmits the page content in plaintext, which lets the device read the web page title.
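As an illustration only, the request header modification method described above can be sketched as follows. The text does not say which parameter is modified; the sketch assumes it is the Accept-Encoding header, and the function name force_plaintext is hypothetical.

```python
def force_plaintext(request: bytes) -> bytes:
    """Rewrite a raw HTTP request so the server is asked for an uncompressed body.

    Assumption: the modified parameter is Accept-Encoding; the source text
    only says that "a parameter in the request header" is changed.
    """
    head, sep, body = request.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    kept = [lines[0]]  # request line is kept unchanged
    for line in lines[1:]:
        name = line.split(b":", 1)[0].strip().lower()
        if name != b"accept-encoding":
            kept.append(line)
    # "identity" asks the server not to apply any content coding.
    kept.append(b"Accept-Encoding: identity")
    return b"\r\n".join(kept) + sep + body
```

The cost of this method is that every page then travels uncompressed.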
The existing web browsing audit scheme has the following shortcomings:
(1) Only pictures, animations, Flash, scripts and similar information are filtered out by URL. Web pages today are generally split into parts, and one complete page may be split into dozens of pieces, so each visit to one page is equivalent to visiting many pages. Web browsing audit therefore produces a large number of useless logs, and the log false-positive rate is high.
(2) Web browsing audit needs to obtain the title of the page the user visits, and the existing technical schemes offer two ways to do so. One is to modify the header of the user's request; because uncompressed page content carries far more data in transit than compressed content, this approach reduces network speed. The other is to use the gzip decompression algorithm. Each session needs about 100K of memory when gzip decompression is used, so a device with a large number of sessions consumes a large amount of memory and runs short of it.
Because of these design limitations, the current web browsing audit has a high log false-positive rate and can hardly use gzip decompression to obtain web page titles when there are many sessions, which seriously affects both the accuracy of the logs and the performance of the network device.
Summary of the invention
The invention provides a web page audit device, applied in a network device and used to audit users' online behavior. The device comprises a URL extraction unit, a main/secondary link filter unit and a decompression processing unit, wherein:
the URL extraction unit is configured to extract the URL visited by the user from the user's HTTP request message and to submit the extracted URL to the main/secondary link filter unit;
the main/secondary link filter unit is configured to select from the submitted URLs, according to a predetermined rule, the main URLs that represent main links, and to submit the main URLs to the decompression processing unit;
the decompression processing unit is configured to decompress the messages of the user's visit to the main URL, to obtain from them characteristic information of the web page the main URL points to, and to save the characteristic information as audit log information.
Preferably, the network device has at least one physical CPU, the physical CPU is virtualized into a plurality of virtual CPUs, and the device further comprises a memory management unit configured to request memory in units of the virtual CPUs that run the decompression processing unit.
Preferably, the device further comprises an HTTP message recognition unit configured to identify HTTP request messages among the user's messages according to the features of HTTP request messages, and to submit the identified HTTP request messages to the URL extraction unit.
Preferably, the characteristic information is the web page title.
Preferably, the predetermined rule comprises: determining whether the URL carries URL parameters; if it carries none, the URL is determined to be a URL representing a main link; if it carries URL parameters, the URL is determined to be a secondary link and is discarded;
or determining whether the value of the "Content-Type" field in the HTTP request header is of the text/* type; if so, the link is determined to be a main link, and if not, a secondary link.
Preferably, the characteristic information is the web page title, and the predetermined rule further comprises: determining whether the web page pointed to by the main URL representing a main link has an extractable title, and if not, not saving a web page title.
The invention also provides a web page audit method, applied in a network device and used to audit users' online behavior. The method comprises:
Step A: extracting the URL visited by the user from the user's HTTP request message;
Step B: selecting from the URLs extracted in step A, according to a predetermined rule, the main URLs that represent main links;
Step C: decompressing the messages of the user's visit to the main URL, obtaining from them characteristic information of the web page the main URL points to, and saving the characteristic information as audit log information.
Preferably, the network device has at least one physical CPU, the physical CPU is virtualized into a plurality of virtual CPUs, and the memory for the decompression in step C is requested in units of the virtual CPUs that run the decompression.
Preferably, the method further comprises, before step A, a step D: identifying HTTP request messages among the user's messages according to the features of HTTP request messages.
Preferably, the characteristic information is the web page title.
Preferably, the predetermined rule comprises: determining whether the URL carries URL parameters; if it carries none, the URL is determined to be a URL representing a main link; if it carries URL parameters, the URL is determined to be a secondary link and is discarded;
or determining whether the value of the "Content-Type" field in the HTTP request header is of the text/* type; if so, the link is determined to be a main link, and if not, a secondary link.
Preferably, the characteristic information is the web page title, and the predetermined rule further comprises: determining whether the web page pointed to by the main URL representing a main link has an extractable title, and if not, not saving a web page title.
Compared with the prior art, the invention greatly reduces the resource consumption of the network device by distinguishing main links from secondary links, and further reduces memory consumption by requesting memory on a per-VCPU (virtual CPU) basis.
Description of drawings
Fig. 1 is a logical diagram of the basic network environment for web page audit.
Fig. 2 is a logical structure diagram of the web page audit device of the invention.
Fig. 3 is a schematic diagram of the basic format of an HTTP message.
Embodiment
In a typical enterprise network environment, the audit of users' online behavior is usually performed by the egress device, that is, the network device between the intranet and the external network, usually called a gateway. Besides its basic function of forwarding messages, an enterprise gateway usually carries out many applications, such as NAT (network address translation), security processing, QoS (quality of service), access control and online behavior audit. By design, these applications can be implemented in software or supported by inserting service boards. For complex applications, the latter is currently the more common implementation.
Referring to Fig. 1, in an enterprise network environment a plurality of users 200 access the Internet through a network device 100 (for example, an enterprise gateway), which is what is commonly called going online. In general, every message a user sends to the external network passes through the gateway, so the preferred implementation performs the audit of online behavior on the gateway; the traffic may also be diverted to another network device (such as a server) for auditing. Overall, the aim of the invention is to distinguish main links from secondary links by means of the URL in the HTTP message, so that the content of the pages a user visits is divided into main and secondary content and the audit effort is concentrated on the content accessed through main links, which significantly reduces the resources the device spends on web page audit.
Referring to Fig. 2, the logical structure diagram of the web page audit device 10 of the invention, the device comprises an HTTP message recognition unit 20, a URL extraction unit 30, a main/secondary link filter unit 40 and a decompression processing unit 50. When running, the device executes the following flow:
Step 101: identify HTTP request messages among the user's messages according to the features of HTTP request messages, and submit them to the URL extraction unit for processing. This step is performed by the HTTP message recognition unit 20.
Specifically, in step 101, user service messages are of many types, and the messages that show which websites a user visits are the HTTP requests the user initiates. HTTP request messages therefore first need to be selected from all of the user's messages. The selection is based on the features of HTTP request messages, which are defined by the HTTP protocol; common methods identify the messages by the well-known HTTP port 80 or by the HTTP message type. HTTP response messages are identified in the same way. Because the prior art gives sufficient teaching on this point and implementations already exist, it is not described in further detail here.
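A minimal sketch of the two identification features mentioned for step 101, given for illustration only. The function name looks_like_http_request, the method list, and the decision to accept a message when either feature matches are assumptions, not details taken from the text.

```python
# Assumed list of request methods; HTTP defines more than these.
HTTP_METHODS = {b"GET", b"POST", b"HEAD", b"PUT", b"DELETE", b"OPTIONS"}


def looks_like_http_request(payload: bytes, dst_port: int) -> bool:
    """Return True if a TCP payload is treated as an HTTP request message."""
    # Feature 1: the well-known HTTP port.
    if dst_port == 80:
        return True
    # Feature 2: the request line has the form "<method> <target> HTTP/<version>".
    request_line = payload.split(b"\r\n", 1)[0]
    parts = request_line.split(b" ")
    return (
        len(parts) == 3
        and parts[0] in HTTP_METHODS
        and parts[2].startswith(b"HTTP/")
    )
```

A real recognition engine would normally combine such checks with session state; the sketch only shows the per-message test.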
Step 102: extract the URL visited by the user from the user's HTTP request message, and submit the extracted URL to the main/secondary link filter unit. This step is performed by the URL extraction unit. Referring to Fig. 3, the HTTP protocol specifies that an HTTP request message contains a specific URL field. The URL extraction unit extracts the URL from the HTTP message according to the field that carries the URL.
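The extraction in step 102 might look like the following sketch, which assumes that the complete address is assembled from the request target in the request line and the Host header, as the background section describes for HOST and URL. The function name extract_url is hypothetical.

```python
from typing import Optional


def extract_url(request: bytes) -> Optional[str]:
    """Build the visited URL from the request line and the Host header."""
    head = request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3:
        return None  # not a well-formed request line
    target = parts[1]
    if target.startswith("http://"):
        return target  # absolute form, as sent to a proxy
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return "http://" + value.strip() + target
    return None  # no Host header, so no complete address can be formed
```

For example, a request line "GET /bbs/index.shtml HTTP/1.1" with "Host: www.tianya.cn" yields http://www.tianya.cn/bbs/index.shtml.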
Step 103: select from the submitted URLs, according to a predetermined rule, the main URLs that represent main links, and submit the main URLs to the decompression processing unit.
In general, step 102 extracts a large number of URLs, and these URLs represent many links. In practice there is no need to audit every URL in depth. A main link visited by the user, such as http://www.tianya.cn/bbs/index.shtml, does need in-depth audit, because such a link can generally be taken as typical of the user's behavior. Links that carry URL parameters usually have little reference value for auditing the user's behavior; for example, http://www.tianya.cn/new/publicforum/articleslist.asp?stritem=develop&strsubitcm=经济杂谈&part=0 can hardly be taken as typical of the user's behavior. The preferred way of the invention distinguishes main links from secondary links by whether the URL carries URL parameters: the extracted URL is searched, a URL that carries URL parameters (for example, a "?" in the link) is determined to be a secondary link, and a URL that carries none is determined to be a main link. Main links are then submitted to the decompression processing unit; for secondary links, the preferred way is to skip the audit altogether. Besides this approach, the decision can also be made from the "Content-Type" field in the HTTP request header: if the value of the "Content-Type" field is not of the text/* type, the link is determined to be a secondary link, and otherwise a main link. These two approaches are only two preferred implementations, and a person of ordinary skill in the art can devise other practical implementations based on the concept of the invention.
Step 104: decompress the HTTP response message that corresponds to the main URL, obtain from it the characteristic information of the web page the main URL points to, and save the characteristic information as audit log information.
Step 104 is performed by the decompression processing unit. For a main URL visited by the user, the server returns the corresponding HTTP response message. To save network bandwidth, the content of such response messages is usually compressed, most typically with the gzip compression algorithm. The response message therefore needs to be decompressed before the characteristic information of the page the URL points to can be obtained. In general, the title of the page can be used as its characteristic information; the most frequently occurring words or phrases can of course also be used. When no characteristic information can be extracted, none is saved, because such a page is hardly representative of the user's behavior and is of little value for audit.
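A minimal sketch of step 104 with the page title as the characteristic information. It assumes the whole response body is available at once and is marked as gzip by its Content-Encoding value; the function name extract_title and the UTF-8 decoding are assumptions.

```python
import gzip
import re
import zlib
from typing import Optional

TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(body: bytes, content_encoding: str = "") -> Optional[str]:
    """Decompress a response body if needed and return the page title, if any."""
    if content_encoding.strip().lower() == "gzip":
        try:
            body = gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            return None  # body is not valid gzip data; nothing to audit
    match = TITLE_RE.search(body)
    if match is None:
        return None  # no extractable title, so nothing is saved
    title = match.group(1).decode("utf-8", errors="replace").strip()
    return title or None
```

A device that processes the response message by message would use a streaming decompressor instead of decompressing the complete body in one call.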
The above steps can be implemented in computer software. In a software implementation, a memory request mechanism based on virtual CPUs can be introduced to further reduce the consumption of system resources; please refer to Fig. 2. The decompression processing unit usually runs on a CPU. A physical CPU can be virtualized into a plurality of virtual CPUs, and the virtual CPUs can each run decompression. In the conventional technique, decompression requests memory per session; if each session requests 100K of memory, memory consumption becomes severe when the system holds a large number of sessions, and the device's normal services are affected. The invention instead introduces a memory management unit for the decompression processing unit, which requests memory in units of the virtual CPUs that run the decompression step, thereby avoiding heavy consumption of the system's memory.
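The difference between the two memory request strategies can be sketched as follows, for illustration only. The sketch assumes that each virtual CPU handles one decompression at a time, so that one buffer per virtual CPU suffices; the class name PerVcpuBuffers and the counts used in the usage note are assumptions.

```python
BUFFER_BYTES = 100 * 1024  # about 100K per decompression context, as stated in the text


class PerVcpuBuffers:
    """One decompression buffer per virtual CPU, allocated once and reused."""

    def __init__(self, vcpu_count: int) -> None:
        self._buffers = [bytearray(BUFFER_BYTES) for _ in range(vcpu_count)]

    def buffer_for(self, vcpu_id: int) -> bytearray:
        # Every session decompressed on this virtual CPU shares the same buffer.
        return self._buffers[vcpu_id]

    def total_bytes(self) -> int:
        return len(self._buffers) * BUFFER_BYTES


def per_session_bytes(session_count: int) -> int:
    """Memory needed when every session requests its own buffer."""
    return session_count * BUFFER_BYTES
```

With an assumed 8 virtual CPUs the pool holds 819,200 bytes regardless of load, whereas 10,000 concurrent sessions with per-session buffers would need 1,024,000,000 bytes.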
The above describes only preferred implementations of the invention and is not intended to limit its scope of protection; any equivalent variation or modification shall fall within the scope of protection of the invention.