CN102004770A - Webpage auditing method and device - Google Patents

Webpage auditing method and device

Info

Publication number
CN102004770A
Authority
CN
China
Prior art keywords
url
main
user
message
link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010545074
Other languages
Chinese (zh)
Inventor
许志宏
张晓东
田涛
李晶楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou DPTech Technologies Co Ltd
Original Assignee
Hangzhou DPTech Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou DPTech Technologies Co Ltd
Priority to CN 201010545074
Publication of CN102004770A
Current legal status: Pending


Abstract

The invention provides a webpage auditing method and device, applied in network equipment and used to audit users' online behavior. The method comprises: (A) extracting the URLs (Uniform Resource Locators) accessed by a user from the user's HTTP (Hypertext Transfer Protocol) request messages; (B) filtering, according to a preset rule, the main URLs that represent main links out of the URLs extracted in step (A); and (C) decompressing the messages of the main URLs accessed by the user, acquiring from them the characteristic information of the webpage each main URL points to, and storing that characteristic information as audit log information. By distinguishing main links from secondary links, the invention greatly lowers the resource consumption of the network equipment; by requesting memory on a per-VCPU (virtual central processing unit) basis, it lowers memory consumption further.

Description

Webpage auditing method and device
Technical field
The present invention relates to data communication technology, and in particular to a method and device for webpage auditing.
Background technology
Now that network technology is widespread, enterprise employees access the Internet through the enterprise network to obtain all kinds of information. Employees' online behavior, however, can create security risks: visiting a website that carries a Trojan horse, for example, may lead to the leakage of confidential information. Web browsing audit is used to audit the webpages users visit, to record users' online behavior, and to apply controls to illegal websites that users visit. Web browsing audit uses the gzip decompression algorithm (GNU zip, a free-software compression format). gzip is a popular compression algorithm and has become a standard compression method in HTTP (Hypertext Transfer Protocol).
In existing web browsing audit schemes, a message undergoes deep inspection by the application identification engine of the network device and, once identified as web browsing traffic, enters the web browsing audit framework. The framework parses the message to extract the HOST and URL of the visited webpage, filters out pictures, animations, Flash, scripts and similar resources by analyzing the URL, and obtains the page title either by gzip decompression or by modifying the request header of the user's webpage request. The HOST and URL are then assembled into a complete web address and sent to the device for the administrator to review. Two methods are used to obtain the title. The gzip decompression method decompresses the page content of every session and searches the decompressed content for the title. The request-header modification method has the audit framework intercept the request header when the user requests a page and change a parameter in it so that the server sends the page content in plaintext (uncompressed); the device can then read the title directly.
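The request-header modification described above can be sketched as follows. The text does not name the parameter that is changed; this sketch assumes it is the standard `Accept-Encoding` request header, and the function name `force_plaintext` is hypothetical. It is an illustration of the prior-art idea, not the implementation used by any particular device.

```python
def force_plaintext(request_head: bytes) -> bytes:
    """Rewrite an HTTP request head so the server is asked for an uncompressed body.

    Assumption: the "parameter" mentioned in the text is Accept-Encoding.
    Only the header section should be passed in, never the request body.
    """
    rewritten = []
    for line in request_head.split(b"\r\n"):
        if line.lower().startswith(b"accept-encoding:"):
            # "identity" means no content coding, i.e. plaintext transfer.
            rewritten.append(b"Accept-Encoding: identity")
        else:
            rewritten.append(line)
    return b"\r\n".join(rewritten)
```

The cost of this approach is the one the text goes on to criticize: every page is then transferred uncompressed.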
Existing web browsing audit schemes have the following shortcomings:
(1) Only pictures, animations, Flash, scripts and similar resources are filtered out by URL. Webpages today are generally split into parts, and one complete page may be cut into dozens of pieces. Visiting one webpage is therefore equivalent to visiting several, so web browsing audit produces a large number of useless log entries and the log false-positive rate is high.
(2) Web browsing audit needs the title of each page the user visits, and existing solutions obtain it in one of two ways. One modifies the header of the user's request. Because the page content is then sent uncompressed, far more data is transferred than with compression, so this approach slows the network. The other uses gzip decompression. Each session needs roughly 100K of memory for gzip decompression, so when the device carries a large number of sessions, decompression occupies a great deal of memory and the device can run short of memory.
Because of these design limitations, current web browsing audit logs have a high false-positive rate, and gzip decompression is hard to use for obtaining page titles when there are many sessions; both seriously affect log accuracy and network device performance.
Summary of the invention
The invention provides a webpage audit device, applied in a network device and used to audit users' online behavior, comprising a URL extraction unit, a primary/secondary link filter unit, and a decompression processing unit, wherein:
the URL extraction unit is configured to extract the URL the user visits from the user's HTTP request message and to submit the extracted URL to the primary/secondary link filter unit;
the primary/secondary link filter unit is configured to filter, according to a predefined rule, the main URL that represents a main link out of the submitted URLs and to submit the main URL to the decompression processing unit;
the decompression processing unit is configured to decompress the message of the main URL the user visits, to obtain from it the characteristic information of the webpage the main URL points to, and to store the characteristic information as audit log information.
Preferably, the network device has at least one physical CPU, the physical CPU is virtualized into a plurality of virtual CPUs, and the device further comprises a memory management unit configured to request memory in units of the virtual CPUs that run the decompression processing unit.
Preferably, the device further comprises an HTTP message identification unit configured to identify HTTP request messages among the user's messages according to the features of an HTTP request message and to submit the identified HTTP request messages to the URL extraction unit.
Preferably, the characteristic information is the webpage title.
Preferably, the predefined rule comprises: judging whether the URL carries URL parameters; if it carries none, the URL is judged to represent a main link; if it carries URL parameters, the URL is judged to be a secondary link and is discarded;
or judging whether the value of the "Content-Type" field in the HTTP request header is of the text/* type; if so, the link is judged to be a main link; if not, a secondary link.
Preferably, the characteristic information is the webpage title, and the predefined rule further comprises: judging whether the webpage that the main URL points to has an extractable title and, if not, giving up saving the webpage title.
The invention also provides a webpage auditing method, applied in a network device and used to audit users' online behavior, the method comprising:
Step A: extracting the URL the user visits from the user's HTTP request message;
Step B: filtering, according to a predefined rule, the main URL that represents a main link out of the URLs extracted in step A;
Step C: decompressing the message of the main URL the user visits, obtaining from it the characteristic information of the webpage the main URL points to, and storing the characteristic information as audit log information.
Preferably, the network device has at least one physical CPU, the physical CPU is virtualized into a plurality of virtual CPUs, and in step C the memory for decompression is requested in units of the virtual CPUs that run the decompression.
Preferably, the method further comprises, before step A, step D: identifying HTTP request messages among the user's messages according to the features of an HTTP request message.
Preferably, the characteristic information is the webpage title.
Preferably, the predefined rule comprises: judging whether the URL carries URL parameters; if it carries none, the URL is judged to represent a main link; if it carries URL parameters, the URL is judged to be a secondary link and is discarded;
or judging whether the value of the "Content-Type" field in the HTTP request header is of the text/* type; if so, the link is judged to be a main link; if not, a secondary link.
Preferably, the characteristic information is the webpage title, and the predefined rule further comprises: judging whether the webpage that the main URL points to has an extractable title and, if not, giving up saving the webpage title.
Compared with the prior art, the invention greatly reduces the resource consumption of the network device by distinguishing main links from secondary links, and further reduces memory consumption by requesting memory on a per-VCPU basis.
Description of drawings
Fig. 1 is a logical diagram of the basic network environment for webpage auditing.
Fig. 2 is a logical structure diagram of the webpage audit device of the invention.
Fig. 3 is a schematic diagram of the basic format of an HTTP message.
Embodiment
In a typical enterprise network, auditing of users' online behavior is usually done by the egress device, that is, the network device between the intranet and the external network, commonly called a gateway. Besides its basic function of forwarding messages, an enterprise gateway usually runs many applications, such as NAT (network address translation), security processing, QoS (quality of service), access control, and online behavior auditing. In terms of design, these applications can be implemented in software or supported by plugging in service boards. For complex applications, the latter is currently the more common approach.
Referring to Fig. 1, in an enterprise network a plurality of users 200 access the Internet through a network device 100 (for example, an enterprise gateway); this is what is ordinarily meant by users going online. In general, every message a user sends to the external network passes through the gateway, so the preferred implementation performs the online-behavior audit on this device; alternatively, the traffic can be diverted to another network device (for example, a server) for auditing. Overall, the aim of the invention is to use the URL in the HTTP message to tell main links from secondary links, thereby separating the content the user visits into primary and secondary, and to concentrate the audit effort on the content of main links, which significantly reduces the resources the device spends on webpage auditing.
Referring to the logical structure diagram of the webpage audit device 10 of the invention shown in Fig. 2, the device comprises an HTTP message identification unit 20, a URL extraction unit 30, a primary/secondary link filter unit 40, and a decompression processing unit 50. When running, the device executes the following flow:
Step 101: identify HTTP request messages among the user's messages according to the features of an HTTP request message, and submit them to the URL extraction unit for processing. This step is performed by the HTTP message identification unit 20.
Specifically, in step 101, a user's traffic contains many types of messages, and the ones that show which website the user is visiting are the HTTP requests the user initiates. The HTTP request messages must therefore first be filtered out of all the user's messages. The filter is based on the features of an HTTP request message, which are defined by the HTTP protocol: common methods identify messages by the well-known HTTP port 80 or by the HTTP message type. HTTP response messages are identified in the same way. Because the prior art gives ample guidance here and implementations already exist, the invention does not elaborate further.
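The two identification features mentioned in step 101 can be sketched as two small predicates. This is a minimal illustration only: the function names, the method list, and the idea of inspecting the first line of the TCP payload are assumptions made for the example, not details given in the text.

```python
# Request methods defined by HTTP/1.1; the list is an assumption of this sketch.
HTTP_METHODS = (b"GET", b"POST", b"HEAD", b"PUT", b"DELETE",
                b"OPTIONS", b"TRACE", b"CONNECT")


def matches_http_port(dst_port: int) -> bool:
    """Feature 1: the destination port is the well-known HTTP port 80."""
    return dst_port == 80


def matches_request_line(payload: bytes) -> bool:
    """Feature 2: the payload starts with an HTTP request line.

    A request line has the form b"METHOD request-target HTTP/x.y".
    """
    request_line = payload.split(b"\r\n", 1)[0]
    parts = request_line.split(b" ")
    return (len(parts) == 3
            and parts[0] in HTTP_METHODS
            and parts[2].startswith(b"HTTP/"))
```

The port check is cheap but misses HTTP on other ports, while the request-line check costs a little more per message; a device could reasonably use either, as the text suggests.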
Step 102: extract the URL the user visits from the user's HTTP request message and submit the extracted URL to the primary/secondary link filter unit. This step is performed by the URL extraction unit. Referring to Fig. 3, the HTTP protocol specifies a URL field in the HTTP request message, and the URL extraction unit extracts the URL from the field of the HTTP message that carries it.
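A minimal sketch of step 102 follows. It assumes the complete address is assembled from the request target on the request line and the `Host` header, as the background section describes for HOST and URL; the function name `extract_url` and the fixed `http://` scheme are assumptions of this example.

```python
from typing import Optional


def extract_url(request: bytes) -> Optional[str]:
    """Build the visited URL from an HTTP request's request line and Host header."""
    # Only the header section matters; ISO-8859-1 decoding never fails on bytes.
    head = request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3:
        return None  # not a well-formed request line
    target = parts[1]
    if target.startswith("http://") or target.startswith("https://"):
        return target  # absolute-form target, as sent to a proxy
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            return "http://" + value.strip() + target
    return None  # no Host header, so no complete address can be formed
```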
Step 103: filter, according to a predefined rule, the main URL that represents a main link out of the submitted URLs, and submit the main URL to the decompression processing unit.
In general, step 102 extracts a large number of URLs, and these URLs represent many links. In practice, however, there is no need to audit every URL in depth. A main link the user visits, such as http://www.tianya.cn/bbs/index.shtml, does warrant in-depth auditing, because such a link generally serves as a typical representative of the user's behavior. Links that carry URL parameters usually have little reference value for auditing the user's behavior; for example, http://www.tianya.cn/new/publicforum/articleslist.asp?stritem=develop&strsubitem=经济杂谈&part=0 is hard to treat as a typical representative of user behavior. The preferred way in the invention is to distinguish main links from secondary links by whether the URL carries URL parameters: the extracted URL is searched, a URL that carries URL parameters (indicated, for example, by a "?" in the link) is judged to be a secondary link, and a URL that carries none is judged to be a main link. Main links are submitted to the decompression processing unit; for secondary links, the preferred way is to skip auditing altogether. Besides this way, the judgment can also be made from "Content-Type" in the HTTP request header: if the value of the "Content-Type" field is not of the text/* type, the link is judged to be a secondary link; if it is, the link is judged to be a main link. These two ways are only two preferred implementations; a person of ordinary skill in the art can devise other practical implementations based on the concept of the invention.
Step 104: decompress the HTTP response message that corresponds to the main URL, obtain from it the characteristic information of the webpage the main URL points to, and store the characteristic information as audit log information.
Step 104 is performed by the decompression processing unit. For a main URL the user visits, the server returns the corresponding HTTP response message. To save network bandwidth, the content of these response messages is usually compressed, most typically with the gzip compression algorithm. The response message therefore has to be decompressed before the characteristic information of the webpage the URL points to can be obtained from it. Normally the webpage title is used as the characteristic information; the most frequently occurring words or phrases can also serve as the characteristic information. When no characteristic information can be extracted, saving it is abandoned, because such a webpage is hard to treat as a typical representative of user behavior and has little value for auditing.
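Step 104 can be sketched with Python's standard `gzip` and `re` modules. The function name `extract_title`, the whole-body (non-streaming) decompression, and the UTF-8 decoding of the title are simplifying assumptions of this example; a real device would work on reassembled packet data and would need to honor the page's declared character set.

```python
import gzip
import re
import zlib
from typing import Optional

# Matches the first <title>...</title> element, case-insensitively.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(body: bytes, content_encoding: str = "") -> Optional[str]:
    """Return the page title from a response body, or None if none can be extracted."""
    if content_encoding.strip().lower() == "gzip":
        try:
            body = gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            return None  # corrupt or truncated data: give up on this page
    match = TITLE_RE.search(body)
    if match is None:
        return None
    # Assumption: the title is UTF-8; undecodable bytes are replaced.
    title = match.group(1).decode("utf-8", errors="replace").strip()
    return title or None
```

Returning `None` corresponds to the case in the text where no characteristic information can be extracted and nothing is saved to the audit log.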
The steps above can be implemented in computer software. In a software implementation, a memory request mechanism based on virtual CPUs (VCPUs) can be introduced to save further system resources; refer to Fig. 2. The decompression processing unit usually runs on a CPU. A physical CPU can be virtualized into a plurality of virtual CPUs, and each virtual CPU can run decompression separately. In conventional techniques, decompression requests memory per session: if each session requests 100K of memory, memory consumption becomes very heavy when the system holds a large number of sessions and affects the normal services of the device. The invention instead introduces a memory management unit for the decompression processing unit, which requests memory in units of the virtual CPUs that run the decompression step and thereby avoids exhausting the system's memory.
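The difference between per-session and per-VCPU memory requests can be illustrated with a small sketch. The 100K buffer size comes from the text; the class name `VcpuBufferPool`, the session and VCPU counts in the usage note, and the assumption that each virtual CPU handles one decompression at a time (so its buffer can be reused across sessions) are all illustrative and not stated in the source.

```python
BUFFER_BYTES = 100 * 1024  # roughly the 100K per decompression context cited in the text


def per_session_bytes(sessions: int) -> int:
    """Memory needed when every session requests its own decompression buffer."""
    return sessions * BUFFER_BYTES


def per_vcpu_bytes(vcpus: int) -> int:
    """Memory needed when each virtual CPU requests one reusable buffer."""
    return vcpus * BUFFER_BYTES


class VcpuBufferPool:
    """One buffer per virtual CPU, allocated once and reused for every session."""

    def __init__(self, vcpus: int) -> None:
        self._buffers = [bytearray(BUFFER_BYTES) for _ in range(vcpus)]

    def buffer_for(self, vcpu_id: int) -> bytearray:
        # Assumption: a virtual CPU runs at most one decompression at a time.
        return self._buffers[vcpu_id]
```

Under these assumptions, 10,000 concurrent sessions would need `per_session_bytes(10000)` bytes (about 1 GB), while 8 virtual CPUs would need `per_vcpu_bytes(8)` bytes (800K) regardless of the session count.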
The above describes only preferred implementations of the invention and is not intended to limit its scope of protection; any equivalent variation or modification shall fall within the scope of protection of the invention.

Claims (12)

CN 201010545074 | 2010-11-16 | 2010-11-16 | Webpage auditing method and device | Pending | CN102004770A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN 201010545074 (CN102004770A (en)) | 2010-11-16 | 2010-11-16 | Webpage auditing method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN 201010545074 (CN102004770A (en)) | 2010-11-16 | 2010-11-16 | Webpage auditing method and device

Publications (1)

Publication Number | Publication Date
CN102004770A | 2011-04-06

Family

ID=43812132

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN 201010545074 (Pending, CN102004770A (en)) | Webpage auditing method and device | 2010-11-16 | 2010-11-16

Country Status (1)

Country | Link
CN (1) | CN102004770A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101461247A (en) * | 2006-06-08 | 2009-06-17 | 高通股份有限公司 | Parallel batch decoding of video blocks
CN101656710A (en) * | 2008-08-21 | 2010-02-24 | 中联绿盟信息技术(北京)有限公司 | Proactive audit system and method
CN101355587A (en) * | 2008-09-17 | 2009-01-28 | 杭州华三通信技术有限公司 | Method and apparatus for obtaining URL information as well as method and system for implementing searching engine
CN101656677A (en) * | 2009-09-18 | 2010-02-24 | 杭州迪普科技有限公司 | Message diversion processing method and device

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102780681A (en) * | 2011-05-11 | 2012-11-14 | 中兴通讯股份有限公司 | URL (Uniform Resource Locator) filtering system and URL filtering method
WO2012151843A1 (en) * | 2011-05-11 | 2012-11-15 | 中兴通讯股份有限公司 | Ulr filtering system, method and gateway
CN102726026A (en) * | 2011-12-30 | 2012-10-10 | 华为技术有限公司 | Method, equipment and system for acquiring user behavior
CN102726026B (en) * | 2011-12-30 | 2015-11-25 | 华为技术有限公司 | A kind of acquisition methods of user behavior, equipment and system
WO2013097201A1 (en) * | 2011-12-30 | 2013-07-04 | 华为技术有限公司 | Method, device and system for acquiring user behavior
CN102857388A (en) * | 2012-07-12 | 2013-01-02 | 上海云辰信息科技有限公司 | Cloud detection safety management auditing system
CN102932400B (en) * | 2012-07-20 | 2015-06-17 | 北京网康科技有限公司 | Method and device for identifying uniform resource locator primary links
CN102932400A (en) * | 2012-07-20 | 2013-02-13 | 北京网康科技有限公司 | Method and device for identifying uniform resource locator primary links
CN102857572A (en) * | 2012-09-14 | 2013-01-02 | 北京星网锐捷网络技术有限公司 | Method and device for processing HTTP (hyper text transport protocol) access request and gateway equipment
CN103825772B (en) * | 2012-11-16 | 2017-06-06 | 华为技术有限公司 | Identifying user clicks on the method and gateway device of behavior
CN103078854B (en) * | 2012-12-28 | 2016-04-13 | 北京亿赞普网络技术有限公司 | Message filtering method and device
CN103078854A (en) * | 2012-12-28 | 2013-05-01 | 北京亿赞普网络技术有限公司 | Message filtering method and device
CN103118007B (en) * | 2013-01-06 | 2016-02-03 | 瑞斯康达科技发展股份有限公司 | A kind of acquisition methods of user access activity and system
CN103118007A (en) * | 2013-01-06 | 2013-05-22 | 瑞斯康达科技发展股份有限公司 | Method and system of acquiring user access behavior
CN103338260B (en) * | 2013-07-04 | 2016-05-25 | 武汉世纪金桥安全技术有限公司 | The distributed analysis system of URL daily record and analytical method in network audit
CN103338260A (en) * | 2013-07-04 | 2013-10-02 | 武汉世纪金桥安全技术有限公司 | Distributed analytical system and analytical method for URL logs in network auditing
CN103973812B (en) * | 2014-05-23 | 2018-05-25 | 上海斐讯数据通信技术有限公司 | Service interface providing method and system based on uniform resource locator in http protocol
CN103973812A (en) * | 2014-05-23 | 2014-08-06 | 上海斐讯数据通信技术有限公司 | Service interface providing method and system based on uniform resource locator in HTTP
CN104270358A (en) * | 2014-09-25 | 2015-01-07 | 同济大学 | Trusted network transaction system client monitor and its implementation method
CN104270358B (en) * | 2014-09-25 | 2018-10-26 | 同济大学 | Trustable network transaction system client monitor and its implementation
CN106250497A (en) * | 2016-08-02 | 2016-12-21 | 北京集奥聚合科技有限公司 | A kind of analysis method of APP application shop search key
CN108429624A (en) * | 2016-12-21 | 2018-08-21 | 迈普通信技术股份有限公司 | A kind of QOS dynamic adjusting methods, equipment and system
CN108429624B (en) * | 2016-12-21 | 2022-07-26 | 迈普通信技术股份有限公司 | QOS dynamic adjustment method, equipment and system
CN111131187A (en) * | 2019-12-07 | 2020-05-08 | 杭州安恒信息技术股份有限公司 | A Web Audit Method Based on Action Set
CN111131187B (en) * | 2019-12-07 | 2022-03-25 | 杭州安恒信息技术股份有限公司 | A Web Audit Method Based on Action Set
CN111327634A (en) * | 2020-03-09 | 2020-06-23 | 深信服科技股份有限公司 | Website access supervision method, secure socket layer agent device, terminal and system

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C12 | Rejection of a patent application after its publication
RJ01 | Rejection of invention patent application after publication

Application publication date: 2011-04-06

