Embodiment
The technical solution of the present invention is further described below with reference to the accompanying drawings and by way of embodiments.
Fig. 1 is a flowchart of the real-time online advertisement cheating monitoring method provided by an embodiment of the present invention. The method comprises:
S101: acquire web page information logs in real time.
The web page information logs in this embodiment include web page display log data and click log data. In the display log data, the data available for analysis include features such as the web page on which content is placed online and the source (Referer) page of that web page, user agent (User Agent) information, display time, user source IP, user Cookie, and user geographic location. The click log data include information such as click-through rate, click time, and click position (the position within the page corresponding to the user's mouse click). These log data are acquired and analyzed in real time to determine whether the online web page data are abnormal.
S102: perform real-time statistical analysis on the display log data or click log data in the web page information logs using a sliding time window, and report feature anomalies in the display log data or click log data.
At present, most summary statistics over data streams are based on the landmark window model. This model assumes that all data that have arrived so far are equally important, so summary statistics built on a landmark window reflect the state of the entire stream. When analyzing the overall condition of a network, the details of old data are not needed. In particular, when online web page information is monitored in real time as in this embodiment, attention should focus on the most recently arrived data, i.e., statistics for the most recent hour or the most recent few minutes, rather than on historical data. This embodiment therefore adopts a sliding time window model.
The sliding time window takes the basic time window as its unit. It consists of at least two consecutive basic time windows of equal width, and its total time width is fixed.
The sliding time window is made up of the most recent basic time windows, at least two in number. In other words, at any given moment the sliding time window is the set of the latest basic time windows, and it is divided into a number of consecutive basic time windows of equal width.
When a new basic time window arrives, expired basic time windows are moved out. A basic time window is expired if the difference between its time range and the current time exceeds the time width of the sliding time window.
After the web page information logs covering a time width of one basic time window have been acquired in S101, the sliding time window moves in the new basic time window, discards the expired basic time window, and statistical analysis is performed on the web page information logs corresponding to the current sliding time window.
For example, suppose the sliding time window is 10 minutes wide and each basic time window is 1 minute wide. The sliding time window then consists of 10 basic time windows arranged consecutively in order of arrival. After the web page log data for a new 1-minute basic time window have been acquired, the sliding time window moves in this newest basic time window, moves out the earliest basic time window, and the web page information logs within the time range of the current sliding time window are obtained.
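The 10-minute example above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `SlidingTimeWindow`, its methods, and the toy record format are assumptions introduced here, and a bounded `deque` is only one convenient way to expire the oldest basic time window.

```python
from collections import deque


class SlidingTimeWindow:
    """Fixed-width sliding window made of equal-width basic time windows."""

    def __init__(self, num_basic_windows):
        if num_basic_windows < 2:
            raise ValueError("a sliding window needs at least two basic windows")
        # A deque with maxlen drops its oldest element automatically
        # when a new one is appended to a full deque.
        self.windows = deque(maxlen=num_basic_windows)

    def push(self, basic_window_records):
        """Move in the newest basic window; the oldest one expires if full."""
        self.windows.append(list(basic_window_records))

    def records(self):
        """All log records currently covered by the sliding window."""
        return [rec for window in self.windows for rec in window]


# Ten 1-minute basic windows form a 10-minute sliding window.
window = SlidingTimeWindow(num_basic_windows=10)
for minute in range(12):
    window.push([{"minute": minute}])  # one toy record per minute

covered_minutes = [rec["minute"] for rec in window.records()]
print(covered_minutes)
```

After twelve 1-minute batches, only the ten most recent ones remain in the window, so statistics computed over `window.records()` never include data older than the window width.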
Using the display log data or click log data in the acquired web page information logs, the online web page is monitored in real time along different dimensions.
The distribution of the online web page is counted separately by user Cookie (used to uniquely identify a user), IP, IP3 (the first three fields of the IP address), region, and time. If the distribution is too concentrated or too even in any dimension, the data can be judged abnormal.
Automated software requests and proxy cheating can be detected with this method. Whether a distribution is too concentrated or too even can be determined by computing the concentration of the online web page in each of the above dimensions. The concentration of a distribution can be measured by information entropy, by variance, or against a set threshold. For example, when the concentration is greater than a first preset threshold, the distribution is considered too concentrated; when the concentration is less than a second preset threshold, the distribution is considered too even.
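One way to realize the entropy-based concentration check is sketched below. The text does not fix a formula, so the definition used here (one minus the normalized Shannon entropy) is an assumption, as are the function names and the two threshold values, which are arbitrary placeholders.

```python
import math
from collections import Counter


def ip3(ip):
    """First three fields of a dotted IPv4 address, e.g. '1.2.3.4' -> '1.2.3'."""
    return ".".join(ip.split(".")[:3])


def concentration(values):
    """Assumed measure: 1 - normalized entropy (0 = perfectly even, 1 = one value)."""
    counts = Counter(values)
    if not counts:
        raise ValueError("no values to analyze")
    if len(counts) == 1:
        return 1.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(len(counts))


def classify(values, first_threshold=0.5, second_threshold=0.01):
    """Compare the concentration against the two preset thresholds."""
    score = concentration(values)
    if score > first_threshold:
        return "too concentrated"
    if score < second_threshold:
        return "too even"
    return "normal"


# Toy traffic: 97 displays from one /24 block plus three from other blocks.
ips = ["10.0.0.%d" % (i % 250) for i in range(97)] + ["10.0.1.1", "10.0.2.1", "10.0.3.1"]
print(classify([ip3(ip) for ip in ips]))
```

The same `classify` call can be applied to any of the dimensions named above (Cookie, IP, IP3, region, or time bucket) by passing the corresponding field of each display record.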
In this embodiment, the distribution of the pages visited by users is counted from the online placement pages recorded in the display log data. If the distribution of page visits is too concentrated or too even, a data anomaly is reported. This method can effectively monitor distributed cheating on the online web page. Whether the distribution of page visits is too concentrated or too even can likewise be determined from the concentration, using the same decision method as described above.
In this embodiment, the source page information of the pages visited by users is counted from the display log data. If the number of pages whose source page is empty is greater than a preset number, a data anomaly is reported.
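The empty-source-page check can be sketched in a few lines. The record layout (a dictionary with a `referer` key) and the function names are hypothetical; only the rule itself, reporting an anomaly when the count of empty source pages exceeds a preset number, comes from the text.

```python
def count_empty_source_pages(display_records):
    """Number of display records whose source (Referer) page is missing or blank."""
    return sum(
        1 for rec in display_records if not (rec.get("referer") or "").strip()
    )


def source_page_anomaly(display_records, preset_number):
    """True when more than preset_number records have an empty source page."""
    return count_empty_source_pages(display_records) > preset_number


records = [
    {"page": "/a", "referer": "http://example.com/home"},
    {"page": "/a", "referer": ""},
    {"page": "/b", "referer": None},
    {"page": "/b"},  # field absent entirely
]
print(count_empty_source_pages(records), source_page_anomaly(records, preset_number=2))
```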
In this embodiment, the click-through rate and click position information are counted from the click log data in the web page information logs. Invalid clicks are removed according to the click position information, and the click-through rate is then computed at different time granularities to detect abnormal click behavior on the online web page.
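The following sketch shows one possible reading of this step. The text does not say how a click position is judged invalid, so the rule used here (a click is valid only if it falls inside the advertisement's rectangle) is an assumption, as are the record fields `time`, `x`, `y` and the function names.

```python
from collections import Counter


def is_valid_click(click, ad_box):
    """Assumed rule: a click counts only if it lands inside the ad rectangle."""
    left, top, right, bottom = ad_box
    return left <= click["x"] <= right and top <= click["y"] <= bottom


def ctr_by_bucket(display_times, clicks, ad_box, bucket_seconds):
    """Click-through rate per time bucket of the given granularity."""
    shown = Counter(t // bucket_seconds for t in display_times)
    clicked = Counter(
        c["time"] // bucket_seconds for c in clicks if is_valid_click(c, ad_box)
    )
    return {bucket: clicked[bucket] / shown[bucket] for bucket in sorted(shown)}


ad_box = (0, 0, 300, 250)  # left, top, right, bottom in page pixels
display_times = [0, 10, 20, 30, 70, 80, 90, 100]  # seconds
clicks = [
    {"time": 15, "x": 120, "y": 60},
    {"time": 75, "x": 900, "y": 40},  # outside the ad, removed as invalid
    {"time": 95, "x": 10, "y": 10},
]

per_minute = ctr_by_bucket(display_times, clicks, ad_box, bucket_seconds=60)
per_two_minutes = ctr_by_bucket(display_times, clicks, ad_box, bucket_seconds=120)
print(per_minute, per_two_minutes)
```

Running the same data through several values of `bucket_seconds` gives the click-through rate at different time granularities, which can then be compared against expected ranges.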
For example, the method for real-time monitoring of online web page information provided by the above embodiment can be applied to anti-cheating analysis of online advertising.
(1) Anti-cheating analysis based on region information:
The object of analysis is a provincial regional website in Zhejiang Province.
Monitoring found that most of the traffic of this regional (Zhejiang) website came from Guangxi, which strongly suggests cheating.
(2) Anti-cheating analysis based on time information:
The object of analysis is a lifestyle search service website.
Figure 2 shows the traffic distribution of this website over the 24 hours of one day. A considerable share of the website's traffic clearly falls between 0:00 and 6:00 in the early morning, which does not match the traffic distribution expected for this type of website, so cheating is present. Traffic that is too even is also abnormal and may indicate timed traffic inflation (a fixed amount of traffic generated at fixed intervals).
(3) Anti-cheating analysis using IP/IP3:
The object of analysis is a community forum under a certain website. The monitoring results for August 25, 2012 are as follows:
IP               Display count    Share of site-wide page displays
218.11.30.195    5458             0.277987165122
The share of this single IP address is too high (compared with the threshold set for this ratio), so cheating can be considered to be present.
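The per-IP share check can be sketched as below. The function name, the toy data, and the 0.2 threshold are assumptions chosen for illustration; the text only states that the share is compared with a preset threshold. For reference, the share reported above corresponds to 5458 displays out of a site-wide total of about 19,634.

```python
from collections import Counter


def keys_over_share(keys, max_share):
    """Return each key whose share of all observations exceeds max_share."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {
        key: count / total
        for key, count in counts.items()
        if count / total > max_share
    }


# Toy display log: one IP produces 3 of 10 page displays.
display_ips = ["218.11.30.195"] * 3 + ["10.0.0.%d" % i for i in range(7)]
suspects = keys_over_share(display_ips, max_share=0.2)
print(suspects)
```

Passing IP3 prefixes instead of full addresses applies the same check at the level of the first three address fields.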
(4) Anti-cheating analysis based on user Cookie:
The object of analysis is a domain name error correction service website. The monitoring results for August 17, 2012 are as follows:
Display count: 353692
Click count: 114
Analysis by user Cookie found that the following individual users requested this online advertisement too many times, and they were judged to be cheating users:
(5) Anti-cheating statistical analysis based on web page access information:
The object of analysis is an IT information service website. The monitoring results for hour 04 on May 24, 2012 are as follows:
Number of web pages carrying the advertisement: 181
Display count: 2942
List of web pages suspected of display cheating:
Page visits are too concentrated on the above three pages, so cheating can be considered to be present.
(6) Anti-cheating statistical analysis based on click-through rate:
The object of analysis is an IT information service website. The monitoring results for hour 14 on May 28, 2012 are as follows:
From the click counts of the above three web pages, it can be inferred that the display counts are clearly unreasonable, so cheating is present.
When the method for real-time monitoring of online web page information provided by this embodiment is applied to online advertising anti-cheating, it can effectively detect automated software access cheating, nested iframe cheating, proxy cheating, fake web page cheating, and distributed cheating.
In addition, in this embodiment, after the web page information logs have been acquired in real time through the sliding time window, traffic from web crawlers can first be removed according to the user agent (User Agent) information. This improves the accuracy of the statistics and therefore the accuracy of the real-time monitoring results for the online web page.
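Crawler removal by User Agent can be sketched as a simple substring filter. The marker list below is a hypothetical, deliberately short example and the `user_agent` field name is assumed; a production filter would need a maintained list of crawler signatures.

```python
# Hypothetical, incomplete list of substrings that mark crawler User Agents.
CRAWLER_MARKERS = ("bot", "spider", "crawler")


def is_crawler(user_agent):
    """True if the User Agent string contains any known crawler marker."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def drop_crawler_traffic(records):
    """Keep only log records that do not come from a recognized crawler."""
    return [rec for rec in records if not is_crawler(rec.get("user_agent"))]


logs = [
    {"ip": "10.0.0.1", "user_agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"},
    {"ip": "10.0.0.2", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"ip": "10.0.0.3", "user_agent": "Baiduspider"},
]
human_logs = drop_crawler_traffic(logs)
print([rec["ip"] for rec in human_logs])
```

Applying this filter before the statistics are computed keeps crawler requests from distorting the display counts and distributions.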
This embodiment provides a real-time online web page detection method. It processes, in real time, the information of web pages placed online, the display volume, the click information, and the IP and region features of visiting users, so that abnormalities in the display log data and click log data can be found promptly.
Correspondingly, an embodiment of the present invention provides a device for real-time monitoring of online web page information. As shown in Figure 2, the device comprises:
an acquisition module 30, configured to acquire web page information logs in real time;
an analysis module 31, configured to perform real-time statistical analysis on the display log data or click log data in the web page information logs using a sliding time window, and to report feature anomalies in the display log data or click log data.
The sliding time window takes the basic time window as its unit. It consists of at least two consecutive basic time windows of equal width, and its total time width is fixed. After the web page information logs covering a time width of one basic time window have been acquired, the sliding time window moves in the new basic time window, and statistical analysis is performed on the web page information logs corresponding to the current sliding time window.
The analysis module 31 further comprises a placement distribution statistics submodule 310.
The placement distribution statistics submodule 310 is configured to count the distribution of the online web page separately by user Cookie, IP, IP3, region, and time in the display log data, and to compute the concentration of the online web page in each of these dimensions. If the concentration in at least one dimension is greater than the first preset threshold or less than the second preset threshold, a data anomaly is reported.
The analysis module 31 further comprises an access distribution statistics submodule 311.
The access distribution statistics submodule 311 is configured to count the distribution of the pages visited by users from the online placement pages recorded in the display log data, and to compute the concentration of the pages visited by users. If the concentration is greater than the first preset threshold or less than the second preset threshold, a data anomaly is reported.
The analysis module 31 further comprises a source page statistics submodule 312.
The source page statistics submodule 312 is configured to count the source page information of the pages visited by users. If the number of pages whose source page is empty is greater than a preset number, a data anomaly is reported.
The analysis module 31 further comprises a click information statistics submodule 313.
The click information statistics submodule 313 is configured to count the click-through rate and click position information of the online web page, remove invalid clicks according to the click position information, and compute the click-through rate at different time granularities to detect click cheating.
In another preferred embodiment, the device further comprises a removal module 32.
The removal module 32 is configured to remove traffic from web crawlers according to the user agent (User Agent) information in the web page information logs.
With the technical solution of the present invention, online web page information logs can be acquired and analyzed in real time, achieving real-time monitoring of online web page information. More specifically, the solution processes, in real time, the information of web pages placed online, the display volume, the click information, and the IP and region features of visiting users, so that abnormalities in the display log data and click log data can be found promptly. When applied to online advertising anti-cheating, it can effectively detect automated software access cheating, nested iframe cheating, proxy cheating, fake web page cheating, distributed cheating, and similar situations.
Those skilled in the art should understand that each module or step in the above embodiments of the present invention can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any variation or replacement that a person familiar with this technology can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.