TWI503675B


Info

Publication number: TWI503675B
Application number: TW099139195A
Authority: TW
Original assignee: Alibaba Group Holding Ltd
Priority date: 2010-11-15
Filing date: 2010-11-15
Publication date: 2015-10-11
Also published as: TW201220083A

Description

Method and device for predicting the user behavior count of words

The present invention relates to the Internet field, and in particular to a method and device for predicting the user behavior count of words on a website.

In the Internet field, the traffic and click count of a website or search engine generally vary in a regular pattern and can be predicted effectively from historical data, whereas the traffic and click count of an individual word generally do not vary regularly. Several basic concepts are clarified here. The traffic of a word is the number of times the word is searched on a website or search engine within a set time period. The click count of a word is the number of times the word is clicked on a website or search engine within a set time period. Website traffic is the sum of the traffic of all words on a website or search engine within a set time period. Website click count is the sum of the click counts of all words on a website or search engine within a set time period. The time period can be set flexibly according to actual needs and is typically one day.

In the embodiments of the present invention, the traffic or click count of a word is collectively referred to as the user behavior count of the word. In the prior art, for words whose user behavior count changes little from one time period to the next, the mean of the user behavior counts over a preceding span of time periods can serve as the predicted user behavior count for the current time period. For words whose user behavior count varies regularly over time, the variation can be modeled with a time series model, or an existing prediction algorithm (for example machine learning or data envelopment analysis) can be applied, to obtain the predicted user behavior count.

The prior-art methods for predicting the traffic and click count of words have the following problems. It is hard to judge how much a word's user behavior count varies over time and whether the variation is regular, so an effective prediction algorithm cannot be chosen accurately and prediction reliability is low. Only sequences that satisfy certain requirements can be predicted with a time series model, and real user behavior count sequences generally do not satisfy them; using prediction algorithms other than time series models imposes a heavy computational load and high computational complexity and consumes considerable device performance. In the Internet field the number of words is very large, so a separate prediction model cannot be built for each word, while building prediction models per category tends to degrade performance and lower prediction accuracy. Accurate prediction of future data, however, lets a website operator know how much website traffic and how many clicks the web servers will have to absorb, so that server operation can be adjusted accordingly. For example, if website traffic and clicks rise sharply, the servers need to be scaled out; if they fall, idle servers can be used for other business needs. In summary, existing methods for predicting the traffic and click count of words have low accuracy and reliability, and they impose a heavy computational load, high computational complexity, and considerable performance cost on the device.

The embodiments of the present invention provide a method and device for predicting the user behavior count of words, intended to solve the problems of existing prediction methods: low prediction accuracy and reliability, heavy computational load, high computational complexity, and considerable performance cost on the device.

An embodiment of the present invention provides a method for predicting the user behavior count of a word, comprising: converting the historical data sequence of the word's user behavior count from the time domain to the frequency domain; determining, from the resulting frequency-domain sequence, each estimated period of the historical data sequence and its influence value; judging, from each estimated period and its influence value, whether the historical data sequence satisfies a stationary sequence criterion; if so, taking the mean of the user behavior counts of several historical data points preceding the predicted point as the user behavior count of the predicted point; otherwise, selecting a main period and a singular point of the historical data sequence according to each estimated period and its influence value, and obtaining the user behavior count of the predicted point based on the selected main period and singular point.

An embodiment of the present invention provides a device for predicting the user behavior count of a word, comprising: a conversion unit, configured to convert the historical data sequence of the word's user behavior count from the time domain to the frequency domain; a determining unit, configured to determine, from the resulting frequency-domain sequence, each estimated period of the historical data sequence and its influence value; a judging unit, configured to judge, from each estimated period and its influence value, whether the historical data sequence satisfies the stationary sequence criterion, and to classify the sequence as stationary if it does and as non-stationary otherwise; a first prediction unit, configured, for a stationary sequence, to take the mean of the user behavior counts of several historical data points preceding the predicted point as the user behavior count of the predicted point; a selection unit, configured, for a non-stationary sequence, to select the main period and the singular point of the historical data sequence according to each estimated period and its influence value; and a second prediction unit, configured to obtain the user behavior count of the predicted point based on the selected main period and singular point.

The method and device provided by the embodiments of the present invention first convert the historical data sequence of a word's user behavior count from the time domain to the frequency domain and determine each estimated period of the sequence and its influence value, which makes it possible to judge accurately whether the user behavior count varies strongly and whether it varies regularly. A stationary sequence is predicted with a mean algorithm; for a non-stationary sequence, a main period and a singular point are selected and the user behavior count of the predicted point is obtained from them. Applying different prediction algorithms to different kinds of sequence reduces the load on the system: future data can be predicted quickly from the history of a stationary sequence, and accurately and reliably from the history of a non-stationary sequence. The method and device apply to the large number of words found in the Internet field, and the time-to-frequency conversion and the prediction algorithms for stationary and non-stationary sequences are all easy to implement, which effectively reduces the computational load and complexity and the performance cost on the device.

Other features and advantages of the invention will be set forth in the description that follows, and will in part become apparent from the description or be learned by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the written description, the claims, and the accompanying drawings.

The embodiments of the present invention provide a method and device for predicting the user behavior count of a word. The historical data sequence of the word's user behavior count is converted from the time domain to the frequency domain to obtain a frequency-domain sequence; each estimated period of the historical data sequence and its influence value are determined from the frequency-domain sequence; from these it is judged whether the historical data sequence shows obvious regular variation, that is, whether or not it is a stationary sequence; and different prediction algorithms are applied to the different kinds of sequence. This reduces the computational load and complexity and the performance cost on the device, and improves the accuracy and reliability of the prediction.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention and not to limit it, and that, where no conflict arises, the embodiments of the invention and the features of those embodiments may be combined with one another.

Before the specific embodiments are described, several basic concepts are clarified. Predicting the user behavior count of a word means predicting future data from the historical data of the word's user behavior count (traffic or click count). Note that the historical data and the future data use the same time period. The words are typically users' search terms, purchase terms, and the like.

For example, if the time period is one day, the traffic on day 31 and day 32 can be predicted from a word's traffic over the most recent 30 days; if the time period is one hour, the click counts in hours 21, 22, and 23 can be predicted from a word's click counts over the most recent 20 hours; and so on.

To predict the user behavior count of a word, the historical data sequence of the word's user behavior count must be supplied and the number of predicted points must be specified. The historical data sequence is the sequence formed by the historical data points of the word's user behavior count. A historical data point carries two pieces of information, a time point and the historical value; a predicted point likewise carries a time point and the future value. For example, when the traffic on day 31 and day 32 is predicted from a word's traffic over the most recent 30 days, the historical data sequence consists of 30 historical data points, each representing a specific date (one of days 1 to 30) and the traffic on that date, and there are two predicted points, each representing a specific date (day 31 or day 32) and the predicted traffic on that date.

A singular point is a time point at which the user behavior count of a word changes markedly. For example, the user behavior counts before and after the time point differ by an order of magnitude, or the user behavior count shows a clear rise or fall around the time point.

An embodiment of the present invention first provides a system for predicting the user behavior count of a word. The network architecture in which the prediction system runs is shown in FIG. 1 and comprises a website database 100, an application server 101, a prediction device 102, and a parsing server 103, where:

The website database 100 stores website logs, which record each user search and click on each word together with information such as the operation time.

The application server 101 provides the application services built on word user behavior count prediction. For example, it provides a user interface, initiates a prediction request for a word's user behavior count according to the needs of operations staff, and displays the prediction result, that is, the user behavior count of each predicted point.

The prediction device 102 generates a parsing request for the word's user behavior count from the prediction request initiated by the application server 101 and sends it to the parsing server 103; from the historical data sequence returned by the parsing server 103 it obtains the user behavior count of each predicted point and returns it to the application server 101.

The parsing server 103 parses the website logs in the website database 100 according to the parsing request sent by the prediction device 102, extracts the historical data sequence of the word's user behavior count from the parsing result, and returns it to the prediction device 102.

Based on the above prediction system, an embodiment of the present invention provides a method for predicting the user behavior count of a word. As shown in FIG. 2, the method comprises:

S201: convert the historical data sequence of the word's user behavior count from the time domain to the frequency domain.

S202: determine, from the resulting frequency-domain sequence, each estimated period of the historical data sequence and its influence value. An estimated period is a possible period indicated by the frequency-domain sequence, that is, the period value obtained by converting a frequency value; the influence value is the weight that the estimated period carries in the frequency-domain sequence.

S203: judge, from each estimated period of the historical data sequence and its influence value, whether the historical data sequence satisfies the stationary sequence criterion.

S204: if it does, take the mean of the user behavior counts of several historical data points preceding the predicted point as the user behavior count of the predicted point; otherwise, select the main period and the singular point of the historical data sequence according to each estimated period and its influence value, and obtain the user behavior count of the predicted point based on the selected main period and singular point. The main period is the single most probable period chosen from among the estimated periods of the historical data sequence.

For the implementation of S201, the extraction of the historical data sequence is described first. The application server initiates a prediction request for a word's user behavior count according to the needs of operations staff. The prediction device generates a parsing request from that prediction request and sends it to the parsing server. The parsing server parses the website logs in the website database according to the parsing request, extracts the historical data sequence of the word's user behavior count from the parsing result, and returns it to the prediction device. The prediction device can then convert the historical data sequence from the time domain to the frequency domain.
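The extraction step can be pictured with a small sketch. It assumes a hypothetical log layout of `(day, word, action)` tuples and a hypothetical helper `build_history_sequence`; the description does not specify the log format or any function names, so this is only one possible reading.

```python
from collections import Counter
from datetime import date, timedelta


def build_history_sequence(log_records, word, action, start_day, num_days):
    """Count one word's daily actions to form its historical data sequence.

    log_records is an iterable of (day, word, action) tuples. This layout is
    an assumption made for illustration, not a format given in the source.
    """
    counts = Counter(
        day for day, w, a in log_records if w == word and a == action
    )
    # Days with no matching record contribute a count of 0.
    return [counts[start_day + timedelta(days=i)] for i in range(num_days)]


records = [
    (date(2010, 11, 1), "phone", "search"),
    (date(2010, 11, 1), "phone", "search"),
    (date(2010, 11, 1), "phone", "click"),
    (date(2010, 11, 2), "laptop", "search"),
    (date(2010, 11, 3), "phone", "search"),
]
history = build_history_sequence(records, "phone", "search", date(2010, 11, 1), 3)
print(history)  # [2, 0, 1]
```

Searches give the word's traffic and clicks give its click count, so the same helper covers both kinds of user behavior count by changing `action`.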

The time-to-frequency conversion of the historical data sequence is generally performed with the FFT (Fast Fourier Transform), a wavelet transform, or similar; the FFT is a fast algorithm for the DFT (Discrete Fourier Transform). For the historical data sequence of a word's user behavior count, the time domain is the coordinate system used to describe its behavior over time: the time-domain waveform shows how the historical values change with time, with time on the abscissa and the historical value at each time point on the ordinate. The frequency domain is the coordinate system used to describe its frequency characteristics: the frequency-domain waveform shows the influence value of each possible period (estimated period) of the historical data sequence, with the frequency corresponding to the estimated period on the abscissa and the influence value of that estimated period on the ordinate.

For the implementation of S202, the principle is explained using the FFT as an example.

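The discrete Fourier transform is given by formula [1], written here in its standard form:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1 \qquad [1]$$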

Here x(n) denotes the historical data sequence of the word's user behavior count and X(k) denotes the frequency-domain sequence obtained by the transform. The frequency-domain sequence has frequency on the abscissa, and each frequency value corresponds to one possible period (estimated period) of the historical data sequence. The ordinate X(k) at frequency value k is the influence value of the estimated period corresponding to k. The estimated period is therefore obtained by converting the frequency value k, and its influence value is obtained from the ordinate X(k) at that frequency value.

The correspondence between the frequency value k and the estimated period is described below.

The inverse discrete Fourier transform is given by formula [2], written here in its standard form:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi k n / N} \qquad [2]$$

Assuming x(n) has period T, formula [3], the periodicity condition, holds:

$$x(n + T) = x(n) \qquad [3]$$

Substituting formula [2] into formula [3], the component at frequency value k repeats after T points when $e^{j 2\pi k T / N} = 1$; the smallest positive T that satisfies this gives formula [4]:

$$T = \frac{N}{k} \qquad [4]$$

Formula [4] expresses the correspondence between the frequency value k and the estimated period. N is the number of points in the historical data sequence of the word's user behavior count, that is, the number of historical data points in the sequence; k is the index of the k-th frequency and ranges over [1, N-1]; T is the estimated period.

As an example, suppose the historical data sequence is a word's traffic over the most recent N days, with the time-domain waveform shown in FIG. 3a. Converting this sequence to the frequency domain with the FFT yields a frequency-domain sequence whose waveform is shown in FIG. 3b. In the frequency-domain sequence, the ordinate X(k) at frequency value k is the influence value of the estimated period corresponding to k, and formula [4] converts each frequency value k into its estimated period. Suppose the number of points N is 40 and the ordinate X(k) at frequency value k = 4 equals 6. By formula [4], the estimated period corresponding to k = 4 is $T = N/k = 40/4 = 10$, and the influence value X(k) of the estimated period 10 is 6.

In the implementation of S202, each historical value in the historical data sequence extracted from the parsed website logs is supplied as input to the FFT to produce the output. Each frequency value in the output yields one possible period of the historical data sequence, called an estimated period in the embodiments of the present invention, and the influence value at each frequency value is the influence value of the corresponding estimated period.
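A minimal sketch of S202 follows. It uses a naive O(N²) DFT from the standard library rather than an FFT, and it makes two assumptions not stated in the description: the influence value is taken to be the magnitude |X(k)|, and only k from 1 to N/2 is scanned, because for real-valued input the values at k and N - k mirror each other. The function name `estimated_periods` is hypothetical.

```python
import cmath
import math


def estimated_periods(history):
    """Return (estimated_period, influence) pairs for k = 1 .. N // 2.

    The period is N / k. The influence is |X(k)|, which is an assumption:
    the source only says the ordinate X(k) is used as the influence value.
    """
    n_points = len(history)
    result = []
    for k in range(1, n_points // 2 + 1):
        # Naive DFT sum for one frequency value k.
        x_k = sum(
            history[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        result.append((n_points / k, abs(x_k)))
    return result


# Synthetic 40-point series with a period of 10 (illustrative data only).
history = [5 + math.sin(2 * math.pi * n / 10) for n in range(40)]
periods = estimated_periods(history)
strongest = max(periods, key=lambda pair: pair[1])
print(strongest[0])  # 10.0, reached at k = 4 because 40 / 4 = 10
```

Skipping k = 0 keeps the constant offset of the series from dominating the result, which matches the stated range of k starting at 1.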

A stationary sequence is a sequence that shows no obvious regular variation, where obvious regular variation means periodic variation. In theory a stationary sequence is defined as follows: a sequence x(t) is stationary if it satisfies the conditions below, and non-stationary otherwise.

(1) For any $t \in N$, $E[X_t^2] < +\infty$ (the second moment is finite);

(2) For any $t \in N$, $E[X_t] = \mu$ (the mathematical expectation is a constant);

(3) For any $t, s \in N$, $E[(X_t - \mu)(X_s - \mu)] = \gamma_{t-s}$ (the autocovariance depends only on the lag t - s).

In the implementation of S203, the preset stationary sequence criterion is that the influence values of all estimated periods do not exceed a set influence threshold. The influence threshold is typically 10. The criterion can be adjusted for different applications; for example, it may require that the influence values of at least 90% of the estimated periods do not exceed the threshold.
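The criterion above can be sketched as a single check. The function name `is_stationary` and the `min_fraction` parameter are hypothetical; the default threshold of 10 and the 90% variant come from the text, and the sample influence values are made up for illustration.

```python
def is_stationary(influences, threshold=10.0, min_fraction=1.0):
    """Apply the stationary sequence criterion to a list of influence values.

    min_fraction=1.0 requires every influence value to be <= threshold.
    min_fraction=0.9 is the relaxed variant (at least 90% within threshold).
    """
    if not influences:
        raise ValueError("at least one influence value is required")
    within = sum(1 for value in influences if value <= threshold)
    return within / len(influences) >= min_fraction


flat = [2.0, 3.5, 1.2, 4.8, 0.9, 2.2, 3.1, 1.7, 2.6, 0.4]
spiky = flat[:-1] + [25.0]  # one estimated period stands out
print(is_stationary(flat))                     # True
print(is_stationary(spiky))                    # False
print(is_stationary(spiky, min_fraction=0.9))  # True
```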

For example, for the historical data sequence of a given word's user behavior count, if the estimated periods and their influence values are as shown in Table 1, the sequence is judged to be stationary; if they are as shown in Table 2, the sequence is judged to be non-stationary.

In the implementation of S204, for a stationary sequence, the number of historical data points used for the mean can be set flexibly according to the application.
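For a stationary sequence the prediction reduces to an average, as in the sketch below. The name `mean_forecast` and the `window` parameter are hypothetical, and the sketch assumes every predicted point receives the same mean of the last `window` historical points; the description does not say whether later predicted points should also average over earlier predictions.

```python
def mean_forecast(history, window, num_predictions):
    """Predict each future point as the mean of the last `window` points."""
    if not 1 <= window <= len(history):
        raise ValueError("window must be between 1 and len(history)")
    recent = history[-window:]
    mean = sum(recent) / window
    return [mean] * num_predictions


traffic = [98, 102, 101, 99, 100, 103, 97]  # illustrative daily traffic
print(mean_forecast(traffic, window=5, num_predictions=2))  # [100.0, 100.0]
```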

In the implementation of S204, for a non-stationary sequence, the main period and the singular point of the historical data sequence are selected from the estimated periods and their influence values as follows. The main period is the most probable of the estimated periods of the historical data sequence: given a configured main period range, the estimated period that falls within the range and has the largest influence value is taken as the main period. Among the estimated periods other than the main period, the one with the largest influence value is taken as the singular point.

As an example, suppose the estimated periods and influence values determined for the historical data sequence of a given word's user behavior count are as shown in Table 2. For an application in which the time period is one day, analysis of a large body of experimental data and real business data establishes that the main period range is less than or equal to 7; the selected main period is then 7 and the singular point is 42.
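The selection rule can be sketched as follows. Table 2 is not reproduced in this text, so the `(estimated_period, influence)` pairs below are hypothetical values chosen only so that the result matches the stated outcome (main period 7, singular point 42); the function name and the `max_main_period` parameter are likewise assumptions.

```python
def select_main_period_and_singular_point(periods, max_main_period=7):
    """Pick the main period and the singular point from (period, influence) pairs.

    Main period: largest influence among periods <= max_main_period.
    Singular point: largest influence among all remaining estimated periods.
    """
    candidates = [p for p in periods if p[0] <= max_main_period]
    if not candidates:
        raise ValueError("no estimated period falls within the main period range")
    main = max(candidates, key=lambda p: p[1])
    others = [p for p in periods if p is not main]
    if not others:
        raise ValueError("no estimated period is left for the singular point")
    singular = max(others, key=lambda p: p[1])
    return main[0], singular[0]


# Hypothetical influence values; the real Table 2 is not available here.
periods = [(84, 3.0), (42, 30.0), (28, 4.0), (7, 22.0), (3.5, 6.0), (2, 1.5)]
print(select_main_period_and_singular_point(periods))  # (7, 42)
```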

One preferred way of obtaining the user behavior count of the predicted point from the selected main period and singular point comprises the following steps. Step 1: select the historical data points after the singular point in the historical data sequence to form a training data sequence. Step 2: model and solve the training data sequence with a time series model to obtain the user behavior count of the predicted point.

Another preferred way of obtaining the user behavior count of the predicted point from the selected main period and singular point is shown in FIG. 4 and comprises the following steps:

S401: select the historical data points after the singular point in the historical data sequence to form a training data sequence.

S402: average the training data at each main-period position, obtaining a period mean for each main-period position.

S403: subtract from each training value the period mean of its main-period position, obtaining the period-removed training data sequence.

S404: model and solve the period-removed training data sequence with a time series model, obtaining the period-removed user behavior count of the predicted point. The time series model is generally the ARMA model (Auto-Regressive and Moving Average Model), which combines the AR (auto-regressive) model and the MA (moving average) model. The ARMA model is defined by formula [5], written here in the standard ARMA(p, q) form, where p and q are the auto-regressive and moving-average orders:

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} \qquad [5]$$

Here ε_t denotes a white noise sequence and φ and θ are parameters. In practice, the training data of the period-removed training data sequence are supplied as input to the ARMA model, and a parameter estimation algorithm (least squares, maximum likelihood estimation, or the like) is used to estimate the values of φ and θ. Once obtained, these values are substituted into the ARMA model; the training data of the period-removed sequence are then supplied as input to the fitted model, and its output gives the prediction for each predicted point, that is, the period-removed user behavior count.
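To make the estimate-then-predict cycle concrete, the sketch below fits the simplest auto-regressive case, AR(1), by least squares. This is a deliberately reduced stand-in and not the full ARMA model named in the text: it has no moving-average term and a single parameter. The function names and the sample residuals are hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_(t-1) + eps_t."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    if den == 0:
        raise ValueError("series has no variation to fit")
    return num / den


def forecast_ar1(series, phi, steps):
    """Iterate the fitted recursion, setting future noise to its mean of 0."""
    forecasts = []
    last = series[-1]
    for _ in range(steps):
        last = phi * last
        forecasts.append(last)
    return forecasts


residuals = [8.0, 4.0, 2.0, 1.0]  # illustrative: each value is half the previous
phi = fit_ar1(residuals)
print(phi)                              # 0.5
print(forecast_ar1(residuals, phi, 2))  # [0.5, 0.25]
```

A full ARMA fit would typically come from a statistics library; the two-stage shape (estimate parameters, then run the fitted model forward) is the part this sketch is meant to show.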

S405, add the de-periodized number of user behaviors at each prediction point to the period mean of its main-period position to obtain the number of user behaviors at the prediction point.

As an example of the implementation, suppose the training data sequence for the traffic of a word is 1.1, 2.1, 3.1, 3.9, 0.9, 2.2, 2.9, 4.1 (in units of one hundred) and the selected main period is 4. This sequence is hypothetical and serves only to illustrate the procedure. Step 1: average the training data at every fourth position, giving:

Period mean for the first main-period position: (1.1 + 0.9) / 2 = 1

Period mean for the second main-period position: (2.1 + 2.2) / 2 = 2.15

Period mean for the third main-period position: (3.1 + 2.9) / 2 = 3

Period mean for the fourth main-period position: (3.9 + 4.1) / 2 = 4

Step 2: subtract from each training datum the period mean of its main-period position, giving:

1.1 - 1 = 0.1

2.1 - 2.15 = -0.05

3.1 - 3 = 0.1

3.9 - 4 = -0.1

0.9 - 1 = -0.1

2.2 - 2.15 = 0.05

2.9 - 3 = -0.1

4.1 - 4 = 0.1

The de-periodized training data sequence is: 0.1, -0.05, 0.1, -0.1, -0.1, 0.05, -0.1, 0.1.

Step 3: model and solve the de-periodized training data sequence with the ARMA model to obtain the de-periodized number of user behaviors at each prediction point (the prediction result). Assuming three prediction points, the de-periodized numbers of user behaviors obtained are: -0.05, 0.1, 0.05.

Step 4: add the de-periodized number of user behaviors at each prediction point to the period mean of its main-period position, giving:

Number of user behaviors at the first prediction point: -0.05 + 1 = 0.95

Number of user behaviors at the second prediction point: 0.1 + 2.15 = 2.25

Number of user behaviors at the third prediction point: 0.05 + 3 = 3.05
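
The arithmetic of the worked example can be checked with a short sketch. The helper names `period_means`, `remove_period`, and `restore_period` are hypothetical, and the three de-periodized predictions are copied from the text rather than computed by an ARMA model.

```python
def period_means(series, period):
    """Mean of the training values at each main-period position (step S402)."""
    return [
        sum(series[pos::period]) / len(series[pos::period])
        for pos in range(period)
    ]


def remove_period(series, means):
    """Subtract from each value the mean of its main-period position (step S403)."""
    period = len(means)
    return [x - means[i % period] for i, x in enumerate(series)]


def restore_period(deperiodized, means, start_index):
    """Add the positional mean back to each de-periodized prediction (step S405)."""
    period = len(means)
    return [
        value + means[(start_index + k) % period]
        for k, value in enumerate(deperiodized)
    ]


training = [1.1, 2.1, 3.1, 3.9, 0.9, 2.2, 2.9, 4.1]
means = period_means(training, 4)
residuals = remove_period(training, means)
# De-periodized predictions taken from the text (step 3), not fitted here.
forecasts = restore_period([-0.05, 0.1, 0.05], means, start_index=len(training))

print([round(m, 2) for m in means])
print([round(r, 2) for r in residuals])
print([round(f, 2) for f in forecasts])
```

Values are rounded before printing only to hide floating-point noise; rounded to two decimals they agree with the period means, de-periodized sequence, and final predictions listed above.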

Time series models are highly sensitive to singular points, and the de-periodized training data can still contain a few of them, which can cause large deviations in the model's predictions. The embodiment of the present invention therefore combines a mean algorithm with the time series model: the prediction result of the time series model is checked, and if it clearly deviates too much, the prediction is redone with a mean algorithm based on the main period in place of the time series model. For example, suppose a historical data sequence (with a time period of one day) has a main period of 7. If evaluation shows that the prediction of the time series model deviates significantly, the mean of the historical data for the 7 days before the current prediction point is used as the prediction result. That is, the prediction method further includes the following steps: when the deviation of the currently obtained number of user behaviors at a prediction point is confirmed to exceed a set deviation threshold, use the mean of the de-periodized training data within the one main period before the prediction point as the de-periodized number of user behaviors at that prediction point; then add that value to the period mean of the point's main-period position to obtain the number of user behaviors at the prediction point.
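
A minimal sketch of this fallback follows. The text does not say how the deviation is measured, so the sketch takes it as an input computed elsewhere; the function name `repredict_if_needed` and all parameter names are hypothetical.

```python
def repredict_if_needed(forecast, residuals, period_means, position,
                        deviation, threshold):
    """Keep the time series forecast unless its deviation exceeds the threshold.

    residuals    -- de-periodized training data, oldest first
    period_means -- period mean for each main-period position
    position     -- main-period position of the prediction point
    deviation    -- deviation of `forecast`, measured by the caller (assumption)
    """
    if deviation <= threshold:
        return forecast
    period = len(period_means)
    # Mean of the de-periodized data in the one main period before the point.
    last_cycle = residuals[-period:]
    fallback = sum(last_cycle) / len(last_cycle)
    # Restore the period by adding the positional mean.
    return fallback + period_means[position]


residuals = [0.1, -0.05, 0.1, -0.1, -0.1, 0.05, -0.1, 0.1]
period_means = [1.0, 2.15, 3.0, 4.0]
print(repredict_if_needed(9.0, residuals, period_means, position=0,
                          deviation=8.0, threshold=0.5))
```

Passing the deviation in, rather than computing it inside the function, keeps the sketch neutral about the evaluation method, which is not specified here.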

The structure and function of the prediction device in the system for predicting the number of user behaviors of words are described in detail below. Because the device solves the problem on the same principle as the prediction method, its implementation can refer to that of the method, and repeated details are omitted. As shown in FIG. 5, the prediction device includes: a conversion unit 501, configured to perform a time-domain to frequency-domain conversion on the historical data sequence of the number of user behaviors of a word; a determination unit 502, configured to determine each estimation period of the historical data sequence and its influence degree value from the converted frequency-domain sequence; a judging unit 503, configured to judge, from each estimation period and its influence degree value, whether the historical data sequence meets the stationary sequence criterion, and to classify the sequence as stationary if it does and as non-stationary otherwise; a first prediction unit 504, configured to use, for a stationary sequence, the mean of the numbers of user behaviors at several historical data points before the prediction point as the number of user behaviors at the prediction point; a selection unit 505, configured to determine, for a non-stationary sequence, the main period and the singular point of the historical data sequence from each estimation period and its influence degree value; and a second prediction unit 506, configured to obtain the number of user behaviors at the prediction point based on the selected main period and singular point.

In a specific implementation, one possible structure of the selection unit 505 includes: a first selection subunit, configured to take as the main period the estimation period that falls within the configured main period range and has the largest influence degree value; and a second selection subunit, configured to take as the singular point the estimation period with the largest influence degree value among the estimation periods other than the main period.
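
The two selection rules can be sketched as follows, assuming the estimation periods arrive as (period, influence degree value) pairs and the main period range is an inclusive (low, high) pair. The function name, the data layout, and the sample values are all hypothetical.

```python
def select_main_period_and_singular_point(estimates, main_period_range):
    """Return (main_period, singular_point) from (period, influence) pairs."""
    low, high = main_period_range
    in_range = [i for i, (period, _) in enumerate(estimates) if low <= period <= high]
    if not in_range:
        raise ValueError("no estimation period falls within the main period range")
    # Main period: largest influence value among periods inside the range.
    main_idx = max(in_range, key=lambda i: estimates[i][1])
    others = [i for i in range(len(estimates)) if i != main_idx]
    if not others:
        raise ValueError("at least two estimation periods are required")
    # Singular point: largest influence value among all remaining periods.
    singular_idx = max(others, key=lambda i: estimates[i][1])
    return estimates[main_idx][0], estimates[singular_idx][0]


# Hypothetical estimation periods (in days) with their influence degree values.
estimates = [(30, 0.95), (7, 0.5), (3, 0.2)]
print(select_main_period_and_singular_point(estimates, (2, 10)))
```

With these sample values the period 30 has the largest influence overall but lies outside the range, so 7 becomes the main period and 30 the singular point. How the singular point is then mapped to a position in the historical data sequence is outside the scope of this sketch.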

In a specific implementation, one possible structure of the second prediction unit 506, shown in FIG. 6, includes: a selection subunit 601, configured to select the historical data points after the singular point in the historical data sequence to form a training data sequence; and a prediction subunit 602, configured to model and solve the training data sequence with a time series model to obtain the number of user behaviors at the prediction point.

In a specific implementation, another possible structure of the second prediction unit 506, shown in FIG. 7, includes: a selection subunit 701, configured to select the historical data points after the singular point in the historical data sequence to form a training data sequence; an operation subunit 702, configured to average the training data at each main-period position to obtain the period mean for that position; a de-periodizing subunit 703, configured to subtract from each training datum the period mean of its main-period position to obtain the de-periodized training data sequence; a prediction subunit 704, configured to model and solve the de-periodized training data sequence with a time series model to obtain the de-periodized number of user behaviors at the prediction point; and a period recovery subunit 705, configured to add the de-periodized number of user behaviors at the prediction point to the period mean of its main-period position to obtain the number of user behaviors at the prediction point.

The above structure of the second prediction unit 506 may further include a re-prediction subunit 706, configured to, when the deviation of the number of user behaviors currently obtained by the period recovery subunit 705 for a prediction point is confirmed to exceed the set deviation threshold, use the mean of the de-periodized training data within the one main period before the prediction point as the de-periodized number of user behaviors at that point, and then add that value to the period mean of the point's main-period position to obtain the number of user behaviors at the prediction point.

The method and device for predicting the number of user behaviors of a word provided by the embodiments of the present invention first convert the historical data sequence of the word's number of user behaviors from the time domain to the frequency domain and determine each estimation period of the sequence and its influence degree value, which makes it possible to judge accurately whether the number of user behaviors changes greatly and whether it changes regularly. A stationary sequence is predicted with a mean algorithm; for a non-stationary sequence, a main period and a singular point are selected and the number of user behaviors at the prediction point is obtained from them. Applying different prediction algorithms to different kinds of sequence reduces the load on the system: future data can be predicted quickly from the historical data of a stationary sequence, and accurately and reliably from the historical data of a non-stationary sequence.

The embodiments of the present invention are applicable to a large number of words in the Internet field. The time-domain to frequency-domain conversion and the prediction algorithms for stationary and non-stationary sequences are all easy to implement, which effectively reduces the amount and complexity of computation and the performance cost on the device. For a non-stationary sequence, the historical data points after the singular point are selected to form a training data sequence, which is modeled and solved with a time series model; de-periodizing and period recovery reduce the error caused by the inverse conversion from the frequency domain to the time domain, further reducing the amount and complexity of computation and the performance cost on the device and improving prediction accuracy.

Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a device, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices, and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

100: website database

101: application server

102: prediction device

103: parsing server

501: conversion unit

502: determination unit

503: judging unit

504: first prediction unit

505: selection unit

506: second prediction unit

601: selection subunit

602: prediction subunit

701: selection subunit

702: operation subunit

703: de-periodizing subunit

704: prediction subunit

705: period recovery subunit

706: re-prediction subunit

FIG. 1 is a flowchart of the method for predicting the number of user behaviors of a word in an embodiment of the present invention;

FIG. 2 is a flowchart of a preferred prediction method for a non-stationary sequence in an embodiment of the present invention;

FIG. 3a is a schematic diagram of the time-domain waveform of a historical data sequence in an embodiment of the present invention;

FIG. 3b is a schematic diagram of the frequency-domain waveform of a historical data sequence in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the network architecture of the prediction system in an embodiment of the present invention;

FIG. 5 is a block diagram of the device for predicting the number of user behaviors of a word in an embodiment of the present invention;

FIG. 6 is a block diagram of one possible structure of the second prediction unit in an embodiment of the present invention;

FIG. 7 is a block diagram of another possible structure of the second prediction unit in an embodiment of the present invention.

Claims

A method for predicting the number of user behaviors of a word, characterized by comprising: performing a time-domain to frequency-domain conversion on a historical data sequence of the number of user behaviors of the word; determining, from the frequency domain obtained by converting the historical data sequence, one or more estimation periods of the historical data sequence and an influence degree value for each of them; determining, from the one or more estimation periods of the historical data sequence and their influence degree values, whether the historical data sequence is stationary, the determining including determining whether the influence degree value of each of the one or more estimation periods exceeds a configured influence degree threshold; if so, using the mean of the numbers of user behaviors in the historical data sequence before a prediction point as the number of user behaviors at the prediction point; and otherwise, selecting a main period and a singular point of the historical data sequence according to the one or more estimation periods and their influence degree values, the selecting including: selecting, as the main period, an estimation period among the one or more estimation periods that lies within a configured main period range and has a larger influence degree value; and selecting, as the singular point, another estimation period among the one or more estimation periods whose influence degree value is greater than those of the other estimation periods, the other estimation periods excluding the estimation period selected as the main period, and the one or more estimation periods including a plurality of estimation periods; and calculating the number of user behaviors at the prediction point based on the selected main period and the selected singular point.

The method of claim 1, wherein calculating the number of user behaviors at the prediction point based on the selected main period and the selected singular point includes: forming a training data sequence that includes a set of data related to individual historical data points of the historical data sequence, the individual historical data points following the point corresponding to the selected singular point; and calculating the number of user behaviors at the prediction point based on a model of the training data sequence, the model being built with a time series model.

The method of claim 1, wherein calculating the number of user behaviors at the prediction point based on the selected main period and the selected singular point includes: forming a first training data sequence that includes a set of data related to individual historical data points of the historical data sequence, the individual historical data points following the point corresponding to the selected singular point; obtaining period means by averaging sets of first training data of the first training data sequence, each set of first training data corresponding to one of one or more main-period positions; calculating a set of second training data by subtracting the period mean from each set of first training data; forming a second training sequence, with the period removed, that contains the set of second training data; calculating, based on a model of the second training data sequence, the number of user behaviors at the prediction point with the period removed, the model being built with a time series model; and calculating the number of user behaviors at the prediction point by adding the number of user behaviors with the period removed to the period mean corresponding to one of the one or more main-period positions.

The method of claim 3, further comprising: after determining that the deviation of the number of user behaviors at the prediction point exceeds a configured deviation threshold, selecting the mean of training data of the second training sequence, with the period removed, as the number of user behaviors at the prediction point with the period removed, the training data being associated with the one main period before the prediction point.

The method of claim 1, wherein performing the time-domain to frequency-domain conversion on the historical data sequence includes using a fast Fourier transform or a wavelet transform.

The method of claim 1, wherein the number of user behaviors of the word includes the traffic or the click volume of the word.

The method of claim 1, wherein the one or more estimation periods have two or more different periods.

The device of claim 8, wherein the second prediction unit includes: a selection subunit, configured to form a training data sequence that includes a set of data related to individual historical data points of the historical data sequence, the individual historical data points following the historical data point corresponding to the selected singular point; and a prediction subunit, configured to calculate the number of user behaviors at the prediction point based on a model of the training data sequence, the model being built with a time series model.

The device of claim 8, wherein the second prediction unit includes: a selection subunit, configured to form a first training data sequence that includes a set of data related to individual historical data points of the historical data sequence, the individual historical data points following the point corresponding to the selected singular point; an operation subunit, configured to obtain period means by averaging sets of first training data of the first training data sequence, each set of first training data corresponding to one of one or more main-period positions; a de-periodizing subunit, configured to calculate a set of second training data by subtracting the period mean from each set of first training data, the period mean corresponding to the individual main-period position, and to form a second training sequence, with the period removed, that contains the set of second training data; a prediction subunit, configured to calculate, based on a model of the training data sequence, the number of user behaviors at the prediction point with the period removed, the model being built with a time series model; and a period recovery subunit, configured to calculate the number of user behaviors at the prediction point by adding the number of user behaviors with the period removed to the period mean corresponding to one of the one or more main-period positions.

The device of claim 10, wherein the second prediction unit includes a re-prediction subunit, configured to: after determining that the deviation of the number of user behaviors at the prediction point exceeds a configured deviation threshold, select the mean of training data of the second training sequence, with the period removed, as the number of user behaviors at the prediction point with the period removed, the training data being associated with the one main period before the prediction point.

The device of claim 8, wherein the one or more estimation periods have two or more different periods.

The device of claim 8, wherein the number of user behaviors of the word includes the traffic or the click volume of the word.

The device of claim 8, wherein performing the time-domain to frequency-domain conversion on the historical data sequence includes using a fast Fourier transform or a wavelet transform.