Summary of the invention
The object of the invention is to provide a communication network message classification system and method based on massive user behavior data. The system and method can accurately identify all types of messages, satisfy the fine-granularity requirements of message analysis, and, through message classification, support effective and detailed analysis of user behavior data, including users' visit and search data.
The technical solution of the present invention is as follows:
A communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system transmits the collected data to a data cleansing module; the data cleansing module cleans the data, extracts the message features, generates a feature matrix from them, and transmits the feature matrix to a classification algorithm module; the classification algorithm module and a classification model exchange data with each other; and the classification model outputs, through a model output module, the final model used for message comparison.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives data from a training dataset, and the classification model also receives verification data from an evaluation dataset.
A communication network message classification method based on massive user behavior data realizes message classification through the following steps:
(1) The information in the user data acquisition module is imported into the data cleansing module, where the user data is cleaned and the features of the users' communication network messages are extracted; a feature matrix is generated and imported into the classification algorithm module to generate a classification model;
(2) At the same time, the category of each communication network message is labeled manually to build a training dataset and an evaluation dataset; the feature matrix generated from the training dataset is likewise input to the classification algorithm module; the classification algorithm module learns a classification model for messages from the training dataset; the feature matrix generated from the evaluation dataset is input into the intermediate classification model; the model output is checked against the manual labels, and the accuracy of the model is judged from the accuracy rate and recall rate obtained;
(3) The parameters obtained from validating the classification model are fed back to the classification algorithm module, which is continually optimized to improve the robustness of the system and the accuracy of the model under real, complex conditions;
(4) The final model is built and output through the model output module; it is then connected to new messages to predict the category of each communication network message.
The network message categories distinguished by manual labeling include search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
User behavior data is collected by the user data acquisition module, and the information is stored in the user data storage system.
The technical effect of the present invention is as follows:
Communication network traffic contains a large number of widely varied message types. To analyze and mine these messages in depth, each type of message must be identified correctly. Because the data volume is huge, completing this task within the target time and at the target accuracy is very difficult. Through detailed analysis of communication network messages, the present invention extracts message features according to user behavior and then uses data mining and machine learning techniques to build a complete system that accurately identifies all types of messages, covering the entire flow from raw message collection to final online use, and thereby ensures accurate identification of messages within the target time.
Embodiment
The present invention is further described below with reference to the accompanying drawing.
As shown in Figure 1, a communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system transmits the collected data to a data cleansing module; the data cleansing module cleans the data, extracts the message features, generates a feature matrix from them, and transmits the feature matrix to a classification algorithm module; the classification algorithm module and a classification model exchange data with each other; and the classification model outputs, through a model output module, the final model used for message comparison.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives data from a training dataset, and the classification model also receives verification data from an evaluation dataset.
A communication network message classification method based on massive user behavior data realizes message classification through the following steps:
(1) The information in the user data acquisition module is imported into the data cleansing module, where the user data is cleaned and the features of the users' communication network messages are extracted; a feature matrix is generated and imported into the classification algorithm module to generate a classification model;
(2) At the same time, the category of each communication network message is labeled manually to build a training dataset and an evaluation dataset; the feature matrix generated from the training dataset is likewise input to the classification algorithm module; the classification algorithm module learns a classification model for messages from the training dataset; the feature matrix generated from the evaluation dataset is input into the intermediate classification model; the model output is checked against the manual labels, and the accuracy of the model is judged from the accuracy rate and recall rate obtained;
(3) The parameters obtained from validating the classification model are fed back to the classification algorithm module, which is continually optimized to improve the robustness of the system and the accuracy of the model under real, complex conditions;
(4) The final model is built and output through the model output module; it is then connected to new messages to predict the category of each communication network message.
The network message categories distinguished by manual labeling include search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
User behavior data is collected by the user data acquisition module, and the information is stored in the user data storage system.
Optimization process of the classification algorithm module: the classification algorithm module receives the message classification feature matrices generated by computer and by manual labeling and generates the classification model; the classification model receives the verification message classification feature matrix generated from the manually input evaluation dataset; the classification model then feeds the validated data back to the classification algorithm module so that the module can be optimized and subsequent classification becomes more accurate.
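The train, validate, and feed-back cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not name a specific classification algorithm, so a binary perceptron stands in for it; the tuned setting (number of training epochs), the use of an F1 score to combine precision and recall, and all function names are hypothetical.

```python
def train_perceptron(X, y, epochs, lr=1.0):
    """Stand-in classification algorithm (assumption): binary perceptron, labels +1 / -1."""
    w = [0.0] * (len(X[0]) + 1)              # last entry is the bias weight
    for _ in range(epochs):
        for x, t in zip(X, y):
            s = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if t * s <= 0:                   # misclassified sample: update weights
                for i, xi in enumerate(x):
                    w[i] += lr * t * xi
                w[-1] += lr * t
    return w


def predict(w, x):
    """Return +1 or -1 for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w[-1] > 0 else -1


def precision_recall(y_true, y_pred, positive=1):
    """Compare model output with the manual labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def optimize(train, evaluation, epoch_grid=(1, 5, 20)):
    """Feedback loop: retrain with each candidate setting and keep the one
    that scores best on the evaluation dataset."""
    best = None
    for epochs in epoch_grid:
        w = train_perceptron(train[0], train[1], epochs)
        preds = [predict(w, x) for x in evaluation[0]]
        p, r = precision_recall(evaluation[1], preds)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if best is None or f1 > best[0]:
            best = (f1, epochs, w)
    return best
```

Here `train` and `evaluation` are each a pair of (feature matrix, manual labels); a real system would need a multi-class model to cover all four message categories.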
The function of the cleansing module is to remove noise from the data. It comprises two parts: (1) removing unnecessary samples; (2) removing noise information within individual samples.
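A minimal sketch of these two cleaning parts follows. The record layout (a user identifier paired with a URL) and the specific rules for what counts as an unnecessary sample or as noise are assumptions made for illustration; the text does not specify them.

```python
from urllib.parse import urlsplit


def clean_records(records):
    """records: iterable of (user_id, url) pairs (hypothetical layout)."""
    cleaned = []
    seen = set()
    for user_id, url in records:
        parts = urlsplit((url or "").strip())
        # Part 1: remove unnecessary samples (no user, or not a web URL).
        if not user_id or parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # Part 2: remove noise inside a sample (host letter case, URL fragment).
        normalized = f"{parts.scheme}://{parts.netloc.lower()}{parts.path or '/'}"
        if parts.query:
            normalized += "?" + parts.query
        key = (user_id, normalized)
        if key in seen:                      # exact duplicates are also dropped
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned
```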
The training dataset comprises two parts: the first is the manually labeled network message category; the second is the feature vector representing the network message, which is generally expressed as a sparse vector and may undergo corresponding format conversion to meet the requirements of the specific classification algorithm.
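As an illustration of the sparse representation and of one possible format conversion, the sketch below stores only non-zero features and writes a labeled sample in the common `label index:value` text layout used by LIBSVM-style tools. The choice of that target format and the function names are assumptions; the text only says that a conversion may be needed.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {feature_index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def to_libsvm_line(label, sparse):
    """Write one labeled sparse vector as 'label idx:val ...'.
    Indices are written 1-based and in ascending order."""
    feats = " ".join(f"{i + 1}:{v:g}" for i, v in sorted(sparse.items()))
    return f"{label} {feats}".rstrip()
```

For example, `to_libsvm_line(3, to_sparse([0, 1, 0, 0, 0.5]))` yields `3 2:1 5:0.5`, where `3` is a hypothetical numeric code for one of the manually labeled categories.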
The features are mainly pieces of information that can distinguish the various types of messages, and they are obtained through manual analysis and statistics. For example, the features of an advertisement URL may consist of three parts: (1) it contains particular keywords such as alimama, doubleclick, or ad; (2) it is generally located at a leaf node of the user access tree; (3) the proportion of visits in which the user enters it directly is generally small.
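The three advertisement URL features could be computed roughly as in the sketch below. The visit log layout (URL paired with referrer, where a missing referrer means the user typed the URL directly), the whole-token keyword match, and the definition of a leaf as a URL that never acts as a referrer are all assumptions introduced for illustration.

```python
import re

AD_KEYWORDS = ("alimama", "doubleclick", "ad")   # keywords named in the text


def ad_url_features(url, visits):
    """visits: list of (url, referrer) pairs; referrer None = entered directly."""
    # Feature 1: URL contains an ad keyword. Matching whole tokens keeps
    # "ad" from firing on words such as "download".
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    has_keyword = int(any(k in tokens for k in AD_KEYWORDS))
    # Feature 2: URL is a leaf of the access tree (no visit came from it).
    referrers = {ref for _, ref in visits if ref is not None}
    is_leaf = int(url not in referrers)
    # Feature 3: share of visits to this URL that were entered directly.
    hits = [ref for u, ref in visits if u == url]
    direct_ratio = sum(ref is None for ref in hits) / len(hits) if hits else 0.0
    return [has_keyword, is_leaf, direct_ratio]
```

An advertisement URL would typically score high on the first two features and low on the third.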
The feature matrix is the matrix formed by the feature values of each sample.
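A matrix of per-sample feature values can be assembled as sketched below, with one row per sample and one column per feature. The three example extractor functions are hypothetical and are not the features used by the invention.

```python
def build_feature_matrix(samples, extractors):
    """Return a list of rows; row i holds the feature values of sample i."""
    return [[float(f(s)) for f in extractors] for s in samples]


# Hypothetical feature extractors over a URL string.
EXAMPLE_EXTRACTORS = [
    lambda url: len(url),            # URL length
    lambda url: "?" in url,          # has a query string
    lambda url: "search" in url,     # mentions "search"
]
```

For instance, `build_feature_matrix(["http://a.com/search?q=x", "http://b.com/"], EXAMPLE_EXTRACTORS)` gives a 2 x 3 matrix whose second and third columns are 1.0 for the first URL and 0.0 for the second.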
The performance of a classification system is evaluated in two respects: the accuracy of the model and the efficiency of the algorithm. A key factor affecting model accuracy is the adequacy of the features, including their discriminative power and their number. On the basis of in-depth analysis of massive communication network messages, the present invention classifies messages in detail according to user behavior and carefully extracts the features of each message type, thereby ensuring the precision and prediction accuracy of the model. In addition, extensive optimization of algorithm efficiency ensures the practical effectiveness of massive data processing.
The above is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.