An article recommendation method based on Chinese similarity computation

Technical field
The present invention relates to the field of Internet technology, and in particular to an article recommendation method based on Chinese similarity computation.
Background art
With the continuous development of the Internet, people's habits and lifestyles are undergoing revolutionary changes. The growth of the Internet not only makes people's lives more convenient, but also greatly multiplies the channels through which people obtain information. According to the 36th Statistical Report on Internet Development in China published by the China Internet Network Information Center (CNNIC), by June 2015 the number of Internet news users in China had reached 555 million, of whom 460 million read news on mobile phones. As an important application for obtaining information, Internet news ranks second in usage rate, behind only instant messaging.
Against the social background of big data, search engines represented by Google and Baidu allow users to find the relevant information they need simply by entering keywords. However, if a user cannot accurately describe keywords that match his or her needs, a search engine is of no help. Unlike a search engine, a recommender system analyzes the characteristics of user behavior or of the items themselves in order to discover content the user is interested in. With the development and growth of major news and article publishing platforms (such as WeChat public accounts), the number of articles grows rapidly, and it becomes ever more difficult for users to find articles of interest. While the massive volume of articles brings users a wealth of information, it also confronts them with a difficult choice. How to help users efficiently discover articles they are interested in is therefore a major problem that information publishing platforms urgently need to solve.
Because sufficient information about user interests is lacking, and because processing articles poses its own challenges, the automatic recommendation of articles on the Internet is of limited effectiveness, and similar-article recommendation algorithms still have much room for improvement. An article recommendation algorithm must use natural language processing techniques to cope with difficulties such as the semantic ambiguity of natural language, syntactic ambiguity, non-standard grammar and inconsistent wording; it must also convert natural language into mathematical symbols that a machine can recognize, and build and verify models by means of machine learning and data mining. At present there is a large body of research on similar-article recommendation, such as article recommendation based on clustering and classification, article recommendation based on keywords, and recommendation of trending articles in specific domains. Although such studies achieve a certain effect in particular application scenarios, problems such as high complexity, narrow scope of application, high manual-labeling cost and poor recommendation diversity have limited the application of article recommendation algorithms.
Therefore, how to provide an article recommendation method based on Chinese similarity computation that can help Internet users efficiently discover articles of interest, and that has a wide scope of application, a low manual-labeling cost and good recommendation diversity, is a problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of this, the present invention provides an article recommendation method based on Chinese similarity computation that can help Internet users efficiently discover articles of interest, has a wide scope of application, a low manual-labeling cost and good recommendation diversity.
To achieve the above goals, the present invention provides the following technical scheme:
An article recommendation method based on Chinese similarity computation, the specific steps of which include:
Step 1: crawl the main contents of articles using a Python crawler;
Step 2: obtain word vectors from the crawled article contents and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices of the articles obtained in step 3, standardize the word-vector matrix data, perform the matrix computation, and rank the articles by similarity coefficient.
Through the above technical scheme, the technical effect of the present invention is as follows: according to the user's points of interest, the most relevant articles are recommended; the algorithm realized is essentially a Chinese similarity computation, which can help Internet users efficiently discover articles of interest, has a wide scope of application, a low manual-labeling cost and good recommendation diversity.
Preferably, in the above article recommendation method based on Chinese similarity computation, the main contents crawled in step 1 specifically include the text content, the head image and the article abstract. The text content is used to generate the word-vector representation of the article; the head image is shown when the article is recommended to the user; the article abstract is obtained by using the TextRank algorithm to extract three sentences from the original text, so as to summarize the main contents of the article.
Further, the Python modules mainly used are requests, BeautifulSoup and TextRank4Sentence; the article crawler mainly obtains the contents described above.
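The crawling step above can be sketched with the two named modules. The URL handling and HTML selectors here are illustrative assumptions, since the patent does not name a target site or page layout:

```python
# Sketch of the article crawler: fetch a page with requests and pull out the
# title, body text and head image with BeautifulSoup. Selectors are generic
# placeholders (every <p> as body, first <img> as head image).
import requests
from bs4 import BeautifulSoup

def parse_article(html: str) -> dict:
    """Extract title, body text and head image from one article page."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img")
    return {
        "title": soup.title.string if soup.title else "",
        # join all paragraph text as the article body
        "text": "\n".join(p.get_text(strip=True) for p in soup.find_all("p")),
        "head_img": img["src"] if img and img.has_attr("src") else "",
    }

def crawl_article(url: str) -> dict:
    """Fetch one article page and parse it."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_article(resp.text)
```

The abstract would then be produced separately from `text` with TextRank4Sentence, as the patent describes.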
Preferably, in the above article recommendation method based on Chinese similarity computation, the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. In PageRank, the entire World Wide Web is a directed graph whose nodes are web pages; if page A contains a link to page B, there is a directed edge from A to B. After the graph has been constructed, the following formula is used:

S(Vi) = (1 - d) + d * Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

where S(Vi) is the importance (PR value) of page i; d is the damping coefficient, conventionally set to 0.85; In(Vi) is the set of pages that contain a link pointing to page i; Out(Vj) is the set of pages pointed to by the links in page j; and |Out(Vj)| is the number of elements in that set. The importance of a page depends on the sum of the importance of the pages that link to it.
Through the above technical scheme, the beneficial effects are as follows: the importance S(Vj) of each page that links to the current page must be shared among all the pages it scores, hence the division by |Out(Vj)|. Meanwhile, the importance of a page cannot be determined solely by the pages that link to it; with a certain probability it is determined independently of other pages, which is the role of d. PageRank must iterate the above formula many times to obtain a result; initially, the importance of every page can be set to 1. The left-hand side of the equation is the PR value of page i after an iteration, while the PR values used on the right-hand side are those from before the iteration.
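The iteration described above can be sketched on a toy link graph. The graph itself is an invented example; the formula, the damping coefficient of 0.85 and the initial score of 1 follow the text:

```python
# PageRank iteration on a toy directed graph given as {node: [linked nodes]}.
# All scores start at 1; each pass applies
#   S(Vi) = (1 - d) + d * sum of S(Vj) / |Out(Vj)| over pages linking to i.
def pagerank(out_links, d=0.85, iters=50):
    nodes = list(out_links)
    in_links = {v: [u for u in nodes if v in out_links[u]] for v in nodes}
    score = {v: 1.0 for v in nodes}
    for _ in range(iters):
        score = {
            v: (1 - d) + d * sum(
                score[u] / len(out_links[u]) for u in in_links[v]
            )
            for v in nodes
        }
    return score

scores = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# C receives links from both A and B, so it ends up most important
assert scores["C"] > scores["A"] > scores["B"]
```

At the fixed point each score satisfies the formula exactly, which is what the iteration converges toward.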
Preferably, in the above article recommendation method based on Chinese similarity computation, the TextRank algorithm splits the original text into sentences, filters out stop words in each sentence, and retains only words of specified parts of speech, yielding a set of sentences and a set of words. For keyword extraction, each word is a node in PageRank. With the window size set to k, suppose a sentence consists of a sequence of words; then [w1, w2, ..., wk], [w2, w3, ..., wk+1], [w3, w4, ..., wk+2] and so on are windows. Between the nodes corresponding to any two words in the same window there is an undirected, unweighted edge. On the graph constructed in this way, the importance of each word node is computed, and the most important words are taken as keywords. Key phrases are extracted with reference to TextRank: if several keywords are adjacent in the original text, those keywords form a key phrase. For example, in an article introducing "support vector machines", the three keywords "support", "vector" and "machine" may be found, and key-phrase extraction yields "support vector machine". To extract an abstract with TextRank, each sentence is regarded as a node in the graph; if two sentences are similar, a weighted undirected edge is considered to exist between the corresponding nodes, the weight being the similarity. The sentences with the highest importance computed by the PageRank algorithm form the abstract. The similarity of two sentences Si and Sj is computed with the following formula:

Similarity(Si, Sj) = |{wk | wk ∈ Si & wk ∈ Sj}| / (log|Si| + log|Sj|)

where |{wk | wk ∈ Si & wk ∈ Sj}| is the number of words that occur in both sentences, and |Si| is the number of words in sentence i.
Since the graph is weighted, the PageRank formula is modified to:

WS(Vi) = (1 - d) + d * Σ_{Vj ∈ In(Vi)} [ w_ji / Σ_{Vk ∈ Out(Vj)} w_jk ] * WS(Vj)
When calculating keyword, a word is considered as a sentence, then the weight on the side that all sentences are constituted all is 0,The weight w of molecule denominator reduces, and it is PageRank that TextRank algorithm, which is degenerated,;Inside textrank4zh moduleRemovable three words for extracting original text of TextRank algorithm, this article to be briefly summarized.
It should be noted that the jieba word-segmentation module and Google's Word2vec module are used in word-vector training, and a function for recognizing new words was implemented by the inventors. Because jieba segmentation recognizes new words of three or more Chinese characters poorly (for example "mini-game", "autonomous driving", "Internet of Things", "blockchain"), such new words are extracted with this function. The basic principle is: given two adjacent words a and b after jieba segmentation, let c = a + b, i.e. a ("mini") and b ("game") are combined into c ("mini-game"); if c occurs more often than a certain threshold, c is taken as a new word and added to the jieba dictionary.
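The new-word rule above can be sketched as follows. The corpus is a toy list of already-segmented token lists; in real use the tokens would come from jieba, and each discovered word would be registered with jieba's dictionary:

```python
# New-word discovery: count adjacent token pairs (a, b) across the segmented
# corpus; any pair whose count reaches the threshold is merged into c = a + b
# and treated as a new dictionary word.
from collections import Counter

def find_new_words(segmented_corpus, threshold=3):
    """Return merged tokens c = a + b whose pair count meets the threshold."""
    pair_counts = Counter()
    for tokens in segmented_corpus:
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += 1
    return {a + b for (a, b), n in pair_counts.items() if n >= threshold}

corpus = [["block", "chain", "tech"], ["block", "chain", "news"],
          ["block", "chain", "wallet"], ["news", "tech"]]
assert find_new_words(corpus, threshold=3) == {"blockchain"}
# each new word would then be added to the dictionary: jieba.add_word(word)
```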
Preferably, in the above article recommendation method based on Chinese similarity computation, the specific steps of step 2 include:
Step 2.1: definition of word vectors. A machine learning task requires any input to be quantized into a numerical representation; the computing power of the computer can then be fully exploited to calculate the desired result. One representation of word vectors is the one-hot representation: first, all the vocabulary in the corpus is counted and each word is numbered; then a V-dimensional vector is established for each word, where each dimension of the vector corresponds to one word, the value at the dimension matching the word's number is 1, and all other dimensions are 0.
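A minimal sketch of the one-hot construction just described:

```python
# One-hot vectors: number every vocabulary word, then give each word a
# V-dimensional vector with a single 1 at its own index.
def one_hot_vocab(corpus_tokens):
    vocab = sorted(set(corpus_tokens))   # count and number the vocabulary
    v = len(vocab)
    return {
        word: [1 if j == i else 0 for j in range(v)]
        for i, word in enumerate(vocab)
    }

vectors = one_hot_vocab(["cat", "dog", "cat", "fish"])
assert vectors["cat"] == [1, 0, 0]       # vocab order: cat, dog, fish
assert sum(vectors["fish"]) == 1         # exactly one non-zero dimension
```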
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration. The methods based on singular value decomposition:

a. Word–document matrix

A matrix of words against documents is established; by performing a singular value decomposition on this matrix, the vector representation of each word is obtained;

b. Word–word matrix

A context window is set, the co-occurrence matrix between words is built from the statistics, and word vectors are obtained by performing a singular value decomposition on this matrix;
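The word–word co-occurrence approach above can be sketched with NumPy; the corpus, window size and output dimension are illustrative:

```python
# Word vectors from a co-occurrence matrix: count co-occurrences inside a
# sliding context window, then keep the leading singular directions of the
# matrix so each row becomes a dense word vector.
import numpy as np

def svd_word_vectors(sentences, window=2, dim=2):
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    m[idx[w], idx[s[j]]] += 1
    u, sing, _ = np.linalg.svd(m)
    # first `dim` left-singular directions, scaled by singular values
    return vocab, u[:, :dim] * sing[:dim]

vocab, vecs = svd_word_vectors([["I", "like", "cats"], ["I", "like", "dogs"]])
assert vecs.shape == (len(vocab), 2)
```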
The methods based on iteration: the specific formalization is as follows. A unigram language model assumes that the probability of the current word depends only on the word itself, i.e. P(w1, ..., wn) = Π_i P(wi); a bigram language model assumes that the probability of the current word depends on the previous word, i.e. P(w1, ..., wn) = Π_i P(wi | wi-1);
a. Continuous Bag-of-Words Model (CBOW)

Given the context, the probability distribution of the target word is predicted; an objective function is set first, and the neural network is then optimized by gradient descent. The objective function uses the cross-entropy function:

H(ŷ, y) = -Σ_{j=1}^{V} yj log(ŷj)

Since y is a one-hot representation, the term is non-zero only for yj with j = i, so the objective function becomes:

H(ŷ, y) = -log(ŷi)

Substituting the calculation formula of the predicted value (the softmax over the scores u), the objective function can be converted into:

H(ŷ, y) = -ui + log Σ_{j=1}^{V} exp(uj)
b. Skip-Gram Model

The skip-gram model predicts the probability of the context given the target word; an objective function is set, and an optimization method is then used to find the optimal parameters. The objective function is:

minimize J = -log P(w_{c-m}, ..., w_{c-1}, w_{c+1}, ..., w_{c+m} | w_c)

Computing the probabilities between the hidden layer and all V words of the output layer is expensive; to optimize training of the model, the two methods Hierarchical softmax and Negative sampling are used to train it.
Preferably, in the above article recommendation method based on Chinese similarity computation, in step 3 the articles to be recommended are converted into word-vector matrices and the matrix data are standardized, so that each article is represented by a group of vectors, and intermediate file data are generated. Keyword extraction is carried out based on the TF-IDF algorithm: term frequency–inverse document frequency is a common weighting technique for information retrieval and text mining; it assesses the importance of a word to a document within a document set or corpus.
Term frequency is the number of times a given word occurs in a document: TFw = (number of occurrences of term w in the document) / (total number of terms in the document);

Inverse document frequency: the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient:

IDF = log(total number of documents in the corpus / (number of documents containing term w + 1));

TF-IDF = TF * IDF;
The keywords of each article are obtained and represented as vectors; these vectors are then merged to obtain the vector representation of the article.
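The TF-IDF weighting above can be sketched as follows; the corpus is a toy example, whereas in the method itself the documents would be segmented article texts:

```python
# TF-IDF over already-tokenized documents:
#   TF  = count of w in the document / total terms in the document
#   IDF = log(total documents / (documents containing w + 1))
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per document."""
    n_docs = len(docs)
    df = Counter()                    # number of documents containing w
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        counts = Counter(doc)
        out.append({
            w: (c / len(doc)) * math.log(n_docs / (df[w] + 1))
            for w, c in counts.items()
        })
    return out

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "cherry"]]
weights = tf_idf(docs)
# "apple" appears only in document 0, so it outweighs the common "banana" there
assert weights[0]["apple"] > weights[0]["banana"]
```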
Preferably, in the above article recommendation method based on Chinese similarity computation, in step 4 the user's keyword phrase is converted into a group of matrices, the article matrices obtained in step 3 are read, and the matrix computation is carried out, yielding a column of data that is ranked by similarity coefficient. The standardization formula is as follows:

x' = x / |x|

that is, the standardized vector has a modulus of 1 and is a unit vector. The cosine of the angle between two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) measures the similarity between a and b, that is:

cos(θ) = (a · b) / (|a| |b|) = Σ_{k=1}^{n} x1k x2k / (sqrt(Σ_{k=1}^{n} x1k²) * sqrt(Σ_{k=1}^{n} x2k²))

The resulting cos(θ) is the similarity coefficient. Let A be the matrix formed by the article vectors and B be the user's keyword vector; the similarity-coefficient vector is C = A * B, and the length of C(c1, c2, ..., cn) is the number of articles, where n is the total number of articles to be recommended and c1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the magnitude of the similarity coefficient between each article vector and the vector of the keywords entered by the user; the larger the similarity coefficient, the higher the article's recommendation priority.
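The step-4 computation above, standardization to unit vectors followed by the matrix product C = A * B, can be sketched with NumPy; the vectors are invented toy data:

```python
# Cosine-similarity ranking: normalize the article matrix rows and the keyword
# vector to unit length, so A @ B yields one cos(theta) per article, then rank.
import numpy as np

def rank_articles(article_vectors, keyword_vector):
    a = np.asarray(article_vectors, dtype=float)
    b = np.asarray(keyword_vector, dtype=float)
    a /= np.linalg.norm(a, axis=1, keepdims=True)   # each row -> unit vector
    b /= np.linalg.norm(b)
    c = a @ b                      # C = A * B, the similarity coefficients
    return np.argsort(-c), c       # article indices, best first

order, coeffs = rank_articles([[1, 0], [0, 1], [1, 1]], [1, 0])
assert order[0] == 0               # article 0 points the same way as the query
assert abs(coeffs[0] - 1.0) < 1e-9
```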
It can be seen from the above technical scheme that, compared with the prior art, the present invention provides an article recommendation method based on Chinese similarity computation that can help Internet users efficiently discover articles of interest, has a wide scope of application, a low manual-labeling cost and good recommendation diversity. First, the main contents of articles are crawled with a Python crawler; word vectors are then obtained from the crawled article contents and trained; next, the articles to be recommended are converted into word-vector matrices; finally, the user's keyword phrase is converted into a group of matrices, the word-vector matrices of the articles are read, and the matrix computation is carried out, yielding a column of data that is ranked by similarity coefficient, according to which articles are recommended to the user.
Description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the CBOW model structure of the present invention;
Fig. 3 is a schematic diagram of the skip-gram model structure of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to Figs. 1-3. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The embodiment of the present invention discloses an article recommendation method based on Chinese similarity computation that can help Internet users efficiently discover articles of interest, has a wide scope of application, a low manual-labeling cost and good recommendation diversity.
As shown in Fig. 1, an article recommendation method based on Chinese similarity computation includes the following specific steps:
Step 1: crawl the main contents of articles using a Python crawler;
Step 2: obtain word vectors from the crawled article contents and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices of the articles obtained in step 3, standardize the word-vector matrix data, perform the matrix computation, and rank the articles by similarity coefficient.
In order to further optimize the above technical scheme, the main contents crawled in step 1 specifically include the text content, the head image and the article abstract. The text content is used to generate the word-vector representation of the article; the head image is shown when the article is recommended to the user; the article abstract is obtained by using the TextRank algorithm to extract three sentences from the original text, so as to summarize the main contents of the article.
In order to further optimize the above technical scheme, the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. The entire World Wide Web is a directed graph whose nodes are web pages; if page A contains a link to page B, there is a directed edge from A to B. After the graph has been constructed, the following formula is used:

S(Vi) = (1 - d) + d * Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

where S(Vi) is the importance of page i; d is the damping coefficient, conventionally set to 0.85; In(Vi) is the set of pages that contain a link pointing to page i; Out(Vj) is the set of pages pointed to by the links in page j; and |Out(Vj)| is the number of elements in that set. The importance of a page depends on the sum of the importance of the pages that link to it.
In order to further optimize the above technical scheme, the TextRank algorithm splits the original text into sentences, filters out stop words in each sentence, and retains only words of specified parts of speech, yielding a set of sentences and a set of words. Each word is a node in PageRank. With the window size set to k, suppose a sentence consists of a sequence of words; then [w1, w2, ..., wk], [w2, w3, ..., wk+1], [w3, w4, ..., wk+2] and so on are windows. Between the nodes corresponding to any two words in the same window there is an undirected, unweighted edge. On the graph constructed in this way, the importance of each word node is computed, and the most important words are taken as keywords. Key phrases are extracted with reference to TextRank: if several keywords are adjacent in the original text, those keywords form a key phrase. To extract an abstract with TextRank, each sentence is regarded as a node in the graph; if two sentences are similar, a weighted undirected edge is considered to exist between the corresponding nodes, the weight being the similarity. The sentences with the highest importance computed by the PageRank algorithm form the abstract. The similarity of two sentences Si and Sj is computed with the following formula:

Similarity(Si, Sj) = |{wk | wk ∈ Si & wk ∈ Sj}| / (log|Si| + log|Sj|)

where |{wk | wk ∈ Si & wk ∈ Sj}| is the number of words that occur in both sentences, and |Si| is the number of words in sentence i.
Since the graph is weighted, the PageRank formula is modified to:

WS(Vi) = (1 - d) + d * Σ_{Vj ∈ In(Vi)} [ w_ji / Σ_{Vk ∈ Out(Vj)} w_jk ] * WS(Vj)

When keywords are computed, a single word can be viewed as a sentence; the weights of the edges between such "sentences" are then all equal, the weights w in the numerator and denominator cancel, and the TextRank algorithm degenerates into PageRank. The TextRank algorithm inside the textrank4zh module can extract three sentences from the original text, so that the article is briefly summarized.
In order to further optimize the above technical scheme, the specific steps of step 2 include:

Step 2.1: definition of word vectors. A machine learning task requires any input to be quantized into a numerical representation; the computing power of the computer can then be fully exploited to calculate the desired result. One representation of word vectors is the one-hot representation: first, all the vocabulary in the corpus is counted and each word is numbered; then a V-dimensional vector is established for each word, where each dimension of the vector corresponds to one word, the value at the dimension matching the word's number is 1, and all other dimensions are 0.
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration. The methods based on singular value decomposition:

a. Word–document matrix

A matrix of words against documents is established; by performing a singular value decomposition on this matrix, the vector representation of each word is obtained;

b. Word–word matrix

A context window is set, the co-occurrence matrix between words is built from the statistics, and word vectors are obtained by performing a singular value decomposition on this matrix;
The methods based on iteration: the specific formalization is as follows. A unigram language model assumes that the probability of the current word depends only on the word itself, i.e. P(w1, ..., wn) = Π_i P(wi); a bigram language model assumes that the probability of the current word depends on the previous word, i.e. P(w1, ..., wn) = Π_i P(wi | wi-1);
a. Continuous Bag-of-Words Model (CBOW)

As shown in Fig. 2, given the context, the probability distribution of the target word is predicted; an objective function is set first, and the neural network is then optimized by gradient descent. The objective function uses the cross-entropy function:

H(ŷ, y) = -Σ_{j=1}^{V} yj log(ŷj)

Since y is a one-hot representation, the term is non-zero only for yj with j = i, so the objective function becomes:

H(ŷ, y) = -log(ŷi)

Substituting the calculation formula of the predicted value (the softmax over the scores u), the objective function can be converted into:

H(ŷ, y) = -ui + log Σ_{j=1}^{V} exp(uj)
b. Skip-Gram Model

As shown in Fig. 3, the skip-gram model predicts the probability of the context given the target word; an objective function is set, and an optimization method is then used to find the optimal parameters. The objective function is:

minimize J = -log P(w_{c-m}, ..., w_{c-1}, w_{c+1}, ..., w_{c+m} | w_c)

Computing the probabilities between the hidden layer and all V words of the output layer is expensive; to optimize training of the model and reduce training time, the two methods Hierarchical softmax and Negative sampling are used to train it.
In order to further optimize the above technical scheme, in step 3 the articles to be recommended are converted into word-vector matrices and the matrix data are standardized, so that each article is represented by a group of vectors, and intermediate file data are generated. Keyword extraction is carried out based on the TF-IDF algorithm:

Term frequency–inverse document frequency is a common weighting technique for information retrieval and text mining; it assesses the importance of a word to a document within a document set or corpus.

Term frequency is the number of times a given word occurs in a document: TFw = (number of occurrences of term w in the document) / (total number of terms in the document). Some common words have little bearing on the theme, while some words that occur less frequently can express the theme of the article, so using TF alone is inappropriate. The design of the weight must satisfy: the stronger a word's ability to predict the theme, the larger its weight; conversely, the smaller the weight. Among all the articles counted, some words occur in only a very few articles; such words contribute greatly to the theme of an article, so their weights should be designed to be larger.
Inverse document frequency: the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient:

IDF = log(total number of documents in the corpus / (number of documents containing term w + 1)); the denominator is increased by 1 to avoid a denominator of 0. A word with a high term frequency in a particular document and a low document frequency in the whole document set produces a high-weight TF-IDF.

TF-IDF = TF * IDF;
The keywords of each article are obtained and represented as vectors; these vectors are then merged to obtain the vector representation of the article.
In order to further optimize the above technical scheme, in step 4 the user's keyword phrase is converted into a group of matrices, the article matrices obtained in step 3 are read, and the matrix computation is carried out, yielding a column of data that is ranked by similarity coefficient. The standardization formula is as follows:

x' = x / |x|

that is, the standardized vector has a modulus of 1 and is a unit vector. The cosine of the angle between two n-dimensional sample points a(x11, x12, ..., x1n) and b(x21, x22, ..., x2n) measures the similarity between a and b, that is:

cos(θ) = (a · b) / (|a| |b|) = Σ_{k=1}^{n} x1k x2k / (sqrt(Σ_{k=1}^{n} x1k²) * sqrt(Σ_{k=1}^{n} x2k²))

The resulting cos(θ) is the similarity coefficient. Let A be the matrix formed by the article vectors and B be the user's keyword vector; the similarity-coefficient vector is C = A * B, and the length of C(c1, c2, ..., cn) is the number of articles, where n is the total number of articles to be recommended and c1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the magnitude of the similarity coefficient between each article vector and the vector of the keywords entered by the user; the larger the similarity coefficient, the higher the article's recommendation priority.
The embodiments in this specification are described in a progressive manner; each embodiment highlights its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple, and the relevant points can be found in the description of the method.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.