Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art, and provides a method for determining whether to push legal knowledge related to a video.
The purpose of the invention can be achieved by adopting the following technical scheme:
a method for determining whether to push legal knowledge related to a video, comprising:
constructing a legal provision library and a legal case library;
when the push waiting time reaches a preset threshold, acquiring at least one subtitle and at least one bullet screen comment corresponding to the content played in a preset time period of the video;
extracting words meeting a first preset condition from the at least one subtitle to form a subtitle matching legal feature word set;
extracting words meeting the first preset condition from the at least one bullet screen comment to form a bullet screen matching legal feature word set;
extracting words meeting a second preset condition from the at least one bullet screen comment to obtain the number of bullet screen matching question words;
extracting words meeting a third preset condition from the at least one bullet screen comment to obtain the number of bullet screen matching negative emotion words;
using the subtitle matching legal feature word set to calculate the similarity of each text in the legal provision library and the legal case library, to obtain the legal name and the similarity value of the text with the highest similarity;
performing search-engine-based heat analysis on the legal name to obtain a heat analysis value;
acquiring the video playing progress percentage;
taking the subtitle matching legal feature word set, the bullet screen matching legal feature word set, the number of bullet screen matching question words, the number of bullet screen matching negative emotion words, the similarity value, the heat analysis value and the video playing progress percentage as feature items, and performing binary classification with a machine learning model to obtain a classification result;
the classification result being used to determine whether related legal information is pushed to the user at the current moment;
recalculating the word counts and the push waiting time for the next time period, and making the push decision again when the push waiting time reaches the threshold again;
judging whether the classification result is correct according to whether the user clicks or browses the pushed provisions, and taking the judgment result together with its feature items as machine learning training data;
and retraining the machine learning model under a preset condition.
Preferably, constructing the legal case library includes: automatically summarizing each mined legal case text with an automatic text summarization technique, and storing each resulting case summary and its corresponding original case text in different fields of the same document in the legal case library.
Preferably, acquiring the at least one subtitle and the at least one bullet screen comment corresponding to the content played in the preset time period of the video includes: if the video has a subtitle file, extracting the text in the subtitle file as the subtitle set to be processed; if the video has no subtitle file, converting the audio of the video into text by speech recognition and using that text as the subtitle set to be processed.
Preferably, the words meeting the first preset condition include: words in a preset legal knowledge word library obtained from the legal provision library and the legal case library by a keyword extraction technique.
Obtaining the words specific to the legal field further comprises:
expanding the existing preset words with a deep learning word vector tool.
Preferably, the words meeting the second preset condition include: words in a preset question word set, each of which expresses some degree of doubt or questioning.
Preferably, the words meeting the third preset condition include: words in a negative emotion word library constructed from the negative emotion words of public emotion dictionaries.
Preferably, performing binary classification with the machine learning model includes: training the machine learning model with a Support Vector Machine (SVM) algorithm.
Preferably, recalculating the word counts and the push waiting time for the next time period, and making the push decision again when the push waiting time reaches the threshold again, includes: when the machine learning classification result is "do not push", resetting to null the push waiting time, the subtitle matching legal feature word set, the bullet screen matching legal feature word set, the number of bullet screen matching question words, the number of bullet screen matching negative emotion words, the similarity value and the heat analysis value; then restarting the calculation of the push waiting time, and making the push decision again when the push waiting time reaches the threshold again.
Preferably, judging whether the classification result is correct according to whether the user clicks or browses the pushed provisions includes: acquiring the click and browsing log of the user client for the recommendation; when the recommendation result is "push" and more than a threshold percentage of users click the recommendation and browse it for longer than a time threshold, judging the push result to be correct; otherwise, judging the push result to be wrong.
The technical scheme provided by the invention has the following beneficial effects:
the method constructs a legal provision library, a legal case library and a word library specific to the legal field, captures the subtitle and bullet screen information corresponding to the video content being played, and obtains, by matching against the word libraries, the subtitle matching legal feature word set, the bullet screen matching legal feature word set, the number of bullet screen matching question words and the number of bullet screen matching negative emotion words; it then finds the legal text in the legal provision library and the legal case library that is most similar to the subtitle matching legal feature word set, obtains the search-engine heat of that text, acquires the video playing progress percentage, and inputs these features into a machine learning classifier to determine whether the corresponding legal provision or case should be pushed at the current point in the video. Pushing therefore becomes more targeted, the technical problem of excessive knowledge pushing is solved, and the user experience is improved.
Detailed Description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples, so that the way in which the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure.
It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
As shown in Fig. 1, the method of this embodiment for determining whether to push legal knowledge related to a video includes the following steps:
1) Constructing a legal provision library and a legal case library, which specifically comprises:
1.1) Mining a large number of legal provisions from the Internet or from legal documents using techniques such as crawlers or regular expressions, and storing the full text of the provisions in an Elasticsearch server, with each individual provision as one document, to form the legal provision library.
Elasticsearch is a search server based on Lucene. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface. Elasticsearch is developed in Java, was released as open source under the terms of the Apache License, and is currently a popular enterprise search engine. It supports real-time search and is stable, reliable, fast, and easy to install and use.
The format of a single legal provision record is as follows:
Crime of picking quarrels and provoking trouble
Article 293
Whoever commits any of the following acts of picking quarrels and provoking trouble, thereby disrupting public order, shall be sentenced to fixed-term imprisonment of not more than five years, criminal detention or public surveillance:
(I) arbitrarily beating another person, where the circumstances are flagrant;
(II) chasing, intercepting, insulting or intimidating another person, where the circumstances are flagrant;
(III) forcibly taking or demanding, or arbitrarily damaging or occupying, public or private property, where the circumstances are serious;
(IV) creating a disturbance in a public place, causing serious disorder in the public place.
Whoever gathers others to commit the acts in the preceding paragraph on multiple occasions, seriously disrupting public order, shall be sentenced to fixed-term imprisonment of not less than five years but not more than ten years, and may additionally be fined.
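As an illustration of step 1.1), the following minimal sketch builds one document per provision and serializes it in the newline-delimited form accepted by the Elasticsearch bulk API. The index name `legal_provisions` and the field names `law_name`, `article_no` and `text` are assumptions made for this example; the source does not specify a schema. The sketch only produces the request body and does not contact a server.

```python
import json


def make_provision_doc(law_name: str, article_no: int, text: str) -> dict:
    """Build one document for the legal provision library (one article per document)."""
    # Field names are hypothetical; the source does not define a schema.
    return {"law_name": law_name, "article_no": article_no, "text": text}


def to_bulk_body(index: str, docs: list[dict]) -> str:
    """Serialize documents as NDJSON: an action line followed by a source line per document."""
    lines = []
    for doc in docs:
        doc_id = f"{doc['law_name']}-{doc['article_no']}"
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}, ensure_ascii=False))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


doc = make_provision_doc(
    "Criminal Law",
    293,
    "Whoever commits any of the following acts of picking quarrels and provoking trouble ...",
)
body = to_bulk_body("legal_provisions", [doc])
```

The resulting `body` could then be sent to the server's `_bulk` endpoint by whatever HTTP or Elasticsearch client the deployment uses.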
1.2) Mining a large number of legal cases from the Internet or from legal documents using techniques such as crawlers or regular expressions, summarizing each legal case with the automatic text summarization function of the HanLP natural language processing package, and storing each resulting case summary and its corresponding original case text in different fields of the same document in the Elasticsearch server, namely an original legal case field and a legal case summary field, to form the legal case library.
Automatic summarization uses a computer to extract, from an original document, a short and coherent text that accurately reflects the document's central content. The HanLP natural language processing package is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora and customizability. While providing rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, provides services statically and publishes dictionaries in plain text, which makes it very convenient to use; it also includes some corpus processing tools to help users train on their own corpora.
The format of a single legal case record is as follows:
Basic case information
At about 23:00 on the night of March 11, 2014, a group of four men and two women led by Sun X1 fought with the victims at a KTV; Han X, Sun X, Tang X and Ma X were injured and facilities in the KTV were smashed. Forensic examination classified the injuries of Sun X and Han X as minor. Sun X1, suspected of a criminal offence under the Criminal Law, was placed in criminal detention by the local public security sub-bureau on October 23, 2014; his arrest was approved on November 6, 2014, and he is now held at the local detention house. Because Han X, Sun X, Tang X and Ma X suffered minor injuries, the conduct violates Article 234 of the Criminal Law and is suspected of constituting the crime of intentional injury; the case is currently at the investigation stage.
Case results
In this case, Sun X1 surrendered voluntarily and pleaded guilty; his subjective malice was slight and the social harm was minor. Sun X1 and two others had intentionally injured the victims; through mediation by a lawyer, the three voluntarily paid 80,000 yuan in compensation to Sun X and Han X, and Sun X1 obtained the victims' forgiveness. On April 23, 2015, the first-instance criminal judgment found Sun X1 guilty of the crime of picking quarrels and provoking trouble and sentenced him to fixed-term imprisonment of 6 months.
The format of a single legal case summary record is as follows:
Basic case information
Sun X1's conduct violated Article 234 of the Criminal Law. Sun X1's group of four men and two women fought with others at a KTV, and Sun X, Tang X and Ma X suffered minor injuries; Sun X1 is therefore suspected of the crime of intentional injury and of the crime of picking quarrels and provoking trouble.
Case results
The three persons voluntarily paid 80,000 yuan in compensation to the victims. In this case, Sun X1 surrendered voluntarily and pleaded guilty; he was convicted of the crime of picking quarrels and provoking trouble and sentenced to fixed-term imprisonment of 6 months.
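The two-field layout of step 1.2) can be sketched as follows. The summarizer here is a deliberately naive frequency-based stand-in, not HanLP's summarization function, and it assumes whitespace-delimited text with sentence-ending punctuation. The field names `original_text` and `summary` are likewise assumptions made for this example.

```python
import re
from collections import Counter


def naive_summary(text: str, max_sentences: int = 2) -> str:
    """Stand-in extractive summarizer: keep the sentences whose words are most frequent overall."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


def make_case_doc(case_id: str, original_text: str) -> dict:
    """One document per case: original text and summary stored as sibling fields."""
    return {
        "case_id": case_id,
        "original_text": original_text,
        "summary": naive_summary(original_text),
    }


case_text = (
    "A fight broke out at a bar. "
    "Two people were slightly injured in the fight. "
    "The weather was fine."
)
case_doc = make_case_doc("case-0001", case_text)
```

Keeping both fields in one document means a search over the summary field can return the full case text without a second lookup, which is how step 7) uses the library.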
2) Constructing a legal knowledge word library, which specifically comprises:
2.1) Based on the legal provision library and the legal case library obtained in step 1), extracting a large number of keywords from their texts using the TF-IDF technique of the open source machine learning package scikit-learn, and manually screening out words with high legal relevance, such as 'turning', 'coercion' and 'fighting', to construct an initial seed word library. The seed words are then looked up in the Tongyici Cilin synonym thesaurus (or in the HowNet knowledge base) to obtain other similar words of the same category. Similar words are selected mainly on semantic grounds: the selection considers not only the literal form of a word but also words with similar meanings and concepts, so that semantically related words are sufficiently covered. Once enough similar words have been collected, a second round of word expansion is performed.
The second round of word library expansion uses word2vec, an open source tool from Google based on deep learning. The word2vec model is trained on up-to-date corpora, which addresses the problem that Tongyici Cilin and HowNet cannot always be kept current with the latest words and expressions. The word library is continuously maintained and optimized through multiple iterations of this procedure.
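The two stages of step 2) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions: the TF-IDF weighting is the textbook form (scikit-learn's vectorizer uses a smoothed variant), the documents are assumed to be already tokenized, and the toy two-dimensional vectors stand in for embeddings that a trained word2vec model would supply. The function names and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs: list[list[str]], top_k: int = 3) -> list[list[str]]:
    """Return the top_k highest-scoring terms of each tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    result = []
    for doc in docs:
        if not doc:
            result.append([])
            continue
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def cosine(u: tuple[float, ...], v: tuple[float, ...]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seeds: set[str], vectors: dict[str, tuple[float, ...]],
                 threshold: float = 0.8) -> set[str]:
    """Add every word whose vector is close to at least one seed word's vector."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        if any(s in vectors and cosine(vec, vectors[s]) >= threshold for s in seeds):
            expanded.add(word)
    return expanded


docs = [["assault", "victim", "court"], ["theft", "victim", "court"], ["contract", "court"]]
keywords = tfidf_keywords(docs, top_k=1)

# Toy embeddings; a real system would read these from a trained word2vec model.
vectors = {"assault": (1.0, 0.0), "beating": (0.9, 0.1), "contract": (0.0, 1.0)}
lexicon = expand_seeds({"assault"}, vectors)
```

In the described method a manual screening step sits between the two functions: only the extracted keywords judged legally relevant become seeds for expansion.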
3) Capturing the subtitle information corresponding to the content played in the preset time period of the video, matching it against the legal knowledge word library obtained in step 2), and adding the matched keywords to the subtitle legal word set to form the subtitle matching legal feature word set.
3.1) Acquiring the subtitle text file corresponding to the content played in the preset time period of the video; if no subtitle text file can be extracted, converting the audio of the currently played content into text and using it as the subtitle information.
3.2) Matching the subtitle text obtained in step 3.1) against the legal knowledge word library obtained in step 2). For example, if the subtitle text corresponding to the content played in the preset time period is "after school at noon, a certain person gathered a crowd and beat Wang X, causing serious injury to Wang X's brain", matching against the legal knowledge word library yields the keyword set "gathering a crowd, beating, serious injury", which is added to the subtitle legal word set to obtain the subtitle matching legal feature word set.
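Step 3) can be sketched as follows, assuming subtitles are available as timed cues and that matching is a plain substring test against each word in the library. The `Cue` type and the function names are hypothetical, and a production system for Chinese text would more likely segment the text into words first.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def cues_in_window(cues: list[Cue], t0: float, t1: float) -> list[Cue]:
    """Return the cues that overlap the time window [t0, t1)."""
    return [c for c in cues if c.end > t0 and c.start < t1]


def subtitle_feature_words(cues: list[Cue], t0: float, t1: float, lexicon: set[str]) -> set[str]:
    """Collect the legal words that occur in subtitles shown during the window."""
    matched: set[str] = set()
    for cue in cues_in_window(cues, t0, t1):
        matched |= {word for word in lexicon if word in cue.text}
    return matched


cues = [
    Cue(0.0, 4.0, "he gathered a crowd after school"),
    Cue(4.0, 8.0, "the beating caused serious injury"),
    Cue(20.0, 24.0, "an unrelated beating later"),
]
legal_lexicon = {"crowd", "beating", "serious injury", "theft"}
subtitle_words = subtitle_feature_words(cues, 0.0, 10.0, legal_lexicon)
```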
4) Constructing a question word library by manually reviewing sentences used in everyday language to express questions and extracting their main word and symbol features, including but not limited to 'why' and 'what'.
5) Constructing a negative emotion word library from the negative emotion words in the HowNet sentiment dictionary, the NTU sentiment dictionary and similar resources. Words related to negative emotion are extracted mainly by linguistic analysis, including but not limited to 'anger', 'death', 'sadness' and 'nausea'.
6) Capturing the bullet screen information corresponding to the content played in the preset time period of the video, and matching it against the legal knowledge word library obtained in step 2), the question word library obtained in step 4) and the negative emotion word library obtained in step 5). Keywords matched from the legal knowledge word library are added to the bullet screen legal word set to obtain the bullet screen matching legal feature word set; keywords matched from the question word library are added to the question trigger word set to obtain the number of bullet screen matching question words; and keywords matched from the negative emotion word library are added to the bullet screen matching negative emotion word set to obtain the number of bullet screen matching negative emotion words.
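The three bullet screen features of step 6) can be sketched as below. One point the text leaves open is whether the two counts refer to matched words or to matched comments; this sketch assumes the latter and counts each comment at most once per library. The function name and the sample word libraries are hypothetical.

```python
def bullet_screen_features(
    comments: list[str],
    legal_words: set[str],
    question_words: set[str],
    negative_words: set[str],
) -> tuple[set[str], int, int]:
    """Return (matched legal words, comments containing a question word,
    comments containing a negative emotion word)."""
    matched_legal: set[str] = set()
    question_count = 0
    negative_count = 0
    for text in comments:
        matched_legal |= {w for w in legal_words if w in text}
        # Assumption: a comment counts once however many library words it contains.
        question_count += any(w in text for w in question_words)
        negative_count += any(w in text for w in negative_words)
    return matched_legal, question_count, negative_count


comments = [
    "why is this not theft?",
    "so angry",
    "is that allowed? disgusting",
    "nice video",
]
features = bullet_screen_features(
    comments,
    legal_words={"theft", "assault"},
    question_words={"why", "?"},
    negative_words={"angry", "disgusting"},
)
```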
7) Using the subtitle matching legal feature word set obtained in step 3) as the retrieval keyword set, for example "intercept, insult, disorder, order", and using the Okapi BM25 algorithm as the similarity measure, performing a similarity search with the retrieval keyword set over each text in the legal provision library and over each text in the legal case summary field of the legal case library, to obtain the single legal provision or legal case summary most similar to the subtitle matching legal feature word set. If the result is a legal case summary, the original legal case text in the sibling field of the same document is returned as the content. A single search returns a result of the following form:
"Article 293: Whoever commits any of the following acts of picking quarrels and provoking trouble, thereby disrupting public order, shall be sentenced to fixed-term imprisonment of not more than five years, criminal detention or public surveillance ……" together with the similarity score of the retrieved provision, such as "0.6325".
8) For the legal provision or legal case summary obtained in step 7) as most similar to the subtitle matching legal feature word set, submitting the legal title it represents, such as "crime of picking quarrels and provoking trouble", to the Baidu search engine by means of a crawler to obtain its Baidu index for the last month, for example 1578. This value represents the search heat of the provision, that is, the degree to which ordinary users need it, and is used as the retrieval heat value of the provision to be pushed.
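The similarity search of step 7) relies on Okapi BM25, which Elasticsearch computes internally. To show what the score measures, the following self-contained sketch implements the standard BM25 formula over tokenized documents, using the common defaults k1 = 1.2 and b = 0.75 and the non-negative IDF variant used by Lucene. The toy documents and function names are assumptions, and the scores it produces are not comparable to the "0.6325" example above.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every tokenized document against the query terms with Okapi BM25."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def best_match(query: list[str], docs: list[list[str]]) -> tuple[int, float]:
    """Return the index and score of the most similar document."""
    scores = bm25_scores(query, docs)
    best = max(range(len(docs)), key=scores.__getitem__)
    return best, scores[best]


provisions = [
    ["chase", "intercept", "insult", "disorder"],
    ["theft", "property"],
    ["insult", "contract"],
]
query = ["intercept", "insult", "disorder", "order"]
index, score = best_match(query, provisions)
```

Rare query terms contribute more than common ones, and long documents are penalized relative to the average length, which is why a short provision matching several distinctive feature words ranks highest.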
9) Acquiring the video playing progress percentage.
10) Taking the subtitle matching legal feature word set obtained in step 3), the bullet screen matching legal feature word set, the number of bullet screen matching question words and the number of bullet screen matching negative emotion words obtained in step 6), the provision similarity score obtained in step 7), the retrieval heat value of the provision to be pushed obtained in step 8) and the video playing progress percentage obtained in step 9) as the feature items required for machine learning; manually labeling 10000 records as a training set; and training, with a Support Vector Machine (SVM) algorithm on the training set, a binary classifier whose two classes are "push" and "no push".
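A minimal sketch of step 10) follows. Two things in it are assumptions rather than part of the source: each word set is encoded by its size so that the seven feature items form a numeric vector, and the classifier is a small linear soft-margin SVM trained by subgradient descent on the hinge loss, standing in for whatever SVM library the implementation actually uses. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowFeatures:
    subtitle_legal_words: set
    bullet_legal_words: set
    question_count: int
    negative_count: int
    similarity: float
    heat: float
    progress_pct: float

    def to_vector(self) -> list[float]:
        # Assumption: each word set is represented by the number of words it contains.
        return [
            float(len(self.subtitle_legal_words)),
            float(len(self.bullet_legal_words)),
            float(self.question_count),
            float(self.negative_count),
            self.similarity,
            self.heat,
            self.progress_pct,
        ]


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=50):
    """Train a linear SVM; labels must be +1 ("push") or -1 ("no push")."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: move toward the sample and apply L2 shrinkage.
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with margin: only the regularization term applies.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(model, x) -> int:
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, already-scaled training data: +1 = push, -1 = no push.
X_train = [(2.0, 0.0), (3.0, 0.0), (-2.0, 0.0), (-3.0, 0.0)]
y_train = [1, 1, -1, -1]
model = train_linear_svm(X_train, y_train)
```

In practice the features would need scaling before training, since a raw heat value in the thousands would otherwise dominate the counts and the similarity score.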
11) During video playback, every 10 seconds, taking the subtitle matching legal feature word set obtained in step 3), the bullet screen matching legal feature word set, the number of bullet screen matching question words and the number of bullet screen matching negative emotion words obtained in step 6), the provision similarity score obtained in step 7), the retrieval heat value of the provision to be pushed obtained in step 8) and the video playing progress percentage obtained in step 9) as the data required for machine learning prediction, and feeding them into the classifier obtained in step 10) to obtain its predicted value, namely "push" or "no push", thereby determining whether the corresponding legal provision or case is pushed at the current point in the video.
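The periodic decision of step 11) can be sketched as a loop over fixed-length windows. The feature extractor and the classifier are passed in as callables, so the sketch makes no assumption about how either is implemented; the names `run_push_decisions`, `extract` and `classify` are hypothetical. Because features are recomputed from scratch for each window, the reset described for a "no push" result is implicit here.

```python
from typing import Callable, Sequence


def run_push_decisions(
    duration_s: float,
    extract: Callable[[float, float], Sequence[float]],
    classify: Callable[[Sequence[float]], bool],
    interval_s: float = 10.0,
) -> list[tuple[float, bool]]:
    """Evaluate the classifier at the end of every interval_s-second window."""
    decisions = []
    t = interval_s
    while t <= duration_s:
        features = extract(t - interval_s, t)  # features for the window just played
        decisions.append((t, classify(features)))
        t += interval_s  # the push waiting time restarts from zero
    return decisions


# Stub extractor and classifier, for illustration only.
decisions = run_push_decisions(
    35.0,
    extract=lambda t0, t1: [t1],
    classify=lambda feats: feats[0] >= 20,
)
```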
12) According to whether the user clicks and reads the pushed legal provision or legal case, the push is taken as a new machine learning sample and added to the training set of the SVM classifier obtained in step 10); each time 1000 samples have been added, the classifier is re-optimized and retrained.
When the user clicks the pushed legal provision or legal case in the client, the client program starts timing how long the user browses it. When the browsing time exceeds 15 seconds, the push is judged to be correct. The subtitle matching legal feature word set obtained in step 3), the bullet screen matching legal feature word set, the number of bullet screen matching question words and the number of bullet screen matching negative emotion words obtained in step 6), the provision similarity score obtained in step 7), the retrieval heat value of the pushed provision obtained in step 8) and the video playing progress percentage obtained in step 9) are taken as the feature items of the new sample, which is added to the training set of the SVM classifier obtained in step 10); the SVM classifier is re-optimized and retrained each time 1000 samples have been added.
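The feedback rule of step 12) can be sketched as a small buffer that labels each push from the click and browsing time and signals when retraining is due. The class name and its parameters are hypothetical. Labelling a push that was not read for more than 15 seconds as a "no push" example (-1) is an assumption consistent with the earlier statement that such a push is judged wrong.

```python
class FeedbackBuffer:
    """Collect labelled samples from user feedback and signal when to retrain."""

    def __init__(self, retrain_every: int = 1000, min_browse_s: float = 15.0):
        self.samples: list[tuple[list[float], int]] = []
        self.retrain_every = retrain_every
        self.min_browse_s = min_browse_s
        self._since_retrain = 0

    def record(self, features: list[float], clicked: bool, browse_s: float) -> bool:
        """Store one sample; return True when enough new samples have accumulated."""
        # A push is correct only if the user clicked and browsed long enough.
        label = 1 if clicked and browse_s > self.min_browse_s else -1
        self.samples.append((features, label))
        self._since_retrain += 1
        if self._since_retrain >= self.retrain_every:
            self._since_retrain = 0
            return True
        return False


buffer = FeedbackBuffer(retrain_every=2)
first = buffer.record([1.0, 0.6], clicked=True, browse_s=20.0)
second = buffer.record([0.0, 0.1], clicked=True, browse_s=5.0)
```

When `record` returns True, the caller would retrain the classifier on the original training set plus `buffer.samples`.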