Detailed Description
In order to make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art will appreciate that, in the various embodiments of the present invention, numerous technical details are set forth in order to provide the reader with a better understanding of the present application. The technical solutions claimed in the present application can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description and should not constitute any limitation on the specific implementation of the present invention; the embodiments may be combined with and referred to each other where no contradiction arises.
The first embodiment of the invention relates to a video recommendation method applied to a server, where the server may be a video recommendation platform for recommending videos. The implementation details of the video recommendation method of this embodiment are described in detail below; the following details are provided only for ease of understanding and are not required for implementing the present solution.
A flowchart of the video recommendation method in this embodiment may be as shown in fig. 1, and specifically includes:
Step 101: perform homogenization classification on the videos to be recommended.
The videos to be recommended may be all videos published on a video recommendation platform, such as all short videos on a short-video recommendation platform. It can be understood that a relatively popular short video on the platform is usually imitated by many people and then published to the platform, so the platform carries many homogeneous videos that follow the same formula and have similar plots but are published by different people. Homogenization classification in this step means classifying such homogeneous videos, i.e. videos with the same formula and similar plots, into one category; for example, videos belonging to formula 1 are classified into one category, and videos belonging to formula 2 are classified into another category.
Specifically, the homogenization classification may be performed as follows: acquire the similarity between the videos to be recommended, and classify the videos according to the similarity. For example, one video may be arbitrarily selected as the reference video and the other videos taken as referenced videos; the similarity between the reference video and each referenced video is obtained, the referenced videos whose similarity to the reference video is greater than a preset similarity are selected, and the selected referenced videos and the reference video are regarded as homogeneous videos and classified into one category. For example, videos that follow the same formula or tell the same story are classified into one category, as in the sketch below.
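A minimal sketch of this greedy, threshold-based grouping, assuming a similarity() function (for example, one of the similarity measures described below) is available; the function name and threshold are illustrative only:

```python
from typing import Callable, List

def classify_homogeneous(videos: List[str],
                         similarity: Callable[[str, str], float],
                         threshold: float = 0.8) -> List[List[str]]:
    """Greedy homogenization classification: each unassigned video becomes a
    reference video, and every remaining video whose similarity to it exceeds
    the preset threshold joins the same category."""
    categories: List[List[str]] = []
    remaining = list(videos)
    while remaining:
        reference = remaining.pop(0)          # arbitrary reference video
        category, rest = [reference], []
        for candidate in remaining:
            if similarity(reference, candidate) > threshold:
                category.append(candidate)    # homogeneous with the reference
            else:
                rest.append(candidate)
        remaining = rest
        categories.append(category)
    return categories
```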
In one example, the similarity between videos includes line similarity. The line similarity may be obtained as follows: perform speech analysis on each video to be recommended, extract the spoken lines in each video, convert the spoken lines into text lines, compare the text lines of the videos, and obtain the line similarity between the videos according to the comparison result. For example, speech analysis is performed on all videos, and the spoken lines are converted into text lines by a speech-to-text function. Videos with similar lines can then be classified into one category by a convolutional neural network Deep Structured Semantic Model (CNN-DSSM).
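As an illustration only, the sketch below computes pairwise line similarity over already-transcribed text lines, substituting TF-IDF cosine similarity for the CNN-DSSM model (a trained CNN-DSSM would take the vectorizer's place):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def line_similarity_matrix(transcripts):
    """Pairwise line similarity between videos; transcripts[i] holds the
    text lines of video i, as produced by a speech-to-text step."""
    vectors = TfidfVectorizer().fit_transform(transcripts)
    return cosine_similarity(vectors)   # entry [i][j] = similarity of i and j
```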
In one example, the similarity between videos includes bridge segment similarity. The bridge segment similarity may be obtained as follows, with reference to fig. 2:
Step 201: acquire the content tag of each video to be recommended.
The content tag includes a target identified in each video and an action of the target, where the target can be understood as a person, an animal, a robot, or the like.
In one example, if the target is a person, the person in each video may be identified using face recognition techniques. For example, a face database may be preset, containing the faces of different people, and different numbers may be assigned to different faces, so that different faces are represented by different numbers. When a face is detected in a video, it can be searched for in the face database; if the face is found, the number of the matched face is assigned to the person in the video, and if it is not found, the face is added to the database and a new number is created for it. For example, the persons identified within one video may be denoted by labels such as A, B, and C. It should be understood that the identified person information may include: name, age, occupation, expression, clothing, and the like.
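A minimal sketch of such a face database, under the assumption that a separate face recognition model supplies an embedding vector per detected face; the distance threshold is illustrative:

```python
import numpy as np

class FaceDatabase:
    """Assigns a stable number to each distinct face: reuse the number of a
    matching stored face, or register the face under a new number."""
    def __init__(self, threshold: float = 0.6):
        self.embeddings = []        # one embedding per registered face
        self.threshold = threshold

    def identify(self, embedding: np.ndarray) -> int:
        for number, known in enumerate(self.embeddings):
            if np.linalg.norm(known - embedding) < self.threshold:
                return number       # found in the database: reuse its number
        self.embeddings.append(embedding)   # not found: add and number it
        return len(self.embeddings) - 1
```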
In one example, if the action of the target is a human action, it can be recognized using human action recognition techniques from deep learning. For example, the action of a person can be recognized by a 3D convolutional neural network: the network treats a stack of consecutive video frames as a cube and convolves it with three-dimensional kernels, so that motion across frames is captured. In a specific implementation, the action of the person may also be recognized by an RGB plus optical-flow algorithm, among others. The present embodiment merely provides these two examples of recognizing the action of a person; the recognition manner in a specific implementation is not limited thereto.
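A minimal PyTorch sketch of the 3D-convolution idea (the architecture and sizes are illustrative, not the patent's model):

```python
import torch
import torch.nn as nn

class Tiny3DActionNet(nn.Module):
    """A stack of consecutive frames, shaped [batch, channels, frames, H, W],
    is convolved with 3D kernels so motion across frames is captured."""
    def __init__(self, num_actions: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # 3D convolution
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # pool time and space
        )
        self.classifier = nn.Linear(16, num_actions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(clip).flatten(1))

# one clip: batch of 1, RGB, 16 consecutive 112x112 frames
logits = Tiny3DActionNet()(torch.randn(1, 3, 16, 112, 112))
```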
Optionally, the content tag may further include any one or a combination of the following, identified in each video: the scene, background music, lines, and objects associated with the action of the target. The identified scene may be the scene in which the action of the target occurs, and an object associated with the action of the target is an object that changes in the video as the action of the target changes.
In one example, the scene in each video may be identified as follows: extract video frame images from each video, and use the scene recognition capability of deep learning to identify the scene in which the person acts in the extracted frames. For example, Gist information, i.e. global feature information, is extracted from the video frame image; Gist information is a low-dimensional signature vector of a scene. Identifying and classifying scenes with global features requires neither image segmentation nor local feature extraction, enabling fast scene recognition and classification.
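As one illustration, a nearest-prototype classifier over such signature vectors; compute_gist() stands for a real Gist descriptor implementation and is assumed, not shown:

```python
import numpy as np

def classify_scene(gist_vector: np.ndarray, scene_prototypes: dict) -> str:
    """Assign the scene label whose prototype Gist vector is closest to the
    frame's low-dimensional signature vector; no segmentation or local
    feature extraction is involved."""
    return min(scene_prototypes,
               key=lambda name: np.linalg.norm(scene_prototypes[name] - gist_vector))

# prototypes = {"forest": np.array([...]), "road": np.array([...])}
# scene = classify_scene(compute_gist(frame), prototypes)   # hypothetical
```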
In one example, objects associated with the action of the target in each video may be recognized as follows: identify the objects in the video using a deep learning capability, such as a YOLO network. The YOLO network treats object detection as a regression problem; inputting a video frame image into the YOLO network yields the positions and categories of all objects in the frame. Since not all objects in the frame are necessarily related to the action of the target, the detected objects can be screened to keep only those associated with the action of the target, for example according to whether the position or size of an object in the frame changes as the action of the target changes. In addition, in a specific implementation, objects whose size is below a certain proportion of the frame may be ignored during screening. Moreover, to ensure the effect, object identification does not need fine granularity and may be done at the category level. For example, a BMW X6 is classified as: automobile; three ace playing cards are classified as: playing cards.
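A sketch of that screening step, assuming per-frame detections (label plus a box normalized to the frame size) from a detector such as YOLO; the thresholds are illustrative:

```python
def objects_associated_with_action(frame_detections,
                                   min_area: float = 0.01,
                                   change_threshold: float = 0.05):
    """frame_detections: per frame, a list of (label, (x, y, w, h)) tuples.
    Keep category-level labels whose position or size varies across frames
    (likely tied to the target's action); ignore very small objects."""
    tracks = {}
    for detections in frame_detections:
        for label, (x, y, w, h) in detections:
            if w * h < min_area:                 # too small: ignore
                continue
            tracks.setdefault(label, []).append((x, y, w, h))
    associated = []
    for label, boxes in tracks.items():
        xs = [b[0] for b in boxes]
        ws = [b[2] for b in boxes]
        if max(xs) - min(xs) > change_threshold or \
           max(ws) - min(ws) > change_threshold:  # moved or resized
            associated.append(label)
    return associated
```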
In one example, each video to be recommended may be split into a plurality of video clips, each video clip corresponding to one bridge segment, and the content tags of the video clips are then acquired respectively, i.e. each video clip may have its own content tag. Each video to be recommended may be split as follows: analyze the lines with Natural Language Processing (NLP) and evaluate the semantic coherence of the context. When the semantic coherence drops sharply or is strongly interrupted, the plot is judged to have ended, and the span between two such breakpoints forms an independent plot, i.e. a video clip (see the sketch below). The videos may also be split by judging from the background, the characters, and the characters' clothing; for example, a sudden large change in a character's clothing features may indicate a break in plot continuity, and the judgment can be combined with other elements. For instance, when the leading character falls into a puddle and the deformed clothing causes recognition to fail, whether the plots are connected can be judged from the background: if the character's background changes greatly, the plots are not connected, the break points serve as split points, and the span between two split points is an independent plot, i.e. a video clip. If the background changes little, as in a small room, plot coherence can be further judged by NLP.
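A minimal sketch of the coherence-based split, assuming a hypothetical per-line coherence score (e.g. cosine similarity of sentence embeddings between consecutive lines); the drop threshold is illustrative:

```python
def split_by_coherence(coherence, drop_threshold: float = 0.3):
    """coherence[i] scores the semantic coherence between lines i and i+1.
    A sharp drop marks a breakpoint; each span between breakpoints is one
    independent plot, i.e. one video clip (returned as line index ranges)."""
    clips, start = [], 0
    for i, score in enumerate(coherence):
        if score < drop_threshold:       # coherence drops sharply: breakpoint
            clips.append((start, i + 1))
            start = i + 1
    clips.append((start, len(coherence) + 1))
    return clips

# e.g. split_by_coherence([0.9, 0.8, 0.1, 0.85]) -> [(0, 3), (3, 5)]
```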
Step 202: infer the inference label of each video according to the content tag and a pre-established knowledge graph.
The knowledge graph stores inference relations between content tags and inference labels, where an inference label is content inferred from the video. As shown in fig. 3, the knowledge graph is composed of nodes and edges: each node represents an entity and each edge represents a relation. The inference relations between content tags and inference labels may be predefined, or may be established by automatically crawling data from the network.
In one example, the content tags in a video are: A, male, age 35, doctor, kills B, C, and D using the same method; the content obtained by knowledge graph inference, i.e. the inference labels, may then be: killer, serial killer, camouflage. In another example, the content tags in a video are: A, leading character, wearing a police uniform, pursuing, B, and the subject of the video is positive; the inference label obtained by knowledge graph inference may be: A is a good person. It will be appreciated that if A is the leading character and the subject of the video is positive, A is very likely a good person. Briefly, the content tags corresponding to the inference label "good person" may include: "B (person), helps (action), passerby (person)"; "B (person), in a park (scene), rescues (action), animal"; "B, opposes the bad people"; "B, friendly to neutral characters", etc. The content tags corresponding to the inference label "bad person" may include: "A, shoots, police officer"; "A, injures, passerby"; "A, kidnaps, leading character", etc. It is understood that inference labels are difficult to obtain through visual-level recognition alone, but can be obtained through knowledge graph inference.
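Purely as an illustration, the sketch below flattens the graph of fig. 3 into rules that map sets of content tags to an inference label; a real knowledge graph would store entities as nodes and relations as edges:

```python
# Hypothetical rules distilled from a knowledge graph: if every tag in the
# condition set is present among a video's content tags, emit the label.
INFERENCE_RULES = [
    ({"kills", "same method", "multiple victims"}, "serial killer"),
    ({"leading character", "police uniform", "positive subject"}, "good person"),
    ({"kidnaps", "leading character"}, "bad person"),
]

def infer_labels(content_tags):
    """Return every inference label whose condition set is satisfied."""
    tags = set(content_tags)
    return [label for condition, label in INFERENCE_RULES if condition <= tags]

print(infer_labels({"A", "kills", "same method", "multiple victims"}))
# -> ['serial killer']
```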
In one example, the inference labels of the video clips can be inferred from the content tags of the video clips obtained by splitting each video to be recommended and the pre-established knowledge graph, i.e. each video clip may have its own inference label.
Step 203: input the content tag and the inference label into a pre-trained model, and output the bridge segment to which each video belongs.
In one example, the content tag and the inference label of each video can be input into a pre-trained model, which outputs the bridge segment to which the video belongs. If a video contains a plurality of bridge segments, the model can directly output all bridge segments to which the video belongs. In a specific implementation, the output may be information such as the name or number of each bridge segment.
In another example, each video may be split in advance into a plurality of video clips, and the content tags and inference labels corresponding to the clips are acquired; the content tags and inference labels of the clips may then be input into the pre-trained model in sequence, and the bridge segments to which the clips belong are output.
In a specific implementation, the content tag and the inference label may be combined and converted into corresponding text, i.e. the tags may exist in text form, referred to as text labels. Video time spans and text labels may correspond one to one; for example, for 00:00:12-00:00:18 the corresponding text label is: name A, male, age 35, doctor, killer, serial killer, camouflage, night, road, running, forest, tense mood, eerie atmosphere. Finally, the text labels are input into the pre-trained model, which outputs the bridge segment to which each video belongs. The pre-trained model may be a word vector model that outputs the bridge segment corresponding to an input text label.
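A sketch of assembling such a time-keyed text label; classify_bridge_segment() stands for the pre-trained word vector model and is assumed rather than shown:

```python
def to_text_label(time_span, content_tags, inference_labels):
    """Combine content tags and inference labels into one text label keyed
    by the video time span."""
    return time_span + ": " + ", ".join(list(content_tags) + list(inference_labels))

label = to_text_label("00:00:12-00:00:18",
                      ["name A", "male", "age 35", "doctor", "night", "road"],
                      ["killer", "serial killer", "camouflage"])
# bridge_segment = classify_bridge_segment(label)   # hypothetical model call
```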
The following briefly describes the training mode of the above model:
First, select training samples; that is, a large number of videos are selected as training samples.
Second, select sample features. Sample features may include content tags, inference labels, and annotated bridge segments. For example, the content tags are obtained by identifying the people, objects, actions, scenes, expressions, background music, lines, and the like in each video; the inference labels are established from the overall content of the video based on the knowledge graph and the content tags; and the content tags may be combined with the inference labels and converted into text form, i.e. text labels. Each video is annotated with its bridge segment, for example manually, and the annotated bridge segments may be: the protagonist cannot die, a woman disguised as a man, the hero rescues the beauty, etc.
Finally, train the model; that is, train on the training samples and sample features, for example using machine learning, with the text labels of videos of a given bridge segment type as input, to obtain a word vector model.
In one example, after the word vector model is trained, it may be updated at intervals. The bridge segments output by the word vector model can be compared with the actual bridge segments to adjust the parameters of the model, for example by increasing the amount of sample data or the number of training iterations, so that the bridge segments determined by the word vector model become more accurate.
In one example, the bridge segment in each video may also be determined as follows: match the content tag against each bridge segment in a preset bridge segment library. The preset bridge segment library can be established in advance and contains bridge segments of various types. For example, for "the hero rescues the beauty", the bridge segment is: B is deceived by C, A knocks down C and saves B, and A and B are of opposite sexes. In a specific implementation, the relations among the identified target, objects, action of the target, and scene may be matched against each bridge segment in the preset library to obtain the bridge segment of the current video. It will be appreciated that the target and the action of the target are content features necessary for the match; the scene, background music, lines, objects associated with the target, and the like are optional content features, but they can make the match more accurate.
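A minimal sketch of the library match, representing each bridge segment by its defining tag set; the library entries are illustrative:

```python
def match_bridge_segments(video_tags, bridge_library):
    """Return every bridge segment whose defining tags (targets, actions,
    and optionally scenes etc.) all appear among the video's tags."""
    tags = set(video_tags)
    return [name for name, required in bridge_library.items() if required <= tags]

library = {
    "hero rescues the beauty": {"deceived", "knocks down", "saves",
                                "opposite sexes"},
}
print(match_bridge_segments(
    {"A", "B", "C", "deceived", "knocks down", "saves", "opposite sexes"},
    library))
# -> ['hero rescues the beauty']
```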
Step 204: acquire the bridge segment similarity between the videos according to the bridge segments in the videos.
In one example, the bridge segment similarity between videos may be obtained according to whether the same bridge segment exists in the videos: the bridge segment similarity between videos that share a bridge segment is greater than that between videos that do not. Assuming the same bridge segment exists in video 1 and video 2 but not in video 1 and video 3, the bridge segment similarity between video 1 and video 2 is greater than that between video 1 and video 3.
In another example, when a video contains multiple bridge segments, the bridge segment similarity between videos can also take into account the number of identical bridge segments: the more identical bridge segments two videos share, the greater their bridge segment similarity. Assuming video 1 and video 2 share 2 identical bridge segments while video 1 and video 3 share 3, the bridge segment similarity between video 1 and video 3 is greater than that between video 1 and video 2.
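Both rules reduce to counting shared bridge segments, as in this sketch:

```python
def bridge_segment_similarity(segments_a, segments_b) -> int:
    """Bridge segment similarity as the number of identical bridge segments
    shared by two videos (0 means no common bridge segment)."""
    return len(set(segments_a) & set(segments_b))

v1 = {"hero rescues the beauty", "woman disguised as a man", "chase"}
v2 = {"hero rescues the beauty", "chase"}
v3 = {"hero rescues the beauty", "woman disguised as a man", "chase", "duel"}
assert bridge_segment_similarity(v1, v3) > bridge_segment_similarity(v1, v2)
```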
Step 102: screen among the plurality of videos belonging to the same classification result according to the result of the homogenization classification.
The number of screened videos may be smaller than a preset threshold, which can be set according to actual needs, so that the number of screened videos is kept small. In a specific implementation, just 1 video may be screened from the plurality of videos belonging to the same classification result; for example, 1 video is selected from the videos belonging to formula 1, and 1 video is selected from the videos belonging to formula 2. In one example, the screening may proceed as follows:
First, acquire the ranking weight features of the plurality of videos belonging to the same classification result. The ranking weight features include: publishing account information and/or browsing counts. The browsing counts can be obtained from the statistics of the video service platform; in a specific implementation, the browsing counts of the videos within a preset time period may be counted, for example over 3 days. The publishing account information can be acquired from the video service platform and may include any one or a combination of the following: the level of the publishing account, the number of followers of the publishing account, and whether the publishing account is followed by the user who requests the video recommendation. It will be appreciated that different publishing accounts may have different levels, and a higher level indicates that videos published through that account are generally of higher quality; likewise, the more followers an account has, the higher the general quality of its videos. In addition, the video recommendation platform generally pushes videos to a client after receiving a recommendation request from that client, i.e. recommends videos to the user of that client. Since different users have different interests and follow different publishing accounts, the publishing account information of the videos belonging to the same classification result may further include whether the requesting user follows each publishing account.
Then, determine the ranking weight of each of the plurality of videos according to the ranking weight features.
In one example, the ranking weights of the videos may be determined from the publishing account information. For example, the higher the level of a publishing account, the greater the ranking weight of the videos it publishes; the more followers a publishing account has, the greater the ranking weight of its videos; and if the publishing account is followed by the user requesting the recommendation, the ranking weight of its videos is greater.
In another example, the ranking weights of the videos may be determined from the browsing counts: the more a video has been browsed, the greater its ranking weight.
Optionally, the ranking weights of the videos may be determined from both the publishing account information and the browsing counts. For example, for each video, whether the requesting user follows its publishing account is denoted F, with F = 1.2 for a followed account and F = 1 for an unfollowed account; F can be adjusted according to the actual situation. The browsing count of the video is denoted V (taken over 3 days). The ranking weight is then: F × V. That is, when a video is recommended to user 1, whether user 1 follows the publishing account is considered; when a video is recommended to user 2, whether user 2 follows the publishing account is considered. By combining whether the requesting user follows each video's publishing account, the determined ranking weights are more specific to that user, which facilitates personalized recommendation to different users.
Finally, screen the plurality of videos belonging to the same classification result according to the ranking weights. For example, the video with the largest ranking weight, or videos whose ranking weight exceeds a preset weight, may be screened out; the preset weight may be set according to actual needs and is not specifically limited in this embodiment. A sketch follows.
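A minimal sketch of the F × V weighting and the screening step; the field names and the 3-day view count are assumptions from the example above:

```python
def ranking_weight(views_3_days: int, followed_by_requester: bool) -> float:
    """Ranking weight F x V: F = 1.2 if the requesting user follows the
    publishing account, else F = 1; V is the 3-day browsing count."""
    f = 1.2 if followed_by_requester else 1.0
    return f * views_3_days

def screen(videos, max_count: int = 1):
    """Keep the top-weighted videos of one classification result, keeping
    the number below the preset threshold."""
    ranked = sorted(videos,
                    key=lambda v: ranking_weight(v["views"], v["followed"]),
                    reverse=True)
    return ranked[:max_count]

same_formula = [{"id": "v1", "views": 900, "followed": False},
                {"id": "v2", "views": 800, "followed": True}]
print(screen(same_formula))   # v2 wins: 1.2 * 800 = 960 > 900
```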
In a specific implementation, random screening may also be used among the plurality of videos belonging to the same classification result; the screening manner is not specifically limited in this embodiment.
Step 103: push the screened videos to the client.
The number of screened videos is smaller than a preset threshold, which can be set according to actual needs, so that the number of videos of the same formula pushed to the client is kept small. For example, 1 video is selected from the plurality of videos belonging to formula 1 and pushed to the client, and 1 video is selected from the plurality of videos belonging to formula 2 and pushed to the client.
Compared with the prior art, this embodiment performs homogenization classification on the videos to be recommended, screens among the plurality of videos belonging to the same classification result according to the result of the homogenization classification, and pushes the screened videos to the client, with the number of screened videos smaller than a preset threshold. That is, the homogeneous videos are treated as a single category, the videos belonging to the same classification result are screened, and only the screened videos are pushed to the client. Because the number of screened videos is smaller than the preset threshold, too many homogeneous, i.e. similar, videos are prevented from being recommended to the user, which keeps the user's browsing experience fresh.
A second embodiment of the present invention relates to a video recommendation method. The implementation details of the video recommendation method of this embodiment are described in detail below; the following details are provided only for ease of understanding and are not required for implementing the present solution. A flowchart of the video recommendation method in this embodiment may be as shown in fig. 4, wherein steps 301 to 303 are substantially the same as steps 101 to 103 in the first embodiment and, to avoid repetition, are not described again here.
Step 301: perform homogenization classification on the videos to be recommended.
Step 302: screen among the plurality of videos belonging to the same classification result according to the result of the homogenization classification.
Step 303: push the screened videos to the client.
Step 304: if a blocking operation on a target video is detected from the client, simultaneously block the videos that belong to the same classification result as the target video.
The target video is the video blocked by the user during browsing. Specifically, the display interface of the client may provide a trigger key with a blocking function, which may be a virtual key. While browsing videos, if the user dislikes a certain video, the user can tap the virtual key to block it. After detecting that the video, i.e. the target video, has been blocked, the video recommendation platform determines the other videos belonging to the same classification result as the target video and blocks them at the same time. For example, the user blocks video 1; the video recommendation platform determines that the classification result of video 1 is formula 1 and then simultaneously blocks the other videos belonging to formula 1.
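A minimal sketch of that propagation, assuming the classification results from step 301 are stored as a video-to-category mapping:

```python
def block_homogeneous(target_video_id: str, category_of: dict,
                      blocked: set) -> None:
    """When the client blocks a target video, also block every video that
    was classified into the same homogenization category (same formula)."""
    target_category = category_of[target_video_id]
    for video_id, category in category_of.items():
        if category == target_category:
            blocked.add(video_id)

category_of = {"v1": "formula 1", "v2": "formula 1", "v3": "formula 2"}
blocked: set = set()
block_homogeneous("v1", category_of, blocked)
print(blocked)   # {'v1', 'v2'}: all of formula 1 is blocked
```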
Compared with the prior art, in this embodiment, when the user finds a video uninteresting or of poor quality while watching and chooses to block it, all videos belonging to the same classification result as that video, i.e. the other videos with the same formula and the same story, are blocked as well, which improves the user's viewing experience.
The steps of the above methods are divided only for clarity of description; in implementation, they may be combined into one step, or a step may be split into multiple steps, and all such variants are within the protection scope of this patent as long as the same logical relationship is included. Adding insignificant modifications to the algorithm or process, or introducing insignificant design changes, without changing the core design of the algorithm or process, also falls within the protection scope of this patent.
A third embodiment of the invention relates to a server, as shown in fig. 5, comprising at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401, so that the at least one processor 401 can execute the video recommendation method according to the first or second embodiment.
The memory 402 and the processor 401 are coupled by a bus, which may include any number of interconnected buses and bridges linking one or more circuits of the processor 401 and the memory 402 together. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be a single element or a plurality of elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 401 may be transmitted over a wireless medium via an antenna, and the antenna may also receive data and transmit it to the processor 401.
The processor 401 is responsible for managing the bus and general processing, and may provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 402 may be used to store data used by the processor 401 in performing operations.
A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.
That is, as those skilled in the art can understand, all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.