CN114189708A - A kind of video content identification method and related device - Google Patents

A kind of video content identification method and related device

Info

Publication number
CN114189708A
Authority
CN
China
Prior art keywords
video
identified
video content
determining
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111485807.5A
Other languages
Chinese (zh)
Inventor
张鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid E Commerce Technology Co Ltd
Original Assignee
State Grid E Commerce Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid E Commerce Technology Co Ltd
Priority to CN202111485807.5A
Publication of CN114189708A
Legal status: Pending (Current)

Abstract

The embodiment of the application discloses a video content identification method and a related apparatus. A first to-be-determined video content label and a second to-be-determined video content label corresponding to a video to be identified are determined according to the image information and the audio information of the video to be identified respectively; the first to-be-determined video content label reflects the video content of the video to be identified from the image-information dimension, and the second to-be-determined video content label reflects it from the audio-information dimension. The processing device can therefore determine, according to the two to-be-determined video content labels, the video content label corresponding to the video to be identified, where the video content label is used for identifying the video content corresponding to the video to be identified. The video content is thus identified comprehensively by combining information from the image dimension and the audio dimension, so that the identification efficiency of the video content is improved on the premise of ensuring the identification accuracy of the video content.

Description

Video content identification method and related device
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video content identification method and a related apparatus.
Background
With the continued improvement of hardware and networks, more and more video platforms and applications have appeared. Video has become an indispensable part of internet information, and the larger the share of received information it accounts for, the heavier the work of auditing and classifying that information becomes.
A key link in the auditing is identifying the content of the video. Current video systems mainly rely on manual work to identify video content: the auditing of video data is typically divided into several layers, such as primary auditing, secondary auditing, and expert review, and the compliance and attributes of the video content are judged through these layers. This identification mode involves many auditing levels and high labor cost, and it is difficult to meet the diverse video auditing requirements of today.
Disclosure of Invention
In order to solve the above technical problem, the application provides a video content identification method in which a processing device comprehensively identifies the content of a video to be identified by combining two dimensions, video images and audio information, so that the identification efficiency of the video content is improved on the premise of ensuring the identification accuracy.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application discloses a video content identification method, where the method includes:
acquiring a video to be identified;
determining a first to-be-determined video content label corresponding to the to-be-identified video according to the image information of the to-be-identified video;
determining a second to-be-determined video content label corresponding to the to-be-identified video according to the audio information of the to-be-identified video;
and determining a video content label corresponding to the video to be identified according to the first to-be-determined video content label and the second to-be-determined video content label, wherein the video content label is used for identifying the video content corresponding to the video to be identified.
In one possible implementation, the method further includes:
determining first to-be-determined attribute information corresponding to the to-be-identified video according to the image information, wherein the first to-be-determined attribute information is used for identifying the matching degree between the video content of the to-be-identified video and a plurality of attributes under the label of the first to-be-determined video content;
determining second undetermined attribute information corresponding to the video to be identified according to the audio information, wherein the second undetermined attribute information is used for identifying the matching degree between the video content of the video to be identified and a plurality of attributes under the label of the second undetermined video content;
and determining attribute information corresponding to the video to be identified according to the first to-be-determined attribute information and the second to-be-determined attribute information, wherein the attribute information is used for identifying the matching degree between the video content of the video to be identified and a plurality of attributes under the video content label.
In a possible implementation manner, the image information includes a plurality of consecutive video frame images corresponding to the video to be identified, and the determining, according to the image information of the video to be identified, a first to-be-determined video content label corresponding to the video to be identified includes:
determining pixel mean difference values between adjacent ones of the plurality of successive video frame images;
determining a video frame image of a next frame in the target adjacent video frame images as a key frame image in response to the pixel average difference value between the target adjacent video frame images being greater than a preset threshold value;
determining a video frame image positioned at an intermediate frame among the plurality of consecutive video frame images as a key frame image in response to none of the pixel average difference values between adjacent video frame images among the plurality of consecutive video frame images being greater than the preset threshold value;
and determining a first to-be-determined video content label corresponding to the to-be-identified video according to the key frame image.
In one possible implementation, the method further includes:
denoising the key frame image;
determining a first to-be-determined video content tag corresponding to the to-be-identified video according to the key frame image, wherein the determining comprises the following steps:
and determining a first to-be-determined video content label corresponding to the to-be-identified video according to the denoised key frame image.
In a possible implementation manner, determining a second to-be-determined video content tag corresponding to the to-be-identified video according to the audio information of the to-be-identified video includes:
determining text information corresponding to the audio information;
sentence dividing processing is carried out on the text information, and the sentence dividing processing is used for converting the text information into text information taking sentences as units;
recognizing word information in the processed text information according to a preset dictionary;
and determining a second to-be-determined video content label corresponding to the to-be-identified video according to the word information.
In a possible implementation manner, the recognizing, according to a preset dictionary, word information in the processed text information includes:
browsing the processed text information according to a first sequence by taking sentences as units;
determining words matched with the preset dictionary in the processed text information as first to-be-determined word information;
browsing the processed text information according to a second sequence by taking sentences as units, wherein the second sequence is opposite to the first sequence;
determining words matched with the preset dictionary in the processed text information as second undetermined word information;
determining word information corresponding to the video to be identified according to the first to-be-determined word information and the second to-be-determined word information.
In a second aspect, an embodiment of the present application discloses a video content identification apparatus, where the apparatus includes an obtaining unit, a first determining unit, a second determining unit, and a third determining unit:
the acquisition unit is used for acquiring a video to be identified;
the first determining unit is used for determining a first to-be-determined video content label corresponding to the to-be-identified video according to the image information of the to-be-identified video;
the second determining unit is used for determining a second to-be-determined video content tag corresponding to the to-be-identified video according to the audio information of the to-be-identified video;
the third determining unit is configured to determine, according to the first to-be-determined video content tag and the second to-be-determined video content tag, a video content tag corresponding to the to-be-identified video, where the video content tag is used to identify video content corresponding to the to-be-identified video.
In one possible implementation manner, the apparatus further includes a fourth determining unit, a fifth determining unit, and a sixth determining unit:
the fourth determining unit is configured to determine, according to the image information, first to-be-determined attribute information corresponding to the to-be-identified video, where the first to-be-determined attribute information is used to identify a matching degree between video content of the to-be-identified video and a plurality of attributes under a first to-be-determined video content tag;
the fifth determining unit is configured to determine, according to the audio information, second undetermined attribute information corresponding to the video to be recognized, where the second undetermined attribute information is used to identify a matching degree between video content of the video to be recognized and a plurality of attributes under a label of the second undetermined video content;
the sixth determining unit is configured to determine, according to the first to-be-determined attribute information and the second to-be-determined attribute information, attribute information corresponding to the to-be-identified video, where the attribute information is used to identify matching degrees between video content of the to-be-identified video and multiple attributes under the video content tag.
In a possible implementation manner, the image information includes a plurality of consecutive video frame images corresponding to the video to be identified, and the first determining unit is specifically configured to:
determining pixel mean difference values between adjacent ones of the plurality of successive video frame images;
determining a video frame image of a next frame in the target adjacent video frame images as a key frame image in response to the pixel average difference value between the target adjacent video frame images being greater than a preset threshold value;
determining a video frame image positioned at an intermediate frame among the plurality of consecutive video frame images as a key frame image in response to none of the pixel average difference values between adjacent video frame images among the plurality of consecutive video frame images being greater than the preset threshold value;
and determining a first to-be-determined video content label corresponding to the to-be-identified video according to the key frame image.
In one possible implementation, the apparatus further includes a denoising unit:
the denoising unit is used for denoising the key frame image;
the first determining unit is specifically configured to:
and determining a first to-be-determined video content label corresponding to the to-be-identified video according to the denoised key frame image.
In a possible implementation manner, the second determining unit is specifically configured to:
determining text information corresponding to the audio information;
sentence dividing processing is carried out on the text information, and the sentence dividing processing is used for converting the text information into text information taking sentences as units;
recognizing word information in the processed text information according to a preset dictionary;
and determining a second to-be-determined video content label corresponding to the to-be-identified video according to the word information.
In a possible implementation manner, the second determining unit is specifically configured to:
browsing the processed text information according to a first sequence by taking sentences as units;
determining words matched with the preset dictionary in the processed text information as first to-be-determined word information;
browsing the processed text information in a second order, which is opposite to the first order, in sentence units;
determining words matched with the preset dictionary in the processed text information as second undetermined word information;
determining word information corresponding to the video to be identified according to the first to-be-determined word information and the second to-be-determined word information.
According to the technical scheme, when video processing is carried out, a video to be identified is first acquired, and then a first to-be-determined video content label and a second to-be-determined video content label corresponding to the video to be identified are determined according to its image information and audio information respectively. The first to-be-determined video content label reflects the video content of the video to be identified from the image-information dimension, and the second to-be-determined video content label reflects it from the audio-information dimension. The processing device can therefore determine, according to the first to-be-determined video content label and the second to-be-determined video content label, the video content label corresponding to the video to be identified, where the video content label is used for identifying the video content corresponding to the video to be identified. In this way the video content of the video to be identified is identified comprehensively by combining information from the two dimensions, so that the identification efficiency of the video content is improved on the premise of ensuring the identification accuracy of the video content.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a video content identification method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a video content identification method in an actual application scenario according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a system architecture according to an embodiment of the present application;
fig. 4 is a block diagram illustrating a structure of a video content recognition apparatus according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In the related art, the auditing of video content is mainly performed manually, for example by manually browsing the video to identify the video content in it. However, with the continuous development of information technology, the amount of video to be identified keeps increasing, and purely manual identification can hardly satisfy the current video content identification requirements.
In order to solve this technical problem, the application provides a video content identification method in which a processing device comprehensively identifies the content of a video to be identified by combining two dimensions, video images and audio information, so that the identification efficiency of the video content is improved on the premise of ensuring the identification accuracy.
It can be understood that the method may be applied to a processing device capable of performing video content identification, for example a terminal device or a server having a video content identification function. The method may be executed independently by the terminal device or the server, or it may be applied in a network scenario in which the terminal device and the server communicate and be executed through their cooperation. The terminal device may be a computer, a mobile phone, or the like. The server may be an application server or a Web server; in actual deployment, the server may be an independent server or a cluster server.
Next, a video content identification method provided by an embodiment of the present application will be described with reference to the drawings.
Referring to fig. 1, fig. 1 is a flowchart of a video content identification method provided in an embodiment of the present application, where the method includes:
s101: and acquiring a video to be identified.
The video to be identified can be any video needing video content identification, and the video content refers to content displayed in the video.
S102: and determining a first to-be-determined video content label corresponding to the to-be-identified video according to the image information of the to-be-identified video.
In order to guarantee the accuracy of content identification when video content is identified automatically, in the embodiment of the present application the processing device may identify the video content comprehensively by combining two information dimensions, namely video and audio. The image information and the audio information of a video can both reflect its specific content to a certain extent; for example, when the video is an animal video, the animal in it can be identified both through the animal images in the video frames and through the animal cries in the audio. Identifying the video content by combining these two dimensions of information therefore helps to identify the video content accurately.
First, the processing device may determine, according to image information of the video to be recognized, a first to-be-determined video content tag corresponding to the video to be recognized, where the first to-be-determined video content tag is capable of representing video content of the video to be recognized in a dimension of video information.
S103: and determining a second to-be-determined video content label corresponding to the to-be-identified video according to the audio information of the to-be-identified video.
And the second pending video content tag can represent the video content of the video to be identified under the dimensionality of the audio information.
S104: and determining a video content label corresponding to the video to be identified according to the first video content label to be determined and the second video content label to be determined.
The video content label is used for identifying the video content corresponding to the video to be identified. The processing device can integrate the first to-be-determined video content tag and the second to-be-determined video content tag, determine the information content respectively identified under two information dimensions, and further obtain a relatively accurate and comprehensive video content identification result.
According to the technical scheme, when video processing is carried out, a video to be identified is first acquired, and then a first to-be-determined video content label and a second to-be-determined video content label corresponding to the video to be identified are determined according to its image information and audio information respectively. The first to-be-determined video content label reflects the video content of the video to be identified from the image-information dimension, and the second to-be-determined video content label reflects it from the audio-information dimension. The processing device can therefore determine, according to the first to-be-determined video content label and the second to-be-determined video content label, the video content label corresponding to the video to be identified, where the video content label is used for identifying the video content corresponding to the video to be identified. In this way the video content of the video to be identified is identified comprehensively by combining information from the two dimensions, so that the identification efficiency of the video content is improved on the premise of ensuring the identification accuracy of the video content.
In order to further improve the accuracy of video content identification, in one possible implementation the processing device may identify the video content in more detail on the basis of the video content label. The processing device may determine, according to the image information, first to-be-determined attribute information corresponding to the video to be identified, where the first to-be-determined attribute information is used to identify the matching degree between the video content of the video to be identified and a plurality of attributes under the first to-be-determined video content label; that is, an attribute is a finer-grained content identifier than a label. For example, "car category content" may be a content label, and car categories such as "sports car" and "truck" may be attributes corresponding to that label.
Similarly, the processing device may determine, according to the audio information, second to-be-determined attribute information corresponding to the video to be identified, where the second to-be-determined attribute information is used to identify the matching degree between the video content of the video to be identified and a plurality of attributes under the second to-be-determined video content label. The processing device can then determine the attribute information corresponding to the video to be identified according to the first to-be-determined attribute information and the second to-be-determined attribute information, where the attribute information is used to identify the matching degree between the video content of the video to be identified and a plurality of attributes under the video content label. In this way identification results of different granularities are obtained, further improving the accuracy of video content identification.
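The application does not prescribe how the two dimensions are merged; the following minimal Python sketch assumes each dimension yields a list of pending labels plus per-attribute matching degrees, and fuses them by keeping shared labels first and averaging the matching degrees. The function and variable names are illustrative only.

```python
# A minimal sketch of fusing the pending labels and attribute matching degrees obtained
# from the image dimension and the audio dimension. The fusion rule (union of labels,
# simple averaging of matching degrees) is an assumption; the application does not give one.
from typing import Dict, List, Tuple


def fuse_dimensions(
    image_tags: List[str],
    audio_tags: List[str],
    image_attrs: Dict[str, float],
    audio_attrs: Dict[str, float],
) -> Tuple[List[str], Dict[str, float]]:
    """Combine the pending labels and attribute matching degrees of both dimensions."""
    # Labels confirmed by both dimensions come first, followed by the remaining ones.
    shared = [t for t in image_tags if t in audio_tags]
    merged_tags = shared + [t for t in image_tags + audio_tags if t not in shared]

    # Average the matching degree of each attribute over the dimensions that report it.
    merged_attrs: Dict[str, float] = {}
    for attr in set(image_attrs) | set(audio_attrs):
        scores = [d[attr] for d in (image_attrs, audio_attrs) if attr in d]
        merged_attrs[attr] = sum(scores) / len(scores)
    return merged_tags, merged_attrs


# Example: a video labeled "car category content" in both dimensions,
# with attribute matching degrees for "sports car" and "truck".
tags, attrs = fuse_dimensions(
    ["car category content"], ["car category content"],
    {"sports car": 0.8, "truck": 0.1}, {"sports car": 0.6},
)
```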
In one possible implementation, the processing device may determine the video content corresponding to the video to be identified by identifying a key frame image in the video, where the key frame image is a video frame image capable of highlighting the video content.
In this embodiment, the image information of the video to be identified may include a plurality of consecutive video frame images corresponding to the video to be identified, where "consecutive" means consecutive in the time dimension. The processing device may determine the key frame images among the plurality of video frame images in the following way:
the processing device may determine a pixel mean difference value between adjacent ones of the plurality of successive video frame images, the pixel mean difference value being capable of reflecting a difference between image content of different video frame images. And responding to the fact that the pixel average difference value between the target adjacent video frame images is larger than a preset threshold value, and showing that the image content difference between the target adjacent video frame images is large, wherein the video to be identified may have scene change or transition in the several frame images. The processing equipment can determine the video frame image of the next frame in the target adjacent video frame images as a key frame image, and the key frame image can accurately embody the video content of each scene in the video to be identified;
in addition, in response to that the pixel average difference value between adjacent video frame images does not exist in the plurality of continuous video frame images is greater than the preset threshold, it is indicated that the video to be identified has no scene change or transition within a long time, and it can be indicated that the video content within the period is important to a certain extent.
The processing equipment can determine the first to-be-determined video content label corresponding to the to-be-identified video according to the key frame image, so that content identification is not needed to be carried out on each frame image, and the identification efficiency of the video content is further improved on the premise of ensuring the identification accuracy.
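As a rough illustration of the key-frame selection rule just described, the following Python sketch uses OpenCV to compute the pixel average difference between adjacent frames; the grayscale comparison and the threshold value of 15 are assumptions, not values given in the application.

```python
# A sketch of the key-frame selection rule: keep the later frame of any adjacent pair
# whose mean pixel difference exceeds a threshold; if no pair exceeds it, keep the
# middle frame of the sequence.
import cv2
import numpy as np


def extract_key_frames(video_path: str, threshold: float = 15.0) -> list:
    """Return key frame images chosen by the pixel-mean-difference rule."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    if not frames:
        return []

    key_frames = []
    for prev, curr in zip(frames, frames[1:]):
        # Mean absolute pixel difference between adjacent frames (grayscale).
        diff = np.mean(
            cv2.absdiff(
                cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
                cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY),
            )
        )
        if diff > threshold:
            # Large difference: likely a scene change, keep the later frame.
            key_frames.append(curr)
    if not key_frames:
        # No adjacent pair exceeded the threshold: keep the middle frame.
        key_frames.append(frames[len(frames) // 2])
    return key_frames
```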
In order to improve the accuracy of image recognition, in a possible implementation manner the processing device can perform denoising processing on the key frame image, and then determine the first to-be-determined video content label corresponding to the video to be identified according to the denoised key frame image.
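A minimal sketch of the denoising step; the application does not name a denoising algorithm, so OpenCV's non-local means denoising is assumed here as one possible choice.

```python
# Assumed choice: non-local means denoising of a BGR key frame before recognition.
import cv2


def denoise_key_frame(frame):
    """Denoise a BGR key frame image before content recognition."""
    return cv2.fastNlMeansDenoisingColored(frame, None, 10, 10, 7, 21)
```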
Likewise, when video content is identified based on the audio information, there may be several ways of performing the identification. In a possible implementation manner, the processing device may first determine the text information corresponding to the audio information, for example by recognizing it through a speech recognition technology. The processing device may then perform sentence division processing on the text information, the sentence division processing being used to convert the text information into text information in units of sentences; for example, the division may be performed at punctuation marks (e.g., commas and periods).
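A small sketch of this audio-side preprocessing; the transcribe() function is a hypothetical placeholder for whatever speech recognition service is actually used, and the punctuation set used for sentence division is an assumption.

```python
# Sentence division of the transcribed text, splitting on common Chinese and
# Western punctuation marks.
import re


def transcribe(audio_path: str) -> str:
    """Hypothetical placeholder: return the text recognized from the audio track."""
    raise NotImplementedError("plug in the speech recognition service here")


def split_sentences(text: str) -> list:
    """Convert free text into a list of sentences."""
    parts = re.split(r"[，。！？,.!?;；]+", text)
    return [p.strip() for p in parts if p.strip()]
```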
Subsequently, the processing device may recognize word information in the processed text information according to a preset dictionary, that is, divide the text information into words; according to the word information, the processing device may then determine the second to-be-determined video content label corresponding to the video to be identified.
In order to improve the accuracy of determining the word information, in one possible implementation manner the processing device may browse the processed text information sentence by sentence in a first order and determine the words that match the preset dictionary as the first to-be-determined word information; it may then browse the processed text information sentence by sentence in a second order, opposite to the first order, and determine the words that match the preset dictionary as the second to-be-determined word information, so that word recognition is performed in both browsing directions. The processing device can then determine the word information corresponding to the video to be identified according to the first to-be-determined word information and the second to-be-determined word information, which improves the accuracy of word recognition.
For example, the processing device may first identify words in the text information from left to right to obtain the first to-be-determined word information, taking the longest matched word as the optimal solution at each position; it may then identify words in the text information from right to left to obtain the second to-be-determined word information, which ensures the matching rate of the words.
The processing device may then analyze the semantics of the word information obtained by dictionary matching, determine the content labels corresponding to the word information, and thereby obtain the second to-be-determined video content label.
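The following sketch illustrates this bidirectional dictionary matching: a forward (left-to-right) longest-match pass yields the first to-be-determined word information, a backward (right-to-left) pass yields the second, and the two are merged by a common heuristic (keep the pass that yields fewer words). The merge rule and the maximum word length are assumptions; the application only states that both directions are combined.

```python
# Bidirectional maximum matching against a preset dictionary.
def forward_match(sentence: str, dictionary: set, max_len: int = 6) -> list:
    """Left-to-right longest-match scan (first to-be-determined word information)."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_match(sentence: str, dictionary: set, max_len: int = 6) -> list:
    """Right-to-left longest-match scan (second to-be-determined word information)."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in dictionary:
                words.insert(0, piece)
                j -= size
                break
    return words


def bidirectional_match(sentence: str, dictionary: set) -> list:
    """Merge the two passes; preferring the coarser segmentation is an assumed heuristic."""
    first = forward_match(sentence, dictionary)
    second = backward_match(sentence, dictionary)
    return first if len(first) <= len(second) else second
```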
In order to facilitate understanding of the technical solution provided by the embodiment of the present application, a video content identification method provided by the embodiment of the present application will be introduced in combination with an actual application scenario.
Referring to fig. 2, fig. 2 is a schematic diagram of a video content identification method in an actual application scenario according to an embodiment of the present application, where the method may be applied to the system architecture shown in fig. 3, where a processing device may be a server, and the server has a video image identification system therein.
The processing device can fill in parameters, interface addresses, ports and other information of the system to be docked, and perform the docking operation after this information is saved. Development and docking configuration of the target system are also required during docking, and the interface of the target system must return data in the standardized form required by the identification system. Only after docking is completed can the subsequent video identification process be performed; the docking is used for acquiring the videos to be identified.
The video collection module may then be started according to the configured system docking module. During collection, the scheme may dock with the same server or with different servers; the video files of the target system are collected into the video image recognition system through the previously established system docking, and the collected video data are stored in the video system.
After the video data to be identified are obtained, each video is analyzed by the video analysis module, the video content contained in each video is identified, and the resulting labels and attribute information are stored with the corresponding video data. Finally, the analyzed video data are pushed back to the target system through the data interface module according to the initially configured system docking module, completing the data transmission.
The specific identification process is shown in Fig. 2. In the video direction, the processing device may first obtain key frames, determine the image information corresponding to each key frame, perform image denoising, and then perform identification analysis on the image to obtain the labels and attribute values corresponding to the image information; the labels and attribute values of the individual images are then assembled for analysis. In the audio direction, the processing device may extract the audio corresponding to the video data, convert the audio into text information, perform sentence segmentation and word segmentation on the text, determine the label information and attribute information corresponding to each word after semantic analysis, and then assemble the label information and attribute information of the words for analysis. Finally, the processing device can combine the information obtained from recognition in the audio direction and the video direction, generate the video image model corresponding to the video, and obtain the identification result of the video content. The processing device can return the video image model and the video content identification result to the target system that produced the video, completing the automatic identification of the video content.
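Tying the pieces together, a hedged sketch of how the two directions could be assembled into a video image model is shown below; recognize_image() and analyze_semantics() are hypothetical stand-ins for the image recognition and semantic analysis steps, which the application does not detail, and the other helpers are the illustrative ones sketched in the preceding sections.

```python
# Assembling the video direction and the audio direction into one identification result.
def recognize_image(frame):
    """Hypothetical stand-in: return (labels, attribute matching degrees) for one key frame."""
    raise NotImplementedError


def analyze_semantics(words):
    """Hypothetical stand-in: return (labels, attribute matching degrees) for the word list."""
    raise NotImplementedError


def identify_video(video_path: str, audio_path: str, dictionary: set) -> dict:
    # Video direction: key frames -> denoising -> image recognition.
    image_tags, image_attrs = [], {}
    for frame in extract_key_frames(video_path):
        tags, attrs = recognize_image(denoise_key_frame(frame))
        image_tags.extend(tags)
        image_attrs.update(attrs)

    # Audio direction: transcription -> sentence division -> word matching -> semantic analysis.
    words = []
    for sentence in split_sentences(transcribe(audio_path)):
        words.extend(bidirectional_match(sentence, dictionary))
    audio_tags, audio_attrs = analyze_semantics(words)

    # Combine both directions into the final video content labels and attribute information.
    final_tags, final_attrs = fuse_dimensions(image_tags, audio_tags, image_attrs, audio_attrs)
    return {"tags": final_tags, "attributes": final_attrs}
```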
Based on the video content identification method provided by the foregoing embodiment, an embodiment of the present application further provides a video content identification apparatus, referring to fig. 4, fig. 4 is a block diagram of a structure of a video content identification apparatus 400 provided by the embodiment of the present application, and the apparatus includes an obtainingunit 401, a first determiningunit 402, a second determiningunit 403, and a third determining unit 404:
the acquiringunit 401 is configured to acquire a video to be identified;
the first determiningunit 402 is configured to determine, according to the image information of the video to be identified, a first to-be-determined video content tag corresponding to the video to be identified;
the second determiningunit 403 is configured to determine, according to the audio information of the video to be identified, a second to-be-determined video content tag corresponding to the video to be identified;
the third determiningunit 404 is configured to determine, according to the first to-be-determined video content tag and the second to-be-determined video content tag, a video content tag corresponding to the to-be-identified video, where the video content tag is used to identify video content corresponding to the to-be-identified video.
In one possible implementation manner, the apparatus further includes a fourth determining unit, a fifth determining unit, and a sixth determining unit:
the fourth determining unit is configured to determine, according to the image information, first to-be-determined attribute information corresponding to the to-be-identified video, where the first to-be-determined attribute information is used to identify a matching degree between video content of the to-be-identified video and a plurality of attributes under a first to-be-determined video content tag;
the fifth determining unit is configured to determine, according to the audio information, second undetermined attribute information corresponding to the video to be recognized, where the second undetermined attribute information is used to identify a matching degree between video content of the video to be recognized and a plurality of attributes under a label of the second undetermined video content;
the sixth determining unit is configured to determine, according to the first to-be-determined attribute information and the second to-be-determined attribute information, attribute information corresponding to the to-be-identified video, where the attribute information is used to identify matching degrees between video content of the to-be-identified video and multiple attributes under the video content tag.
In a possible implementation manner, the image information includes a plurality of consecutive video frame images corresponding to the video to be identified, and the first determiningunit 402 is specifically configured to:
determining pixel mean difference values between adjacent ones of the plurality of successive video frame images;
determining a video frame image of a next frame in the target adjacent video frame images as a key frame image in response to the pixel average difference value between the target adjacent video frame images being greater than a preset threshold value;
determining a video frame image positioned at an intermediate frame among the plurality of consecutive video frame images as a key frame image in response to none of the pixel average difference values between adjacent video frame images among the plurality of consecutive video frame images being greater than the preset threshold value;
and determining a first to-be-determined video content label corresponding to the to-be-identified video according to the key frame image.
In one possible implementation, the apparatus further includes a denoising unit:
the denoising unit is used for denoising the key frame image;
the first determiningunit 402 is specifically configured to:
and determining a first to-be-determined video content label corresponding to the to-be-identified video according to the denoised key frame image.
In a possible implementation manner, the second determiningunit 403 is specifically configured to:
determining text information corresponding to the audio information;
sentence dividing processing is carried out on the text information, and the sentence dividing processing is used for converting the text information into text information taking sentences as units;
recognizing word information in the processed text information according to a preset dictionary;
and determining a second to-be-determined video content label corresponding to the to-be-identified video according to the word information.
In a possible implementation manner, the second determiningunit 403 is specifically configured to:
browsing the processed text information according to a first sequence by taking sentences as units;
determining words matched with the preset dictionary in the processed text information as first to-be-determined word information;
browsing the processed text information in a second order, which is opposite to the first order, in sentence units;
determining words matched with the preset dictionary in the processed text information as second undetermined word information;
determining word information corresponding to the video to be identified according to the first to-be-determined word information and the second to-be-determined word information.
Those of ordinary skill in the art will understand that all or part of the steps for realizing the above method embodiments can be completed by hardware controlled by program instructions. The program can be stored in a computer readable storage medium, and when executed it performs the steps of the method embodiments. The aforementioned storage medium may be at least one of the following media capable of storing program code: read-only memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

CN202111485807.5A | Priority date: 2021-12-07 | Filing date: 2021-12-07 | A kind of video content identification method and related device | Pending | CN114189708A (en)

Priority Applications (1)

Application number: CN202111485807.5A (CN114189708A) | Priority date: 2021-12-07 | Filing date: 2021-12-07 | Title: A kind of video content identification method and related device

Applications Claiming Priority (1)

Application number: CN202111485807.5A (CN114189708A) | Priority date: 2021-12-07 | Filing date: 2021-12-07 | Title: A kind of video content identification method and related device

Publications (1)

Publication number: CN114189708A (en) | Publication date: 2022-03-15

Family

ID=80603674

Family Applications (1)

Application number: CN202111485807.5A | Status: Pending | Publication: CN114189708A (en) | Priority date: 2021-12-07 | Filing date: 2021-12-07 | Title: A kind of video content identification method and related device

Country Status (1)

Country: CN (1) | Link: CN114189708A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN106250837A (en)* | 2016-07-27 | 2016-12-21 | 腾讯科技(深圳)有限公司 | The recognition methods of a kind of video, device and system
CN108419091A (en)* | 2018-03-02 | 2018-08-17 | 北京未来媒体科技股份有限公司 | A kind of verifying video content method and device based on machine learning
CN109376603A (en)* | 2018-09-25 | 2019-02-22 | 北京周同科技有限公司 | A kind of video frequency identifying method, device, computer equipment and storage medium
CN110852231A (en)* | 2019-11-04 | 2020-02-28 | 云目未来科技(北京)有限公司 | Illegal video detection method and device and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117009577A (en)* | 2022-07-20 | 2023-11-07 | 腾讯科技(深圳)有限公司 | Video data processing method, device, equipment and readable storage medium

Similar Documents

Publication | Title
US20220270369A1 (en)Intelligent cataloging method for all-media news based on multi-modal information fusion understanding
CN110147726B (en)Service quality inspection method and device, storage medium and electronic device
CN113010638B (en)Entity recognition model generation method and device and entity extraction method and device
CN110717470B (en)Scene recognition method and device, computer equipment and storage medium
CN110795919A (en)Method, device, equipment and medium for extracting table in PDF document
CN115994230A (en)Intelligent archive construction method integrating artificial intelligence and knowledge graph technology
CN113780229B (en) Text recognition method and device
CN116049397B (en)Sensitive information discovery and automatic classification method based on multi-mode fusion
KR102002024B1 (en)Method for processing labeling of object and object management server
CN112925905B (en)Method, device, electronic equipment and storage medium for extracting video subtitles
CN114528851B (en)Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
US20210326599A1 (en)System and method for automatically detecting and marking logical scenes in media content
CN113992944A (en)Video cataloging method, device, equipment, system and medium
CN114625897A (en)Multimedia resource processing method and device, electronic equipment and storage medium
CN114708595A (en) Image document structured analysis method, system, electronic device, storage medium
CN114051154A (en)News video strip splitting method and system
CN115205757B (en)Risk identification method, apparatus, device and storage medium
CN114676705A (en)Dialogue relation processing method, computer and readable storage medium
CN114529784B (en)Trademark infringement analysis method and system for E-commerce data
CN110825874A (en)Chinese text classification method and device and computer readable storage medium
CN114780757A (en)Short media label extraction method and device, computer equipment and storage medium
CN111881943A (en)Method, device, equipment and computer readable medium for image classification
CN114189708A (en) A kind of video content identification method and related device
CN117933260A (en)Text quality analysis method, device, equipment and storage medium
CN114491010B (en)Training method and device for information extraction model

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication | Application publication date: 2022-03-15
