CN114037946B

Movatterモバイル変換

Info

Publication number: CN114037946B
Application number: CN202111556380.3A
Authority: CN
Inventors: 孙利娟; 吴京宸; 吴旭; 颉夏青; 李飞; 张熙; 杨金翠; 邱莉榕; 张勇东; 方滨兴
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-12-17
Filing date: 2021-12-17
Publication date: 2025-05-27
Anticipated expiration: 2041-12-17
Also published as: CN114037946A

Abstract

The application discloses a video classification method, a video classification device, electronic equipment and a video classification medium. The method comprises the steps of obtaining video data to be classified, inputting the video data to be classified into an audio-video learning network to obtain image features and audio features corresponding to the video to be classified and text features corresponding to the video to be classified, inputting the image features, the audio features and the text features into a fusion learning network to obtain fusion feature vectors, inputting the fusion feature vectors into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified. By applying the technical scheme of the application, after the video to be classified is acquired, the image features, the audio features and the text features of the video data are obtained by utilizing a preset learning network model, and the three features are fused, so that the classification result of the video to be classified is judged according to the fused features. Thus, the defect of inaccurate classification of video data in the related art is avoided.

Description

Video classification method, device, electronic equipment and medium

Technical Field

The present application relates to data processing technologies, and in particular, to a method, an apparatus, an electronic device, and a medium for video classification.

Background

With the rapid development of mobile internet technology, the continuous increase of network transmission speed and the continuous progress of compression technology, various multimedia information are continuously emerging, and digital libraries, remote education, video on demand, digital video broadcasting, interactive television, etc. all generate and use a large amount of video data.

On this basis, video classification is also an important research topic in the field of multimedia analysis. Video classification is the basis for many video applications, which provides convenience for the management of increasingly video data. Content-based video retrieval, video summary summarization, video indexing, tagging, and the like are all driving the development of video classification techniques.

However, the automatic classification of video by computer in the related art is low in accuracy, and the video can be classified only manually. This also results in inefficient classification.

Disclosure of Invention

The embodiment of the application provides a video classification method, a video classification device, electronic equipment and a video classification medium. The method is used for solving the problem that video classification cannot be accurately performed in the related art.

According to one aspect of the embodiment of the present application, a method for classifying video is provided, including:

acquiring video data to be classified;

Inputting the video data to be classified into an audio and video learning network to obtain image features and audio features corresponding to the video to be classified;

Inputting the image features, the audio features and the text features into a fusion learning network to obtain fusion feature vectors;

and inputting the fusion feature vector into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified.

Optionally, in another embodiment of the above method according to the present application, the inputting the image feature, the audio feature, and the text feature into a fusion learning network to obtain a fusion feature vector includes:

vector conversion is carried out on the image feature, the audio feature and the text feature respectively to obtain an image feature vector, an audio feature vector and a text feature vector;

Carrying out vector addition on the image feature vector, the audio feature vector and the text feature vector to obtain a first fusion feature vector, and carrying out product normalization on the image feature vector, the audio feature vector and the text feature vector to obtain a second fusion feature vector

And obtaining the fusion feature vector based on the first fusion feature vector and the second fusion feature vector.

Optionally, in another embodiment of the above method according to the present application, the obtaining the fused feature vector based on the first fused feature vector and the second fused feature vector includes:

generating a plurality of weight coefficient vectors, and carrying out normalization processing on the plurality of weight coefficient vectors to obtain a fusion weight coefficient;

and carrying out weighted summation on the first fusion feature vector and the second fusion feature vector by utilizing the fusion weight coefficient to obtain the fusion feature vector.

Optionally, in another embodiment of the foregoing method according to the present application, the weighting and summing the first fused feature vector and the second fused feature vector by using the fusion weight coefficient to obtain the fused feature vector includes:

Weighting and summing the first fusion feature vector and the second fusion feature vector by utilizing the fusion weight coefficient for a plurality of times to obtain a plurality of primary fusion feature vectors;

self-adaptively updating the fusion weight coefficient by using a loss function to obtain an updated weight coefficient;

and carrying out weighted summation on the plurality of primary fusion feature vectors by using the updated weight coefficient to obtain the fusion feature vector.

Optionally, in another embodiment of the above method according to the present application, the inputting the video data to be classified into a text learning network to obtain text features corresponding to the video to be classified includes:

performing voice recognition on the video data to be classified to obtain a text to be processed;

converting the letter fields and the expression fields contained in the text to be processed into text fields by using a preset conversion rule;

converting the text to be processed containing the text field into a one-hot vector;

and inputting the one-hot vector to the text learning network for deep semantic feature extraction to obtain the text features.

Optionally, in another embodiment of the foregoing method according to the present application, the inputting the video data to be classified into an audio-video learning network to obtain image features corresponding to the video to be classified includes:

Dividing the video data to be classified into a plurality of sub video data according to a preset interval duration;

Respectively extracting a key frame image in each piece of sub-video data to obtain a plurality of key frame image sets;

Sequentially arranging a plurality of key frame images in the key frame image set according to a time sequence to obtain image data to be input;

And inputting the image data to be input into the audio and video learning network to obtain the image characteristics corresponding to the video to be classified.

Optionally, in another embodiment of the above method according to the present application, the inputting the video data to be classified into an audio-video learning network to obtain audio features corresponding to the video to be classified includes:

extracting audio data to be processed contained in the video data to be classified;

and inputting the audio data to be processed into an audio and video learning network to obtain the audio characteristics corresponding to the video to be classified.

According to still another aspect of the embodiment of the present application, there is provided an apparatus for classifying video, including:

The acquisition module is configured to acquire video data to be classified;

the generation module is configured to input the video data to be classified into an audio and video learning network to obtain image features and audio features corresponding to the video to be classified;

The fusion module is configured to input the image features, the audio features and the text features into a fusion learning network to obtain fusion feature vectors;

And the classification module is configured to input the fusion feature vector into a Softmax classifier, and take a classification result output by the classifier as a classification result of the video to be classified.

According to still another aspect of an embodiment of the present application, there is provided an electronic apparatus including:

A memory for storing executable instructions, and

And a display for executing the executable instructions with the memory to perform the operations of any of the methods of video classification described above.

According to a further aspect of an embodiment of the present application, there is provided a computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of any of the methods of video classification described above.

The method comprises the steps of obtaining video data to be classified, inputting the video data to be classified into an audio/video learning network to obtain image features and audio features corresponding to the video to be classified, inputting the video data to be classified into a text learning network to obtain text features corresponding to the video to be classified, inputting the image features, the audio features and the text features into a fusion learning network to obtain fusion feature vectors, inputting the fusion feature vectors into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified. By applying the technical scheme of the application, after the video to be classified is acquired, the image features, the audio features and the text features of the video data are obtained by utilizing a preset learning network model, and the three features are fused, so that the classification result of the video to be classified is judged according to the fused features. Thus, the defect of inaccurate classification of video data in the related art is avoided.

The technical scheme of the application is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.

The application may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic diagram of a method for classifying video according to the present application;

FIG. 2 is a diagram of an overall network architecture for video classification in accordance with the present application;

fig. 3 is a schematic structural diagram of an electronic device for video classification according to the present application;

fig. 4 is a schematic structural diagram of an electronic device for video classification according to the present application.

Detailed Description

Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.

Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.

The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.

Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.

It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.

In addition, the technical solutions of the embodiments of the present application may be combined with each other, but it is necessary to be based on the fact that those skilled in the art can implement the technical solutions, and when the technical solutions are contradictory or cannot be implemented, the combination of the technical solutions should be considered as not existing, and not falling within the scope of protection claimed by the present application.

It should be noted that all directional indicators (such as up, down, left, right, front, and rear are used in the embodiments of the present application) are merely for explaining the relative positional relationship, movement conditions, and the like between the components in a certain specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicators are changed accordingly.

A method for video classification according to an exemplary embodiment of the present application is described below in conjunction with fig. 1-2. It should be noted that the following application scenarios are only shown for facilitating understanding of the spirit and principles of the present application, and embodiments of the present application are not limited in this respect. Rather, embodiments of the application may be applied to any scenario where applicable.

The application also provides a video classification method, a video classification device, electronic equipment and a video classification medium.

Fig. 1 schematically shows a flow diagram of a method of video classification according to an embodiment of the application. As shown in fig. 1, the method includes:

s101, obtaining video data to be classified.

S102, inputting the video data to be classified into an audio and video learning network to obtain image features and audio features corresponding to the video to be classified, and inputting the video data to be classified into a text learning network to obtain text features corresponding to the video to be classified.

S103, inputting the image features, the audio features and the text features into a fusion learning network to obtain fusion feature vectors.

And S104, inputting the fusion feature vector into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified.

With the rapid development of mobile internet technology, a large amount of video data is generated. Including long video as well as short video. Taking short videos as an example, nowadays, more and more mobile internet users show their lives and express their perspectives and moods through the short videos, so that the short videos have strong social properties, and are very easy to induce and spread public opinion.

In one approach. With the development of computer vision, particularly convolutional neural networks (Convolutional Neural Network, CNN), significant progress has been made in image classification tasks, and attempts have been made in the related art to solve the video classification problem using deep learning methods. However, the existing method is difficult to obtain expected effects in the classification practice of short videos. The main reason is that the existing classification method is mostly based on the general video data and uses the public data set for training, and the data distribution in different fields has huge difference. Secondly, the existing classification method is too dependent on image information, and key features of the existing classification method are easily covered by noise. In addition, a large number of comments often exist in short videos, and the existing classification models are mostly unaccounted for or cannot effectively utilize semantic information contained in the short video comments. In addition, in terms of short video data sets, currently available public data sets are basically oriented to the general field, and it is difficult to find a suitable public short video data set for the university field.

In order to solve the problems, the application provides a short video classification method (Multimodal Micro-video Classification Based on Multi-network Structure, MMS) based on a multi-network structure. It can be understood that the method and the device can extract the multi-modal characteristics for joint training aiming at the multi-modal information carried by the video data, such as the characteristic extraction network comprising the image data, the audio data and the text data, and finally, the three types of characteristics are fused, and then, the video classification result is obtained according to the fused characteristics.

Furthermore, the overall structure of the video classification method is shown in fig. 2, so as to extract the image features, the audio features and the text features of the video to be classified. The whole flow of the application comprises an audio and video learning network, a text learning network and a fusion learning network. Wherein the audio video learning network may comprise two sub-networks, namely a visual network and an audio network. The system can also be an audio-video learning network for integrally processing audio-video data. The application is not limited in this regard.

In one mode, the audio/video network or the visual network is responsible for capturing behavior information presented in the video, and proper visual behavior characteristics are extracted by learning a large number of behaviors, so that the method can be used for subsequent multi-mode characteristic fusion and single-mode data processing.

In addition, the audio learning network uses convolution to extract different segmentation waveform characteristics of the audio, and sends the segmentation waveform characteristics as a token to a transducer encoder to obtain audio characteristic vectors. The text network has the deep semantic feature extraction capability of the text information. In the embodiment of the application, the video to be classified can be input into different sub-networks to carry out corresponding feature extraction. The method specifically comprises the steps of inputting video data to be classified into an audio/video learning network to obtain image features and audio features corresponding to the video to be classified, inputting the video data to be classified into a text learning network to obtain text features corresponding to the video to be classified, fusing the three features by a fusion learning network, inputting fused feature vectors into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified.

The method comprises the steps of obtaining video data to be classified, inputting the video data to be classified into an audio-video learning network to obtain image features and audio features corresponding to the video to be classified and text features corresponding to the video to be classified, inputting the image features, the audio features and the text features into a fusion learning network to obtain fusion feature vectors, inputting the fusion feature vectors into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified. By applying the technical scheme of the application, after the video to be classified is acquired, the image features, the audio features and the text features of the video data are obtained by utilizing a preset learning network model, and the three features are fused, so that the classification result of the video to be classified is judged according to the fused features. Thus, the defect of inaccurate classification of video data in the related art is avoided. And taking the planning route with the lowest traffic cost as a navigation route. And the aim of selecting the navigation route with the lowest passing cost for the user from the standards of the passing fee and the passing efficiency of the vehicle is further realized.

Optionally, in another embodiment of the above method according to the present application, inputting the image feature, the audio feature, and the text feature into a fusion learning network to obtain a fusion feature vector, including:

In one mode, the audio and video learning network or the visual learning network is responsible for capturing behavior information presented in the video, and proper visual behavior characteristics are extracted by learning a large number of behaviors, so that the method can be used for subsequent multi-mode characteristic fusion and single-mode data processing. In addition, the audio learning network uses convolution to extract different segmentation waveform characteristics of the audio, and sends the segmentation waveform characteristics as a token to a transducer encoder to obtain audio characteristic vectors. The text network has the deep semantic feature extraction capability of the text information.

Specifically, in the embodiment of the application, the video to be classified can be input into different sub-networks to perform corresponding feature extraction. The method specifically comprises the steps of inputting video data to be classified into an audio/video learning network to obtain image features and audio features corresponding to the video to be classified, inputting the video data to be classified into a text learning network to obtain text features corresponding to the video to be classified, fusing the three features by a fusion learning network, inputting fused feature vectors into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified.

Specifically, for obtaining image features corresponding to a video to be classified, in the embodiment of the application, the video data to be classified may be stored in an RGB frame form first, and in order to reduce the calculation amount, certain preprocessing is performed on the video data to be classified. The method may include uniformly dividing the classified video data into a fixed number of sub video data according to time length, and then randomly taking out a frame from each sub video data as a key frame of the sub video data.

In addition, key frames of the extracted sub video data are arranged according to a time dimension order to form an RGB input sequence with the size of T multiplied by H multiplied by W, and image data to be input is obtained. And inputting the image data to be input into an audio and video learning network to obtain the image characteristics corresponding to the video to be classified.

In one mode, the application can firstly extract an audio waveform file from a video source file, obtain potential audio characterization by using a depth convolution network on the waveform file, and obtain a final feature vector by taking the audio characterization as a special morphological input transducer coder of a text. In one mode, the method can extract the audio features by using a standard wav2vec learning model to obtain the audio features corresponding to the video to be classified.

In addition, for obtaining text features corresponding to the video to be classified, the application needs to perform voice recognition on the video data to be classified to obtain the text to be processed, wherein the text to be processed can be converted from audio data appearing in the video to be classified, and can also be comment information, title information, label information and the like aiming at the video to be classified.

Further, because of social short text such as video titles and comments, there are a number of expressions of abbreviations, harmony notes, emoji, etc. that are completely different from traditional written expressions. In this case, a proper preprocessing operation is critical. In one mode, the application can firstly establish the expression and meaning mapping table to replace the special expression contained in the text with the standard text, then replace letters and abbreviations with the first candidate word in the corresponding result of the hundred-degree input method, and finally convert the obtained standard text information into one-hot vector to be input into a text learning network for deep semantic feature extraction to obtain the feature vector.

Specifically, for the construction of a text network, through intensive investigation on the current natural language processing model, the existing BERT model is found to meet the text processing requirement, so that the improved BERT model is referred to in the aspect of structure of the text network. BERT uses a multi-layer transducer bi-directional encoder to derive text depth semantics by coordinating the contextual representations of all layers. Firstly, word segmentation is carried out on an input text to be converted into a token vector, a special mark is used for identifying the first token of a sentence in a classification task, a discontinuous token sequence is separated by using the special mark, and the token after processing is sent into a BERT model to obtain a vector representation of the sentence. Because bi-directional language models are difficult to train in a conventional left-to-right or right-to-left fashion, the BERT model uses a random masking approach to mask portions of the token of the input sequence and predict the masked portions.

Further, since three feature dimensions extracted from video data are different, semantic information contained in the three features is not completely consistent, which also results in classification by directly using three feature vectors, and the obtained results are likely to be contradictory. In order to alleviate the problem of semantic conflict of different modes, the method can select to input the image features, the audio features and the text features into the fusion learning network to obtain the fusion feature vector, so that the fusion feature vector is input into a Softmax classifier in the follow-up process, and the classification result output by the classifier is used as the classification result of the video to be classified.

Specifically for fusing image features, audio features, and text features, the following steps may be included:

And step 1, processing three feature vectors (image features, audio features and text features) by using a multi-layer perceptron to obtain three vectors with consistent dimensions.

Step 2, carrying out normalization processing on the three vectors;

step 3, adding the three vectors, and performing primary fusion to obtain a first fusion feature vector;

Step 4, calculating a Hadamard product of the three vectors, and reducing errors to obtain a second fusion feature vector;

Step 5, randomly generating a plurality of weight coefficient vectors (for example, 4 weight coefficient vectors);

Step 6, normalizing the weight coefficient vectors to ensure that the element sum of the corresponding positions of the vectors is 1, and obtaining a fusion weight coefficient;

step 7, weighting and summing the first fusion feature vector and the second fusion feature vector by using the fusion weight coefficient to obtain a fusion feature vector;

Step 8, carrying out self-adaptive updating on the fusion weight coefficient by using the loss function to obtain an updated weight coefficient;

Step 9, weighting and summing the plurality of primary fusion feature vectors by using the updated weight coefficient to obtain the fusion feature vector

And 10, inputting the fusion feature vector into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified.

By applying the technical scheme of the application, after the video to be classified is acquired, the image features, the audio features and the text features of the video data are obtained by utilizing a preset learning network model, and the three features are fused, so that the classification result of the video to be classified is judged according to the fused features. Thus, the defect of inaccurate classification of video data in the related art is avoided.

Optionally, in another embodiment of the present application, as shown in fig. 3, the present application further provides a device for video classification. The method comprises the following steps:

an acquisition module 201 configured to acquire video data to be classified;

The generation module 202 is configured to input the video data to be classified into an audio-video learning network to obtain image features and audio features corresponding to the video to be classified;

The fusion module 203 is configured to input the image feature, the audio feature and the text feature into a fusion learning network to obtain a fusion feature vector;

The classification module 204 is configured to input the fusion feature vector to a Softmax classifier, and take a classification result output by the classifier as a classification result of the video to be classified.

In another embodiment of the present application, the acquiring module 201 is configured to perform the steps comprising:

Fig. 4 is a block diagram of a logic structure of an electronic device, according to an example embodiment. For example, electronic device 300 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is provided, for example, a memory including instructions executable by an electronic device processor to perform the method of video classification described above, the method including obtaining video data to be classified, inputting the video data to be classified to an audio-video learning network to obtain image features and audio features corresponding to the video to be classified, inputting the video data to be classified to a text learning network to obtain text features corresponding to the video to be classified, inputting the image features, the audio features and the text features to a fusion learning network to obtain a fusion feature vector, inputting the fusion feature vector to a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified. Optionally, the above instructions may also be executed by a processor of the electronic device to perform the other steps involved in the above-described exemplary embodiments. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

In an exemplary embodiment, there is also provided an application program/computer program product, including one or more instructions executable by a processor of an electronic device to perform the above-mentioned method of video classification, the method including obtaining video data to be classified, inputting the video data to be classified to an audio-video learning network to obtain image features and audio features corresponding to the video to be classified, and inputting the video data to be classified to a text learning network to obtain text features corresponding to the video to be classified, inputting the image features, the audio features and the text features to a fusion learning network to obtain a fusion feature vector, inputting the fusion feature vector to a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified. Optionally, the above instructions may also be executed by a processor of the electronic device to perform the other steps involved in the above-described exemplary embodiments.

Fig. 4 is an example diagram of an electronic device 300. It will be appreciated by those skilled in the art that the schematic diagram 4 is merely an example of the electronic device 300 and is not meant to be limiting of the electronic device 300, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the electronic device 300 may also include input-output devices, network access devices, buses, etc.

The Processor 302 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (DIGITAL SIGNAL Processor, DSP), application SPECIFIC INTEGRATED Circuit (ASIC), field-Programmable gate array (Field-Programmable GATE ARRAY, FPGA) or other Programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor 302 may be any conventional processor or the like, the processor 302 being a control center of the electronic device 300, with various interfaces and lines connecting the various parts of the overall electronic device 300.

The memory 301 may be used to store computer readable instructions 303 and the processor 302 implements the various functions of the electronic device 300 by executing or executing computer readable instructions or modules stored in the memory 301 and invoking data stored in the memory 301. The memory 301 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), etc., and a storage data area that may store data created according to the use of the electronic device 300, etc. In addition, the Memory 301 may include a hard disk, memory, a plug-in hard disk, a smart Memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash Memory card (FLASH CARD), at least one magnetic disk storage device, a flash Memory device, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or other non-volatile/volatile storage device.

The modules integrated with the electronic device 300 may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above-described embodiments, or may be implemented by means of computer readable instructions to instruct related hardware, where the computer readable instructions may be stored in a computer readable storage medium, where the computer readable instructions, when executed by a processor, implement the steps of the method embodiments described above.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method of video classification, comprising:

acquiring video data to be classified;

inputting the fusion feature vector into a Softmax classifier, and taking a classification result output by the classifier as a classification result of the video to be classified;

The method comprises the steps of inputting video data to be classified into a text learning network to obtain text features corresponding to the video to be classified, carrying out voice recognition on the video data to be classified to obtain text to be processed, converting letter fields and expression fields contained in the text to be processed into text fields by using preset conversion rules, converting the text to be processed containing the text fields into one-hot vectors, inputting the one-hot vectors into the text learning network to carry out deep semantic feature extraction to obtain the text features, establishing an expression and meaning mapping table, replacing the expression fields contained in the text to be processed with standard texts, and replacing letters and abbreviations with first candidate words in corresponding results of an input method;

The method comprises the steps of inputting image features, audio features and text features into a fusion learning network to obtain fusion feature vectors, carrying out vector conversion on the image features, the audio features and the text features to obtain image feature vectors, audio feature vectors and text feature vectors, carrying out vector addition on the image feature vectors, the audio feature vectors and the text feature vectors to obtain first fusion feature vectors, carrying out product normalization on the image feature vectors, the audio feature vectors and the text feature vectors to obtain second fusion feature vectors, and obtaining the fusion feature vectors based on the first fusion feature vectors and the second fusion feature vectors, wherein the mode of obtaining the second fusion feature vectors is that Hadamard products are obtained on the image feature vectors, the audio feature vectors and the text feature vectors.

2. The method of claim 1, wherein the deriving the fused feature vector based on the first fused feature vector and the second fused feature vector comprises:

3. The method of claim 2, wherein the weighting and summing the first fused feature vector and the second fused feature vector using the fused weight coefficients to obtain the fused feature vector comprises:

4. The method of claim 1, wherein the inputting the video data to be classified into an audio-video learning network to obtain the image features corresponding to the video to be classified comprises:

5. The method according to claim 1 or 4, wherein the inputting the video data to be classified into an audio-video learning network to obtain the audio features corresponding to the video to be classified includes:

6. An apparatus for video classification, comprising:

The acquisition module is configured to acquire video data to be classified;

The classification module is configured to input the fusion feature vector into a Softmax classifier, and take a classification result output by the classifier as a classification result of the video to be classified;

The generation module is further configured to perform voice recognition on the video data to be classified to obtain a text to be processed, convert letter fields and expression fields contained in the text to be processed into text fields by using a preset conversion rule, convert the text to be processed containing the text fields into one-hot vectors, input the one-hot vectors into the text learning network to perform deep semantic feature extraction to obtain the text features, establish an expression and meaning mapping table, replace the expression fields contained in the text to be processed with standard texts, and replace letters and abbreviations with first candidate words in corresponding results of an input method;

The fusion module is further configured to perform vector conversion on the image feature, the audio feature and the text feature to obtain an image feature vector, an audio feature vector and a text feature vector, perform vector addition on the image feature vector, the audio feature vector and the text feature vector to obtain a first fusion feature vector, perform product normalization on the image feature vector, the audio feature vector and the text feature vector to obtain a second fusion feature vector, and obtain the fusion feature vector based on the first fusion feature vector and the second fusion feature vector, wherein the method for obtaining the second fusion feature vector is to calculate a Hadamard product on the image feature vector, the audio feature vector and the text feature vector.

7. An electronic device, comprising:

A memory for storing executable instructions, and

A processor operable with the memory to execute the executable instructions to perform operations of the method of video classification of any of claims 1-5.

8. A computer readable storage medium storing computer readable instructions which, when executed, perform the operations of the method of video classification of any one of claims 1-5.