CN118154987A - Training and classifying method, device, medium and equipment for dynamic data classifying network - Google Patents

Training and classifying method, device, medium and equipment for dynamic data classifying network

Info

Publication number
CN118154987A
CN118154987A (application CN202410466751.6A)
Authority
CN
China
Prior art keywords
dynamic data
text
classification
visual
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410466751.6A
Other languages
Chinese (zh)
Inventor
方新宇
王炜
解忠乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202410466751.6A
Publication of CN118154987A
Legal status: Pending (current)

Abstract

The embodiment of the invention provides a training and classifying method, device, medium and equipment for a dynamic data classification network. The training method configures the single-modality feature extraction networks to share the same self-attention layer, bridging the gap between the visual and language modalities, and trains the dynamic data classification network by combining a representation learning method with a semi-supervised training method, so that the classification network can accurately acquire feature information that bridges multiple modalities, realize accurate fusion of the multi-modal information contained in the dynamic data, and accurately classify the dynamic data to be classified.

Description

Training and classifying method, device, medium and equipment for dynamic data classifying network
Technical Field
The embodiment of the invention relates to the technical field of information processing, in particular to a training method of a dynamic data classification network, and a dynamic data classification method, device, medium and equipment.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Dynamic data refers to content used to convey a user's immediate emotion, mood, or viewpoint, such as a post on a social media platform in which the user shares personal life through a detailed textual description combined with images or videos. When dynamic data is published, the publishing platform needs to audit its content to ensure content quality, and to improve audit efficiency the dynamic data needs to be classified before audit (also called multi-modal multi-label classification).
In the related art, feature representations of dynamic data are learned by introducing a deep neural network, the internal relations among different modalities are modeled with inter-modality similarity constraints, and label dependencies are captured by introducing supervised training on labeled multi-modal data, thereby realizing multi-modal multi-label classification.
However, because the visual modality and the text modality have different characteristic tendencies, this scheme fuses visual and text information poorly and finds it difficult to use the information of both modalities accurately and comprehensively for data classification, so the model is prone to inaccurate classification; moreover, because the scheme relies on supervised training, the classification effect is poor when the number of supervised samples is small.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a training method for a dynamic data classification network, a dynamic data classification method, devices corresponding to the methods, a storage medium, and a computing device.
In a first aspect of embodiments of the present invention, there is provided a training method of a dynamic data classification network, the dynamic data including visual data and text data, the dynamic data classification network including a feature extraction network and a classifier; the feature extraction network comprises a visual feature extraction network and a text feature extraction network; the method comprises the following steps:
acquiring a first dynamic data sample with a classification label;
Performing feature extraction on the first dynamic data sample by using the visual feature extraction network and the text feature extraction network respectively to obtain corresponding visual features and text features, wherein the visual feature extraction network and the text feature extraction network share network parameters of a self-attention layer;
acquiring image-text matching results of the visual features and the text features, and adjusting network parameters of the feature extraction network according to the image-text matching results and the classification labels;
and according to the characteristic extraction network, performing semi-supervised training on the classifier by using the first dynamic data sample and a second dynamic data sample without a classification label.
Optionally, the feature extraction of the first dynamic data sample by using the visual feature extraction network and the text feature extraction network to obtain corresponding visual features and text features includes:
converting the visual data into target data with a set coding format, and acquiring pre-extracted visual features of the target data by using a contrastive language-image pre-training model; the set coding format is used for converting the visual data into data in the form of text characters;
And inputting the pre-extracted visual features into the visual feature extraction network to obtain the corresponding visual features.
Optionally, each layer of the visual feature extraction network includes at least a first self-attention layer, a cross-attention layer, and a first feedforward neural network; the parameters of the first self-attention layer include a query index; the inputting the pre-extracted visual feature into the visual feature extraction network to obtain the corresponding visual feature includes:
inputting the output result of the first self-attention layer and the pre-extracted visual features to the cross-attention layer for each layer of the visual feature extraction network; the output result is obtained by processing the received query index by the first self-attention layer;
and inputting the output result of the cross-attention layer into the first feedforward neural network, transmitting the output result of the first feedforward neural network to the next layer of the visual feature extraction network, and determining the output result of the first feedforward neural network in the last layer as the corresponding visual feature.
Optionally, the output result of the first self-attention layer includes an output tensor determined based on the query index; the output tensor is determined according to the query vector, the key vector and the value vector corresponding to the query index by using an attention weight calculation formula.
Optionally, in the case that the output tensor is determined according to an attention weight calculation formula, the first self-attention layer is provided with a plurality of attention heads; the output tensor is obtained by concatenating the outputs of the plurality of attention heads corresponding to the query index along the attention-head dimension and processing the concatenation result with a linear transformation matrix.
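For illustration only, the following minimal PyTorch sketch shows how a first self-attention layer driven by a learnable query index might compute such an output tensor with scaled dot-product attention, concatenating the heads and applying a linear projection. All class and parameter names are hypothetical and are not taken from the patent.

```python
from typing import Optional
import torch
import torch.nn as nn

class QuerySelfAttention(nn.Module):
    """Sketch of a first self-attention layer driven by learnable query tokens."""
    def __init__(self, dim: int, num_heads: int, num_queries: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # The "query index": a learnable weight matrix of query tokens.
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, dim))
        self.qkv = nn.Linear(dim, dim * 3)    # projections for Q, K, V
        self.out_proj = nn.Linear(dim, dim)   # linear transformation over concatenated heads

    def forward(self, queries: Optional[torch.Tensor] = None) -> torch.Tensor:
        # The first layer receives the learnable query tokens; subsequent layers
        # receive the previous layer's output instead.
        x = self.query_tokens if queries is None else queries
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = attn @ v                                # (b, heads, n, head_dim)
        out = out.transpose(1, 2).reshape(b, n, d)    # concatenate heads along the head dimension
        return self.out_proj(out)                     # output tensor fed to the cross-attention layer
```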
Optionally, the acquiring the pre-extracted visual features of the target data using a contrastive language-image pre-training model includes:
in response to the visual data being in the form of an image, inputting the first image into the contrastive language-image pre-training model to obtain the pre-extracted visual features;
or, in response to the visual data being in the form of video, acquiring a key frame image of the visual data, and inputting the key frame image into the contrastive language-image pre-training model to obtain the pre-extracted visual features.
Optionally, the obtaining the image-text matching result of the visual feature and the text feature includes:
taking an image-text feature pair as the input of image-text matching, and acquiring an image-text matching prediction result of the image-text feature pair and a prediction result of the classification label to which the image-text pair belongs; the image-text feature pair comprises the visual feature and the text feature belonging to the same dynamic data.
Optionally, for the input of the image-text feature pair as the image-text matching, the method further comprises:
dividing the image-text feature pairs corresponding to the first dynamic data sample into at least three groups;
determining the first group as a positive sample;
replacing the text feature of each image-text feature pair in the second group with a corresponding pseudo text feature, and determining the replaced second group as a first negative sample; the pseudo text feature refers to another text feature in the second group with the highest similarity to the original text feature;
replacing the visual feature of each image-text feature pair in the third group with a corresponding pseudo visual feature, and determining the replaced third group as a second negative sample; the pseudo visual feature refers to another visual feature in the third group with the highest similarity to the original visual feature;
and concatenating the positive sample, the first negative sample and the second negative sample as the input of image-text matching (an illustrative sketch follows).
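As a minimal sketch of this grouping step, assuming aligned visual and text features of the same batch of labeled samples (function and variable names are hypothetical, not from the patent):

```python
import torch
import torch.nn.functional as F

def build_itm_batch(visual_feats: torch.Tensor, text_feats: torch.Tensor):
    """Split aligned image-text feature pairs into three groups and build one
    positive group and two hard-negative groups for image-text matching.

    visual_feats, text_feats: (N, D) features of the same N dynamic data samples.
    """
    idx = torch.randperm(visual_feats.size(0))
    groups = idx.chunk(3)                      # at least three groups

    def most_similar_other(feats: torch.Tensor) -> torch.Tensor:
        # For every feature, pick the most similar *other* feature in the group.
        normed = F.normalize(feats, dim=-1)
        sim = normed @ normed.t()
        sim.fill_diagonal_(float("-inf"))      # exclude the feature itself
        return feats[sim.argmax(dim=-1)]

    # Group 1: kept as-is -> positive pairs.
    pos_v, pos_t = visual_feats[groups[0]], text_feats[groups[0]]
    # Group 2: replace each text feature with its most similar other text feature.
    neg1_v, neg1_t = visual_feats[groups[1]], most_similar_other(text_feats[groups[1]])
    # Group 3: replace each visual feature with its most similar other visual feature.
    neg2_v, neg2_t = most_similar_other(visual_feats[groups[2]]), text_feats[groups[2]]

    # Concatenate positive and negative groups as the image-text matching input.
    v = torch.cat([pos_v, neg1_v, neg2_v], dim=0)
    t = torch.cat([pos_t, neg1_t, neg2_t], dim=0)
    labels = torch.cat([torch.ones(len(groups[0])),
                        torch.zeros(len(groups[1]) + len(groups[2]))]).long()
    return v, t, labels
```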
Optionally, the performing semi-supervised training on the classifier according to the feature extraction network by using the first dynamic data sample and a second dynamic data sample without a classification label includes:
inputting the visual features and the text features of the second dynamic data sample extracted by the feature extraction network into the classifier, and determining the classification output by the classifier as a pseudo classification label of the second dynamic data sample;
Training the classifier by using the first dynamic data sample and a second dynamic data sample with a pseudo classification label, and adjusting network parameters of the classifier by minimizing a preset classifier loss function; the classifier loss function comprises a first loss function of a first dynamic data sample and a second loss function of a second dynamic data sample; the second loss function is correspondingly provided with a control weight; the control weights are used to indicate the weights of the second dynamic data samples for network parameter adjustment of the classifier.
Optionally, the control weight is adjusted according to the classification accuracy of the classifier;
when the classification accuracy is less than a first threshold, the control weight is 0; when the classification accuracy is greater than or equal to the first threshold and less than a second threshold, the control weight increases linearly with the classification accuracy up to a preset hyperparameter; and when the classification accuracy is greater than or equal to the second threshold, the control weight equals the hyperparameter.
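As a minimal illustration (not the patent's exact formula), the piecewise schedule described above can be written as follows; the threshold values and the hyperparameter name are assumptions.

```python
def control_weight(accuracy: float,
                   t1: float = 0.5,    # first threshold (assumed value)
                   t2: float = 0.8,    # second threshold (assumed value)
                   lam: float = 1.0) -> float:  # preset hyperparameter (maximum weight)
    """Weight of the pseudo-labeled loss term as a function of classifier accuracy."""
    if accuracy < t1:
        return 0.0                                 # unlabeled samples ignored
    if accuracy < t2:
        return lam * (accuracy - t1) / (t2 - t1)   # grows linearly up to lam
    return lam                                     # capped at the hyperparameter
```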
Optionally, in the process of performing semi-supervised training on the classifier, after the semi-supervised training is completed, the method further includes:
determining, by using an evaluation set, a target classification category whose classification accuracy is lower than a set threshold and for which any performance evaluation index fails to meet the set index threshold;
determining, according to the data sources of the evaluation set, the classification accuracy of the classifier on each data source;
acquiring a third dynamic data sample from the data source with the highest classification accuracy, and merging the third dynamic data sample into the first dynamic data sample; the third dynamic data sample has a classification label;
and adjusting the loss function of the classifier according to the target classification category, and alternately training the classifier with the first dynamic data sample and the second dynamic data sample.
Optionally, the acquiring a first dynamic data sample includes:
Acquiring first dynamic data with a first classification label from a designated data storage system; the first classification label indicates a preset classification category;
obtaining second dynamic data with a second classification label from an external source, mapping the second classification label to the preset classification categories, and replacing the second classification label with a third classification label corresponding to the mapped classification category;
And generating the first dynamic data sample according to the data distribution of each classification category in the first dynamic data and the second dynamic data.
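A minimal sketch of this sample-construction step, assuming external labels are mapped through a hand-written dictionary onto the platform's preset categories (all names and example labels below are hypothetical):

```python
from collections import Counter

# Hypothetical mapping from external-platform labels to preset categories.
LABEL_MAP = {"street_snap": "fashion", "cover_song": "music", "gameplay": "gaming"}

def build_first_sample(internal: list, external: list) -> list:
    """internal/external: lists of dicts like {"text": ..., "image": ..., "label": ...}."""
    remapped = [{**d, "label": LABEL_MAP[d["label"]]}
                for d in external if d["label"] in LABEL_MAP]
    counts = Counter(d["label"] for d in internal)
    target = max(counts.values())                   # balance towards the largest class
    merged = list(internal)
    for d in remapped:
        if counts[d["label"]] < target:             # only top up under-represented classes
            merged.append(d)
            counts[d["label"]] += 1
    return merged
```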
In a second aspect of the embodiments of the present invention, there is provided a dynamic data classification method, the method comprising:
acquiring dynamic data to be classified; the dynamic data includes visual data and text data;
inputting the dynamic data to be classified into a dynamic data classification network, and obtaining a prediction classification result output by the dynamic data classification network; the dynamic data classification network is obtained by training with the training method of the dynamic data classification network described above;
And determining the classification result of the dynamic data to be classified according to the prediction classification result.
In a third aspect of the embodiments of the present invention, there is provided a training apparatus of a dynamic data classification network, the dynamic data including visual data and text data, the dynamic data classification network including a feature extraction network and a classifier; the feature extraction network comprises a visual feature extraction network and a text feature extraction network; the device comprises:
the sample acquisition module is used for acquiring a first dynamic data sample with a classification label;
the feature extraction module is used for carrying out feature extraction on the first dynamic data sample by utilizing the visual feature extraction network and the text feature extraction network respectively to obtain corresponding visual features and text features, wherein the visual feature extraction network and the text feature extraction network share network parameters of a self-attention layer;
the feature extraction network training module is used for acquiring image-text matching results of the visual features and the text features, and adjusting network parameters of the feature extraction network according to the image-text matching results and the classification labels;
And the classifier training module is used for performing semi-supervised training on the classifier by using the first dynamic data sample and the second dynamic data sample without the classification label according to the characteristic extraction network.
Optionally, the feature extraction module is specifically configured to:
converting the visual data into target data with a set coding format, and acquiring pre-extracted visual features of the target data by using a contrastive language-image pre-training model; the set coding format is used for converting the visual data into data in the form of text characters;
And inputting the pre-extracted visual features into the visual feature extraction network to obtain the corresponding visual features.
Optionally, each layer of the visual feature extraction network includes at least a first self-attention layer, a cross-attention layer, and a first feedforward neural network; the parameters of the first self-attention layer include a query index; the feature extraction module, when configured to input the pre-extracted visual feature to the visual feature extraction network to obtain the corresponding visual feature, may include:
inputting the output result of the first self-attention layer and the pre-extracted visual features to the cross-attention layer for each layer of the visual feature extraction network; the output result is obtained by processing the received query index by the first self-attention layer;
and inputting the output result of the cross-attention layer into the first feedforward neural network, transmitting the output result of the first feedforward neural network to the next layer of the visual feature extraction network, and determining the output result of the first feedforward neural network in the last layer as the corresponding visual feature.
Optionally, the output result of the first self-attention layer includes an output tensor determined based on the query index; the output tensor is determined according to the query vector, the key vector and the value vector corresponding to the query index by using an attention weight calculation formula.
Optionally, the first self-attention layer is provided with a plurality of attention heads; the output tensor is obtained by concatenating the outputs of the plurality of attention heads corresponding to the query index along the attention-head dimension and processing the concatenation result with a linear transformation matrix.
Optionally, the feature extraction module, when used for acquiring pre-extracted visual features of the target data using a contrastive language-image pre-training model, may include:
in response to the visual data being in the form of an image, inputting the first image into the contrastive language-image pre-training model to obtain the pre-extracted visual features;
or, in response to the visual data being in the form of video, acquiring a key frame image of the visual data, and inputting the key frame image into the contrastive language-image pre-training model to obtain the pre-extracted visual features.
Optionally, the feature extraction network training module is specifically configured to:
taking an image-text feature pair as the input of image-text matching, and acquiring an image-text matching prediction result of the image-text feature pair and a prediction result of the classification label to which the image-text pair belongs; the image-text feature pair comprises the visual feature and the text feature belonging to the same dynamic data.
Optionally, the feature extraction network training module is specifically configured to:
dividing the image-text feature pairs corresponding to the first dynamic data sample into at least three groups;
determining the first group as a positive sample;
replacing the text feature of each image-text feature pair in the second group with a corresponding pseudo text feature, and determining the replaced second group as a first negative sample; the pseudo text feature refers to another text feature in the second group with the highest similarity to the original text feature;
replacing the visual feature of each image-text feature pair in the third group with a corresponding pseudo visual feature, and determining the replaced third group as a second negative sample; the pseudo visual feature refers to another visual feature in the third group with the highest similarity to the original visual feature;
and concatenating the positive sample, the first negative sample and the second negative sample as the input of image-text matching.
Optionally, the classifier training module is specifically configured to:
inputting the visual features and the text features of the second dynamic data sample extracted by the feature extraction network into the classifier, and determining the classification output by the classifier as a pseudo classification label of the second dynamic data sample;
Training the classifier by using the first dynamic data sample and a second dynamic data sample with a pseudo classification label, and adjusting network parameters of the classifier by minimizing a preset classifier loss function; the classifier loss function comprises a first loss function of a first dynamic data sample and a second loss function of a second dynamic data sample; the second loss function is correspondingly provided with a control weight; the control weights are used to indicate the weights of the second dynamic data samples for network parameter adjustment of the classifier.
Optionally, the control weight is adjusted according to the classification accuracy of the classifier; when the classification accuracy is less than a first threshold, the control weight is 0; when the classification accuracy is greater than or equal to the first threshold and less than a second threshold, the control weight increases linearly with the classification accuracy up to a preset hyperparameter; and when the classification accuracy is greater than or equal to the second threshold, the control weight equals the hyperparameter.
Optionally, the classifier training module further includes:
determining, by using an evaluation set, a target classification category whose classification accuracy is lower than a set threshold and for which any performance evaluation index fails to meet the set index threshold;
determining, according to the data sources of the evaluation set, the classification accuracy of the classifier on each data source;
acquiring a third dynamic data sample from the data source with the highest classification accuracy, and merging the third dynamic data sample into the first dynamic data sample; the third dynamic data sample has a classification label;
and adjusting the loss function of the classifier according to the target classification category, and alternately training the classifier with the first dynamic data sample and the second dynamic data sample.
Optionally, the sample acquisition module is specifically configured to:
Acquiring first dynamic data with a first classification label from a designated data storage system; the first classification label indicates a preset classification category;
obtaining second dynamic data with a second classification label from an external source, mapping the second classification label to the preset classification categories, and replacing the second classification label with a third classification label corresponding to the mapped classification category;
And generating the first dynamic data sample according to the data distribution of each classification category in the first dynamic data and the second dynamic data.
In a fourth aspect of the embodiments of the present invention, there is provided a dynamic data classification apparatus, the apparatus comprising:
The dynamic data acquisition module is used for acquiring dynamic data to be classified; the dynamic data includes visual data and text data;
The classification module is used for inputting the dynamic data to be classified into a dynamic data classification network and obtaining a prediction classification result output by the dynamic data classification network; the dynamic data classification network is obtained by training with the training method of the dynamic data classification network described above;
and the classification result acquisition module is used for determining the classification result of the dynamic data to be classified according to the prediction classification result.
In a fifth aspect of embodiments of the present invention, there is provided a readable storage medium having stored thereon a computer program which when executed by a processor implements the training method or dynamic data classification method of the dynamic data classification network described above.
In a sixth aspect of embodiments of the present invention, there is provided a computing device, comprising: a processor and a memory; the memory is used for storing a computer program; and the processor is used for executing the training method or the dynamic data classification method of the dynamic data classification network by calling the computer program.
According to the embodiment of the invention, the single-modality features are extracted separately, and the single-modality feature extraction networks are configured to share the same self-attention layer, bridging the gap between the visual and language modalities. Combined with a representation learning method, the dynamic data classification network can accurately acquire feature information that bridges multiple modalities and accurately fuse the multi-modal information. At the same time, a semi-supervised algorithm is used to train an accurate classifier from data with and without classification labels, which alleviates the poor training effect when labeled sample data are scarce, improves training efficiency, and meets the requirements of both the information auditing party and the user.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a network framework diagram of a dynamic data classification network according to an embodiment of the invention;
FIG. 2 schematically illustrates a training method flow diagram of a dynamic data classification network according to an embodiment of the invention;
FIG. 3A schematically illustrates a flow chart for obtaining training samples with classification labels for a dynamic data classification network according to an embodiment of the invention;
FIG. 3B schematically illustrates a flowchart of acquiring training samples with classification tags, as an example of a dynamic data distribution platform, according to an embodiment of the present invention;
FIG. 4 schematically illustrates a flow diagram for acquiring visual features using a visual feature extraction network according to an embodiment of the invention;
FIG. 5A schematically illustrates a network architecture diagram of a feature extraction network of a dynamic data classification network according to an embodiment of the invention;
FIG. 5B schematically illustrates a flow chart for acquiring visual features using a visual feature extraction network according to an embodiment of the invention;
FIG. 6 schematically illustrates a flowchart for obtaining an image-text matching result according to an embodiment of the present invention;
FIG. 7 schematically illustrates a flow chart of a method for semi-supervised training of a classifier according to an embodiment of the present invention;
FIG. 8 schematically illustrates a flow chart of a dynamic data classification method according to an embodiment of the invention;
FIG. 9 schematically illustrates a block diagram of a training apparatus of a dynamic data classification network according to an embodiment of the invention;
FIG. 10 schematically illustrates a block diagram of a dynamic data sorting apparatus according to an embodiment of the present invention;
FIG. 11 schematically illustrates a schematic diagram of a storage medium according to an embodiment of the present invention;
FIG. 12 schematically illustrates a schematic diagram of a computing device according to an embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, apparatus, device, method, or computer program product. Thus, the invention may be embodied in the form of: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiments of the invention, a training method of a dynamic data classification network, a dynamic data classification method, and corresponding devices, a medium and a computing device are provided.
In this document, it should be understood that any number of elements in the drawings is for illustration and not limitation, and that any naming is used only for distinction and not for any limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.
Before introducing the training method of the dynamic data classification network provided by the invention, key terms and background technology related to the invention are briefly described.
Multi-modal: things are expressed or perceived in multiple modalities. The modalities may be homogeneous, such as images taken separately by two cameras, or heterogeneous, such as pictures and text. The mainly studied modalities are usually summarized as the "3Vs", i.e., Verbal (text), Vocal (voice), and Visual; in the present invention, multi-modal refers to the visual and text modalities included in the dynamic data.
Multi-instance learning: a special supervised learning approach applied to a special form of data, i.e., "multi-instance" data, in which one sample is represented as a bag of instances rather than a single instance. The bag contains a plurality of instances, each of which may be a positive or negative example. In a traditional supervised learning task, each sample has an explicit label, while in multi-instance learning the bag has a label and the labels of the instances inside it may be unknown or only partially known. The goal of multi-instance learning is to learn a model from labeled bags that can classify new bags.
Multilayer perceptron (MLP): a feedforward artificial neural network that maps a set of input vectors to a set of output vectors. It consists of multiple layers of nodes, each layer fully connected to the next; except for the input nodes, each node is a neuron with a nonlinear activation function. The MLP is trained with the back-propagation supervised learning algorithm. As a generalization of the perceptron, it overcomes the perceptron's inability to handle linearly inseparable data and can realize nonlinear discrimination.
Semi-supervised learning: a learning paradigm between supervised and unsupervised learning. In semi-supervised learning, model training uses both labeled and unlabeled data. Labeled data are samples with explicit labels, while unlabeled data are samples without them. Semi-supervised learning aims to exploit the information in unlabeled data to effectively expand the labeled data, improving learning performance and helping to improve the generalization capability and performance of the model.
CLIP (Contrastive Language-Image Pre-training): an advanced multi-modal pre-training model developed by OpenAI that uses contrastive learning to jointly understand the associations between text and images. Chinese-CLIP is a multi-modal pre-training model trained on roughly 200 million native Chinese image-text pairs; its training data include Chinese data from the LAION-5B Chinese subset and Wukong, as well as translated image-text data from COCO, Visual Genome and other sources, enabling fast image-text feature extraction and similarity computation, cross-modal retrieval, and zero-shot image classification in the Chinese domain.
Self-attention mechanism: used to focus on different positions in an input sequence and compute the correlation of each position with all other positions; a contextual representation of each position is obtained by a weighted sum according to these correlations.
Cross-attention mechanism: used to process the relationship between two different input sequences; information in one input sequence is correlated with the other input sequence to extract the correlation between the two sequences.
Self-ensembling (self-integration) method: a method in semi-supervised learning that improves model performance by exploiting unlabeled data through repeated training and ensembling; the idea is to train multiple models using part of the labeled data together with the unlabeled data during training, and then ensemble the predictions of these models.
Dynamic data refers to a data form that expresses a user's immediate emotions and ideas through videos or images together with corresponding text descriptions; the videos or images carry visual-modality information, and the text descriptions carry text-modality information. For example, on a social media platform, a user may publish dynamic content containing multiple pictures and related descriptions to show their own life or viewpoints. In the process of dynamic data release, data classification is needed before auditing the content to improve audit efficiency; because dynamic data include information of different modalities, this classification is also called multi-modal multi-label classification.
The method for realizing multi-mode multi-label classification in the related art can comprise the following schemes:
Scheme one: by introducing the feature representation of the deep neural network learning dynamic data, the similarity constraint between modes is used for simulating the internal relation between different modes, and the supervised training of the multi-mode data with the labels is introduced for capturing the label dependence, so that the multi-mode multi-label classification is realized.
However, the visual mode and text mode information fusion effect of the scheme is poor, the scheme has better performance in a VOC2007 (Pascal Visual Object Class ) data set and a LabelMe data set, is suitable for target identification tasks in images, is difficult to accurately and comprehensively utilize information of the two modes for data classification, and has low accuracy of model classification results.
Scheme II: the method comprises the steps of organizing data of different modes into a plurality of packets by using a Multi-example learning idea, organizing the data of different modes into a plurality of packets, wherein each packet comprises a plurality of examples, obtaining consistency representation and prediction of packet levels of the different modes by using a Multi-example packet processing layer by using an end-to-end frame M3DN (Multi-modal Multi-instance Multi-label Deep Network, deep Multi-example Multi-label deep network), simultaneously deducing classification labels of each packet by combining packet level prediction results output by a model based on an optimal transmission theory, and obtaining final classification labels of the whole Multi-mode data by using different strategies such as voting, average values and the like according to the classification labels of each packet.
However, the text mode of the dynamic data comprises data of a single instance, namely descriptive text, and the scheme utilizes multi-instance learning to eliminate the problems of inconsistent performance and ambiguous relation among modes, is not suitable for the data classification task of the dynamic data, and cannot accurately fuse the text mode and the visual mode information, so that the classification result of the dynamic data is low in accuracy.
Scheme III: the multi-modal convolutional neural network-maximum marginal cross-correlation learning method is utilized, correlation among class labels is utilized in a later convolutional layer and a full-connection layer, labels are grouped, each image is expressed as a bag of visual examples, example correlation inside a single image is obtained, the single visual examples are combined to generate a multi-modal example by combining group description, the group description is regarded as a context environment, and the combination of the visual modes and the text modes is realized, so that the multi-modal example and the context environment are utilized, and corresponding multi-label output is obtained based on the correlation of learned characteristics and the labels.
However, the scheme cannot accurately extract advanced features required by each tag alone, and the core of the scheme is that the correlation of a plurality of example objects and the resolution learning of visual similar objects in a single image are low in correlation with text modal information, so that the visual information and the text information cannot be combined accurately.
In summary, based on the scheme provided by the related technology, accurate data classification is difficult to accurately and comprehensively utilize information of different modes, the invention provides a training method of a dynamic data classification network, which utilizes the single-mode feature extraction respectively and sets the single-mode feature extraction to share the same self-attention layer, and the difference between the vision and the language modes is closed, and the dynamic data classification network can accurately acquire the feature information which is closed to a plurality of modes by combining the characterization learning method, so that the accurate fusion of the multi-mode information is realized, and meanwhile, a semi-supervision algorithm is used, and an accurate classification classifier is obtained through the data training with classification labels and without classification labels, so that the problem of poor training effect when the sample data with the classification labels is less is solved, the training efficiency is improved, and the requirements of information auditing parties and users are met.
The invention further provides a multi-modal data classification method based on the dynamic data classification network obtained by the above training method: the dynamic data to be classified are input into the trained dynamic data classification network, and the classification result of the dynamic data is obtained from the prediction classification result output by the network, thereby accurately classifying the dynamic data by using the fused features of the text modality and the visual modality.
Having described the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary method
The invention provides a training method of a dynamic data classification network, which is shown in a network structure in fig. 1, wherein the dynamic data classification network at least comprises a feature extraction network and a classifier, the feature extraction network is used for extracting different modal features from dynamic data, and the classifier is used for taking the output result of the feature extraction network as input and acquiring a classification label of the dynamic data according to the input. The feature extraction network is divided into a visual feature extraction network and a text feature extraction network which are respectively used for extracting corresponding visual features and text features from the dynamic data; the visual feature extraction network and the text feature extraction network comprise a multi-layer neural network structure, and self-attention layers included in the same layer of the two extraction networks share network parameters.
In the invention, the feature extraction network is first trained and its network parameters are adjusted by using a representation learning method; after the feature extraction network training is finished, the classifier is trained in a semi-supervised manner by using a first dynamic data sample with a classification label and a second dynamic data sample without a classification label. As shown in fig. 2, the training method of the dynamic data classification network provided by the present invention may include the following steps:
s201, acquiring a first dynamic data sample with a classification label;
The classification label is a preset mark or label for identifying the category to which the dynamic data sample belongs, and can be a number, a word or other forms of identification; based on the method for training a classification network of dynamic data, the first dynamic data sample is sample data of known classification results, including visual data and text data.
The first dynamic data sample can be obtained from a specified data storage system or obtained from external data with classification labels, such as a web crawler, and a data set of the first dynamic data sample is constructed as a training sample. In the process of acquiring the data with the classification labels, data acquisition can be performed according to the distribution condition of classification categories of the data with the classification labels, so that sample data of each classification category can be balanced.
S202, performing feature extraction on the first dynamic data sample by using the visual feature extraction network and the text feature extraction network respectively to obtain corresponding visual features and text features, wherein the visual feature extraction network and the text feature extraction network share network parameters of a self-attention layer;
The visual feature extraction network is used for extracting corresponding visual features from dynamic data, after the visual feature extraction network is constructed, network parameters of each layer of the visual feature extraction network are initialized, and the network parameters of the self-attention layer of the visual feature extraction network comprise query indexes used for guiding the self-attention layer to learn how to pay attention to different parts in input data. The query index is used as a learnable weight matrix to guide the attention distribution of the network in the self-attention layer and is used for helping the network to dynamically adjust the attention degree of different parts in the input data according to the current task requirement. When initializing network parameters, the query index in the first layer of the visual feature extraction network can use a random number generation method, such as uniformly distributed random numbers or normally distributed random numbers, to generate random vectors with specified dimension and range requirements as the query index, and the query indexes in other layers except the first layer are output results of the previous layer of the visual feature extraction network.
Inputting the first dynamic data sample or visual data included in the first dynamic data sample into the visual feature extraction network to obtain visual features output by the last layer of the visual feature extraction network; alternatively, feature extraction may be performed on the visual data of the first dynamic data sample, where the visual data, such as an image or video in the dynamic data, is converted into a tensor, and then the converted tensor is input to the visual feature extraction network, where the process of converting the visual data into the tensor may be implemented by various libraries and tools, for example, a library NumPy or TensorFlow in Python, or may be implemented by performing feature extraction through a CLIP model. When the CLIP model is used to implement conversion, visual data in the first dynamic data sample may be directly input to the CLIP model, or the visual data may be first converted into a set coding format, such as base64 coding, and then the coded data is input to the CLIP model.
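As an illustrative sketch only (the patent does not prescribe a specific library), the pre-extraction step could look like the following, using the Hugging Face CLIP interface; the model checkpoint, the key-frame strategy, and the base64 round-trip are assumptions.

```python
import base64, io
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pre_extract_visual_features(image_bytes: bytes) -> torch.Tensor:
    # Optionally round-trip through base64 (the "set coding format" of text characters).
    encoded = base64.b64encode(image_bytes)
    image = Image.open(io.BytesIO(base64.b64decode(encoded))).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)     # pre-extracted visual features

# For video, a key frame would be extracted first (e.g. the middle frame of the clip)
# and passed through the same function.
```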
The text feature extraction network is used for learning text semantic information, after the construction of the text feature extraction network and the initialization of network parameters are completed, text data of the first dynamic data sample can be input into the text feature extraction network, or the text data can be spliced into a long text and subjected to word segmentation processing, and the word segmentation processing result is input into the text feature extraction network, so that the text feature extraction network utilizes a self-attention layer and other structural layers such as a feedforward neural network layer to map the input of the text feature extraction network into new text features, and the new text features are sent to the next layer or are output as the text feature extraction network.
In this embodiment, based on the network parameters shared by the self-attention layer of the text feature extraction network and the corresponding self-attention layer of the visual feature extraction network, the network parameters of the self-attention layer can learn the association information and the common feature representation of the visual data and the text data in the training process, and the network parameters can be simultaneously applied to the visual data and the text data, so that the feature extraction network of the dynamic data classification network can learn how to dynamically associate the visual data and the text data with each other in the self-attention layer, thereby realizing effective interaction and fusion between the two.
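To illustrate what sharing the self-attention parameters between the two single-modality extraction networks could look like in practice, here is a minimal PyTorch sketch; it is purely illustrative, and the module names and dimensions are not taken from the patent.

```python
import torch.nn as nn

dim, num_heads = 768, 12

# One self-attention module instance, reused by both modality branches, so its
# weights receive gradients from visual and text data alike.
shared_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

class VisualBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = shared_self_attn                      # shared parameters
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, pre_extracted_visual):
        q, _ = self.self_attn(queries, queries, queries)
        fused, _ = self.cross_attn(q, pre_extracted_visual, pre_extracted_visual)
        return self.ffn(fused)                                 # passed to the next layer

class TextBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = shared_self_attn                      # same instance as in VisualBlock
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens):
        h, _ = self.self_attn(text_tokens, text_tokens, text_tokens)
        return self.ffn(h)
```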
S203, obtaining image-text matching results of the visual features and the text features, and adjusting network parameters of the feature extraction network according to the image-text matching results and the classification labels;
Image-text matching refers to mapping the visual features and the text features into a shared low-dimensional representation space using representation learning, judging in that space whether the visual features and the text features belong to the same subject or category, and outputting the prediction classification labels corresponding to the matched visual and text features. In this embodiment, the image-text matching may be any one of the representation learning tasks of image-text matching, image-text contrastive learning, and image-grounded text generation.
The image-text matching result at least comprises the prediction classification labels corresponding to the matched visual features and text features. Based on the difference between the prediction classification label and the classification label, and on the real matching relation of the visual and text features, the network parameters of the visual feature extraction network and the text feature extraction network are adjusted by minimizing a preset feature extraction loss function with a back-propagation algorithm, until a training convergence condition is reached, e.g. the number of iterations reaches a set threshold or the loss function converges.
The feature extraction loss function may include a loss for correctly matched image and text features and a loss for incorrect matches. The correct-match loss refers to the case where the image feature and the text feature are correctly matched and the prediction classification label is correct, and can use a cross-entropy loss function to measure the difference between the prediction classification label and the true classification label; the mismatch loss refers to the case where the image feature and the text feature are incorrectly matched or the prediction classification label is wrong, and a corresponding loss function can be defined based on the matching score and a threshold.
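A hedged sketch of such a combined loss follows: one cross-entropy term for the binary image-text matching prediction and one for the classification-label prediction. The weighting, the head shapes, and the restriction of the label term to matched pairs are assumptions, not the patent's exact loss.

```python
import torch
import torch.nn.functional as F

def feature_extraction_loss(itm_logits: torch.Tensor,   # (B, 2) match / mismatch scores
                            itm_targets: torch.Tensor,  # (B,) long, 1 = truly matched pair, 0 = not
                            cls_logits: torch.Tensor,   # (B, C) predicted classification labels
                            cls_targets: torch.Tensor,  # (B,) long, ground-truth classification labels
                            alpha: float = 1.0) -> torch.Tensor:
    # Matching loss over positive and negative image-text pairs.
    itm_loss = F.cross_entropy(itm_logits, itm_targets)
    # Classification-label loss, evaluated only on truly matched pairs.
    matched = itm_targets == 1
    cls_loss = F.cross_entropy(cls_logits[matched], cls_targets[matched]) if matched.any() \
        else itm_logits.new_zeros(())
    return itm_loss + alpha * cls_loss
```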
S204, according to the characteristic extraction network, performing semi-supervised training on the classifier by using the first dynamic data sample and the second dynamic data sample without the classification label.
The second dynamic data sample without the classification tag may include data without the classification tag acquired from the designated data storage system, and may further include data after the data with the classification tag is acquired from the outside and subjected to the classification tag removing process, so as to construct the second dynamic data sample.
After the feature extraction network training is completed, the network parameters of the feature extraction network are frozen, the classifier is trained in a semi-supervised learning manner, and the semi-supervised training step is repeated until the performance of the classifier on the validation set converges or the loss function converges to a small value.
The classifier can be a multilayer perceptron whose input is the visual features and text features output by the feature extraction network: the features of the two modalities are concatenated and fed into the multilayer perceptron, the output dimension is set to the preset number of classification categories, and the output is passed through an activation function for normalization to obtain the output classification category. Alternatively, the classifier can be deepened and skip connections added between each encoder and decoder layer of the classifier, relieving the pressure on the higher layers to express details and allowing unsupervised and supervised learning to be combined.
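A minimal sketch of such a classifier head; the hidden size, activation, and multi-label sigmoid output are illustrative assumptions (the 22 categories mirror the fine classifications mentioned later).

```python
import torch
import torch.nn as nn

class DynamicDataClassifier(nn.Module):
    def __init__(self, feat_dim: int = 768, num_classes: int = 22):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 512),   # concatenated visual + text features
            nn.ReLU(),
            nn.Linear(512, num_classes),    # preset number of classification categories
        )

    def forward(self, visual_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([visual_feat, text_feat], dim=-1)   # splice the two modalities
        logits = self.mlp(x)
        return torch.sigmoid(logits)       # normalized multi-label probabilities per category
```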
In semi-supervised learning, the classifier can be initially trained by using a first dynamic data sample with a label in a supervised learning mode so as to initialize parameters of the classifier, and a good parameter basis is provided for subsequent semi-supervised learning.
The semi-supervised learning method may include a pseudo-label method: the classifier's prediction on each unlabeled second dynamic data sample is taken as a pseudo label, the pseudo-labeled second dynamic data samples are combined with the first dynamic data samples, and the classifier is retrained. Self-training, as a specific pseudo-label method, uses a classifier trained on labeled data to predict the unlabeled second dynamic data samples, takes high-confidence predictions as pseudo labels, expands the labeled first dynamic data samples with them, and iterates this process to continuously update the classifier and the pseudo labels.
In the embodiment of the invention, the visual feature extraction network and the text feature extraction network are set up to extract the visual features and the text features respectively, and the two extraction networks are configured to share the network parameters of the self-attention layer, so that these parameters can learn the association information of the visual modality and the text modality simultaneously during training, automatically adjusting the attention paid to different parts to better integrate the information of different modalities while preserving the distinctive features extracted from each modality. The feature extraction network is trained with a representation learning method combined with a loss function, and then a classifier that classifies accurately is obtained with a semi-supervised algorithm from a small amount of data, yielding a dynamic data classification network that can bridge feature information of multiple modalities and realize effective interaction of visual and text features.
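A compact sketch of one such self-training round, assuming a fixed confidence threshold and the hypothetical classifier sketched above; it illustrates the pseudo-label idea rather than the patent's exact procedure.

```python
import torch

def self_training_round(classifier, labeled_loader, unlabeled_loader,
                        optimizer, control_w: float, conf_threshold: float = 0.9):
    bce = torch.nn.BCELoss()
    # 1. Pseudo-label the unlabeled second dynamic data samples.
    pseudo_batches = []
    classifier.eval()
    with torch.no_grad():
        for v_feat, t_feat in unlabeled_loader:
            probs = classifier(v_feat, t_feat)
            keep = probs.max(dim=-1).values >= conf_threshold   # keep high-confidence predictions
            pseudo_batches.append((v_feat[keep], t_feat[keep], (probs[keep] >= 0.5).float()))

    # 2. Retrain on labeled data plus pseudo-labeled data, down-weighting the latter.
    classifier.train()
    for (v_l, t_l, y_l), (v_u, t_u, y_u) in zip(labeled_loader, pseudo_batches):
        loss = bce(classifier(v_l, t_l), y_l) \
             + control_w * bce(classifier(v_u, t_u), y_u)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```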
In addition, the feature extraction network and the classifier are trained separately, so that the whole dynamic data classification network has higher reusability and flexibility, the same feature extraction network can be reused on different tasks, and only the classifier part needs to be adjusted to adapt to the different tasks.
In some alternative embodiments, as shown in FIG. 3A, the first dynamic data sample may be obtained by:
S301, acquiring first dynamic data with a first classification label from a specified data storage system; the first classification label indicates a preset classification category;
The specified data storage system can comprise a data storage system of a business side platform needing dynamic data classification, wherein the data storage system stores dynamic data with classification labels already set and dynamic data without classification labels in a history period; the preset classification class refers to a class set by the service side platform.
For the dynamic data stored in the designated data storage system, the dynamic data sample can be screened and cleaned through a construction rule, such as deleting nonsensical dynamic data or dynamic data with forbidden words, and the dynamic data is divided according to whether the dynamic data has classification labels, the dynamic data with the classification labels is used as part of data of a first dynamic data sample, and the dynamic data without the classification labels is used as part of data of a second dynamic data sample required by semi-supervised training of a classifier.
S302, second dynamic data with a second classification label is obtained from the outside, the second classification label is mapped to the preset classification category, and the second classification label is replaced by a third classification label corresponding to the mapped classification category;
Because the amount of dynamic data with classification labels in the specified data storage system is small, some classification categories suffer from insufficient data or have no labeled dynamic data at all; a crawler or other technology can therefore be used to acquire similar dynamic data from external platforms or devices outside the specified data storage system to supplement the data.
For second dynamic data with second classification labels acquired from outside, the second dynamic data can be screened and cleaned, then the second classification labels are mapped to corresponding preset classification categories in a designated data storage system, and the classification labels of the mapped classification categories are redetermined as the classification labels of the second dynamic data so that the second dynamic data can fall into the preset classification categories.
S303, generating the first dynamic data sample according to the data distribution of each classification category in the first dynamic data and the second dynamic data.
After the first dynamic data and the second dynamic data are obtained, the second dynamic data can be used for supplementing each category according to the category distribution condition of the first classification label of the first dynamic data so as to balance the dynamic data quantity distribution of each classification category, and the dynamic data with the classification label after the classification category supplementation is determined to be the first dynamic data sample so as to be used for training the feature extraction network.
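As a rough illustration of this balancing step, the following Python sketch supplements under-represented classification categories with externally collected, label-mapped data; the function names, data structures and the simple fill-up strategy are illustrative assumptions rather than part of the disclosed method.

```python
# Sketch: balance the per-category sample counts of the labeled in-station data
# by drawing label-mapped external samples for the scarce categories.
from collections import Counter, defaultdict
import random

def balance_samples(in_station, external, target_per_class=None):
    """in_station / external: lists of (sample, label) pairs with mapped labels."""
    counts = Counter(label for _, label in in_station)
    target = target_per_class or max(counts.values())
    pool = defaultdict(list)
    for sample, label in external:
        pool[label].append((sample, label))
    balanced = list(in_station)
    for label, count in counts.items():
        need = max(target - count, 0)            # how many extra samples this category needs
        extra = pool.get(label, [])
        random.shuffle(extra)
        balanced.extend(extra[:need])
    return balanced
```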
In the embodiment of the invention, similar dynamic data are obtained from the outside and mapped to the existing classification labels, so that the workload and cost of manual labeling are reduced, the diversity and richness of an original data set are increased, the problem of unbalanced data is solved, the recognition capability of the model for each class is improved, and the generalization capability and performance of the model are improved.
Based on the above-described content related to the acquisition of the first dynamic data sample and the second dynamic data sample, the present embodiment further exemplifies the acquisition of the data sample with the dynamic data distribution platform.
Referring to the training data set acquisition flow shown in fig. 3B, the acquisition of the first dynamic data sample and the second dynamic data sample is divided into two phases: and (3) acquiring, cleaning and dividing in-station marking data (namely dynamic data samples with classification labels in a dynamic data release platform), constructing evaluation data, and crawling and introducing external data.
The first stage: in-station marking data acquisition and cleaning division:
(1) Constructing anti-garbage anti-blank rules through a background database of the dynamic data release platform, and collecting all dynamic data in a past period of time and dynamic data samples of the artificial labeling classification labels; (2) Screening and cleaning the matched data through a construction rule to obtain enough dynamic data; (3) The data with the label and the data without the label are segmented, and the category distribution of the data with the label is analyzed, so that the follow-up data supplement is facilitated.
For example, referring to the preset classification categories shown in table 1, it is assumed that the classification categories are classified into four large categories and twenty-two fine categories, each of which is assigned to one large category:
TABLE 1
The classification category distribution in the dynamic data release platform refers to that the data with the classification labels in the release platform are divided according to the twenty-two fine classifications, and the number of data samples with the labels on each fine classification is obtained.
And a second stage: evaluation data construction and external data crawling and introduction:
(1) Extracting 5-10 pieces from each fine classification in the station data set with the tag to construct an evaluation data set; (2) Crawling off-site data with labels, namely other social webpages or labeled data of platforms except the dynamic data release platform, by using a crawler technology, and screening and cleaning the crawled data, so that the labels of the off-site data can be mapped into the fine classifications shown in the table 1; (3) And combining classification category distribution of the data set with the tag in the station, supplementing the data outside the station, so that the data quantity among all the fine classification categories is balanced as much as possible, and taking the data with the tag after the classification category supplementation and equalization as a first dynamic data sample.
For the second dynamic data sample without classification labels, the unlabeled in-station data is taken as the main data and supplemented with off-station labeled data samples whose labels have been removed; the data is then cleaned and de-duplicated, so that the second dynamic data sample is constructed.
In some alternative embodiments, classifying the dynamic data according to the present invention is a multi-modal multi-label classification task: the dynamic data includes at least one piece of text information and a plurality of images, and the visual information is relatively complex and difficult to integrate. In order to facilitate analysis of the association between the visual modality and the text modality of the dynamic data and to better fuse the data of different modalities, the aforementioned step S202 of performing feature extraction on the first dynamic data sample by using the visual feature extraction network and the text feature extraction network to obtain the corresponding visual features and text features may be implemented in the following manner, as shown in fig. 4:
S401, converting the visual data into target data with a set coding format, and acquiring pre-extracted visual features of the target data by using a CLIP model; the set coding format is used for converting the visual data into data in the form of text characters;
Converting visual data to data in the form of text characters involves converting pixel information in an image to text data, which may be accomplished by encoding format conversion. The set encoding formats may include, but are not limited to, base64, base32, base85, base91, ASCII85, etc., and the encoding formats may have different applications in different scenarios, and may be selected according to the characteristics of the data and the transmission requirements.
The Base64 code is used to represent any binary data in 64 characters for transmission of text data over a network, and the Base64 code may be selected in this embodiment to convert image data into text character form data for transmission and processing between various systems. When converting the visual data, namely the images or videos in the dynamic data, into the target data of Base64 coding, firstly converting the visual data into binary data, then carrying out Base64 coding on the binary data, and finally obtaining the data in the form of text characters.
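A minimal sketch of this conversion, assuming the visual data is available as an image file on disk (the file name is a placeholder):

```python
# Sketch: convert an image file into Base64 text so the visual data can be
# transmitted and processed in the form of text characters.
import base64

def image_to_base64(path: str) -> str:
    with open(path, "rb") as f:          # read the image as raw binary data
        binary = f.read()
    return base64.b64encode(binary).decode("ascii")  # binary -> Base64 text characters

b64_text = image_to_base64("example.jpg")
```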
When the CLIP model is used to obtain the pre-extracted visual feature of the target data with the set encoding format, if the visual data in the first dynamic data sample is data in an image form (i.e., a plurality of images), all the images can be input into the CLIP model to obtain the pre-extracted visual feature, or the first image can be input into the CLIP model to obtain the pre-extracted visual feature because the first image in the dynamic data has a representative property; and under the condition that the visual data in the first dynamic data sample is data in a video form (namely a video segment), firstly acquiring a key frame image of the visual data, and then inputting the key frame image into the CLIP model to obtain the pre-extraction visual characteristic.
In this embodiment, since the higher-order CLIP model and the base-parameter-version CLIP model achieve the same feature extraction effect on this task while the base-parameter-version model requires less processing time, the CLIP model of the base parameter version, that is, the CLIP model with a smaller number of parameters and a shallower network structure, is adopted to extract features from the visual data.
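For illustration, a pre-extracted visual feature could be obtained from a base-parameter-version CLIP model roughly as follows; the Hugging Face transformers API and the checkpoint name are one possible choice and not a requirement of the method.

```python
# Sketch: pre-extract a visual feature for the first (cover) image of a piece
# of dynamic data with a base-size CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("first_image.jpg")                     # placeholder path to the first image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    pre_visual_feature = model.get_image_features(**inputs)   # tensor of shape [1, feature_dim]
```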
S402, inputting the pre-extracted visual features into the visual feature extraction network to obtain the corresponding visual features.
Before the pre-extracted visual features are input into the visual feature extraction network, the pre-extracted visual features, the correspondingly word-segmented text data and the classification labels of the first dynamic data sample can be aligned, and the aligned pre-extracted visual features and text data are respectively sent into the corresponding feature extraction networks. The alignment operation preprocesses the pre-extracted visual features and the text data so that they have consistent feature representations or input formats meeting the input requirements of the feature extraction networks. For example, scaling, cropping, transformation or normalization operations are performed on the image features to ensure that all image data have similar sizes, color channels and other characteristics, so that images of different sizes or styles can be effectively compared and processed in the visual feature extraction network; and a cropping operation is performed on the text data to limit its length, cutting or padding the text to a fixed length, so that the processed text data has a uniform input size and mismatches or computational difficulties caused by inconsistent text lengths are avoided.
In some embodiments, after the pre-extracted visual feature is obtained, an image mask of all 1s may also be generated according to the pre-extracted visual feature, that is, a matrix that has the same size as the pre-extracted visual feature and whose element values are all 1; the image mask is transmitted as input to the self-attention layer and the cross-attention layer to instruct the visual feature extraction network to pay attention to all feature positions when processing the image features, so that the network parameters of each layer of the feature extraction network are initialized based on the image mask.
In the embodiment of the invention, the visual data is firstly converted into the coding format of the target data, so that the computational complexity can be reduced, the risk of overfitting is reduced, pre-extracted visual features are acquired based on the data after coding, the semantic information of the visual data can be better captured by using rich feature representations learned on large-scale data by a pre-training model, rich feature vectors are extracted, cross-modal learning between texts and vision can be realized, and a data basis is provided for the association between the text description and visual content of the network understanding of the subsequent feature extraction.
In some embodiments, referring to the network structure of the visual feature extraction network shown in fig. 5A, each layer of the visual feature extraction network may include at least a first self-attention layer, a cross-attention layer, and a first feedforward neural network; the learnable parameters corresponding to the first self-attention layer comprise query indexes, wherein the query indexes in the first layer of the network are obtained by random initialization, and the query indexes of other layers except the first layer are output results of the first feedforward neural network of the corresponding previous layer. The query index is used for indicating the attention degree of the self-attention layer learning of the visual feature extraction network to different parts in the input data, and the self-interaction of the self-attention layer can be realized by utilizing the query index and the weight matrix of the self-attention layer, so that each element token of the input of the self-attention layer learns the relevance and self-attention content of other tokens.
After the first dynamic data sample is obtained, the pre-extracted visual features of the visual data in the first dynamic data sample are first obtained, either in the manner of extracting pre-extracted visual features with a CLIP model shown in fig. 4 or in another manner capable of extracting corresponding pre-extracted visual features from the images or videos of the dynamic data, for example by obtaining them with a large language model such as GPT-4. Based on the extracted pre-extracted visual features, the aforementioned step S202 of obtaining the corresponding visual features by using the visual feature extraction network is implemented by inputting the pre-extracted visual features into the visual feature extraction network; based on the structure of the visual feature extraction network, as shown in fig. 5B, the pre-extracted visual features can be input into the network and the visual features obtained by the following steps:
S501, for each layer of the visual feature extraction network, inputting an output result of the first self-attention layer and the pre-extracted visual features into the cross-attention layer; the output result is obtained by processing the received query index by the first self-attention layer;
in this embodiment, the input of the self-attention layer is the query index, the input of the cross-attention layer is the output result of the self-attention layer and the pre-extracted visual feature corresponding to the visual data in the dynamic data sample, the cross-attention layer can extract the correlation between the output result and the pre-extracted visual feature, and the two are input to the cross-attention layer so that the cross-attention layer interacts the two inputs to further learn the visual feature.
The output result of the self-attention layer is obtained according to a query index of the self-attention layer, and the output result can comprise an output tensor determined based on the query index, wherein the output tensor can be obtained according to a query vector, a key vector and a value vector corresponding to the query index by using an attention weight calculation formula, and the corresponding query vector, key vector and value vector are obtained after the query index is subjected to linear change by a weight matrix of the self-attention layer.
The learnable parameters of the self-attention layer at least comprise a query index X and weight matrices W_Q, W_K, W_V, and these learnable parameters are continuously optimized along with the training process. Based on the learnable parameters, the output tensor is the output of the attention weight formula (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) · V    (1)

wherein Q = XW_Q, K = XW_K, V = XW_V, and d_k is the dimension of Q and K, used for scaling the attention. Q represents the query vector, used for measuring the degree of association between the current position and other positions; K represents the key vector, which provides the information of other positions for calculating the attention weight; V represents the value vector containing the information to be multiplied by the attention weight. The self-attention mechanism calculates the attention weight from the correlation of the query vector Q, the key vector K and the value vector V, and the attention weight determines which parts of the input sequence the network should focus on.
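A minimal PyTorch sketch of the scaled dot-product attention in formula (1), where X is the query-index tensor and W_q, W_k, W_v stand in for the learnable projection matrices:

```python
# Sketch of formula (1): scaled dot-product self-attention over the query index X.
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # Q = XW_Q, K = XW_K, V = XW_V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)              # attention weight distribution
    return weights @ V                               # weighted sum over the value vectors
```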
Alternatively, in order to process multiple points of interest of the query index together with past feature information so as to fully understand the input, a multi-head self-attention mechanism may be adopted: a plurality of attention heads are arranged in the first self-attention layer, the input of the self-attention layer is split into a plurality of sub-tensors, and each sub-tensor represents the attention calculation of one head. Each attention head has independently learned Q, K, V weight matrices, and these independently learned weight matrices enable each attention head to pay attention to different parts of the input and learn different representations.
And under the condition that a plurality of attention heads are arranged on the first self-attention layer, the output tensor is obtained by splicing the outputs of the plurality of attention heads corresponding to the query index according to the dimension of the attention heads and processing the splicing result based on a linear change matrix.
That is, when calculating the output result of the self-attention layer, attention weights are calculated for each attention head respectively, and the input values are weighted and summed by using the weights to obtain the output result of each head, and finally the output results of the heads are spliced together to generate the final output tensor.
Referring to the attention weight calculation formula of the i-th attention head head_i shown in the following formula (2), where i takes a value in [1, h], the vectors obtained by linearly transforming the query index with the Q, K, V weight matrices of that attention head are used to calculate the attention weight of the head:

head_i = Attention(XW_i^Q, XW_i^K, XW_i^V)    (2)
After the self-attention weight of each attention head is obtained, according to the output tensor formula (3), the outputs of the attention heads are concatenated along the head dimension and the concatenation result is processed by a linear transformation matrix to generate the output tensor under the multi-head attention mechanism:

MultiHead(X) = Concat(head_1, …, head_h) W_o    (3)

wherein W_o is a learnable linear transformation weight matrix, which can be set according to actual requirements.
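The multi-head computation of formulas (2) and (3) could be sketched as follows; the class, dimensions and layer shapes are illustrative and not part of the original disclosure.

```python
# Sketch of formulas (2)-(3): per-head attention, concatenation along the head
# dimension, and the final linear projection W_o.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)                     # the linear transformation W_o in formula (3)

    def forward(self, x):                                  # x: [batch, tokens, dim]
        b, n, _ = x.shape
        def split(t):                                      # split into h heads of width d
            return t.view(b, n, self.h, self.d).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5
        heads = F.softmax(scores, dim=-1) @ v              # head_i = Attention(XW_i^Q, XW_i^K, XW_i^V)
        concat = heads.transpose(1, 2).reshape(b, n, -1)   # Concat(head_1, ..., head_h)
        return self.w_o(concat)                            # MultiHead(X) = Concat(...) W_o
```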
S502, inputting the output result of the cross attention layer into the first feedforward neural network, transmitting the output result of the first feedforward neural network to the next layer of the visual feature extraction network, and determining the output result of the first feedforward neural network in the last layer as the corresponding visual feature.
After receiving the output result of the first self-attention layer, the cross-attention layer interacts that output result with the frozen pre-extracted visual features; the attention weight calculation of the cross-attention layer can be seen in the calculation process shown in the following formula (4):

CrossAttention(X′, Y) = softmax(Q′K′^T / √d_r) · V′    (4)

wherein X′ is the output result of the first self-attention layer, Y is the pre-extracted visual feature, Q′ = X′W′_Q, K′ = V′ = YW′_K, and d_r, the dimension of the key vector K′, is the parameter in the denominator used for scaling the attention distribution.

The cross-attention layer first calculates the attention score matrix Q′K′^T, scales it by dividing by √d_r, applies the softmax function to the scaled attention score matrix to obtain the attention weight distribution, and performs a weighted sum over the value matrix V′ with this weight distribution to obtain the final cross-attention output. Through the above calculations, the cross-attention layer cross-attends the two inputs.
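A corresponding sketch of the cross-attention in formula (4), assuming queries are taken from the self-attention output X′ and a single shared projection produces the keys and values from the frozen pre-extracted visual features Y:

```python
# Sketch of formula (4): cross-attention between the query tokens and the
# pre-extracted visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)     # W'_Q applied to the self-attention output
        self.w_kv = nn.Linear(dim, dim)    # W'_K shared by keys and values (K' = V')

    def forward(self, x_prime, y):         # x_prime: self-attention output, y: pre-extracted visual features
        q = self.w_q(x_prime)
        k = v = self.w_kv(y)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # Q'K'^T / sqrt(d_r)
        return F.softmax(scores, dim=-1) @ v                   # weighted sum over the visual features
```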
The first feedforward neural network is used to perform a nonlinear transformation on the context representation of each position; by stacking a plurality of encoder layers, high-level abstract features of the input can be extracted layer by layer, and the feedforward neural network comprises at least a fully connected layer and an activation function. The output result of the cross-attention layer is used as the input of the first feedforward neural network, and after processing by the feedforward neural network the output result is mapped into a new query index, which is either sent to the next layer or used as the visual feature output by the visual feature extraction network for the subsequent training tasks.
In the embodiment of the invention, based on the parameter sharing of the self-attention layer in the same layer of the visual feature extraction network and the text feature extraction network, the self-attention layer in the training process can learn visual and text information at the same time, the input of the self-attention layer of the visual feature extraction network is set as the query index, the input of the cross-attention layer is the output result of the self-attention layer and the pre-extraction visual feature, so that the visual feature can receive the guidance and adjustment of the text information, the relevance of the visual and text is enhanced, the obtained visual feature has better characterization capability, and the understanding capability of the feature extraction network on multi-mode tasks is enhanced.
In some alternative embodiments, the network structure of the text feature extraction network shown in fig. 5A may further include at least a second self-attention layer and a second feedforward neural network layer, where the second self-attention layer shares network parameters of a first self-attention layer in a corresponding layer of the visual feature extraction network; based on the structure of the text feature extraction network, the text feature extraction network used in the step S202 is used to obtain the corresponding text feature, which may be implemented in the following manner:
the text information is spliced into a long text and word segmentation is carried out, and a word segmentation result is obtained;
Inputting the word segmentation result to the second self-attention layer, so that the second self-attention layer determines an output result by using an attention weight calculation formula according to a query vector, a key vector and a value vector corresponding to the word segmentation result and transmits the output result to the second feedforward neural network;
and transmitting the output result of the second feedforward neural network to a second self-attention layer of a next layer of the text feature extraction network, and taking the output result of the second feedforward neural network in a last layer of the text feature extraction network as the text feature.
In the embodiment of the invention, the combined training of the visual mode and the text mode is realized by introducing the network parameters of the second self-attention layer and the shared self-attention layer, so that the text feature extraction network is more suitable for complex semantic tasks, and the performance and generalization capability of the model are improved.
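As a rough illustration of how the two branches share self-attention parameters, corresponding layers may simply reuse one attention module instance; the MultiHeadSelfAttention class is the sketch shown earlier and the feedforward widths are arbitrary assumptions.

```python
# Sketch: one layer of the dual-branch feature extractor in which the visual
# and text self-attention layers share a single set of parameters.
import torch.nn as nn

class SharedAttentionLayerPair(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        shared_attn = MultiHeadSelfAttention(dim, num_heads)   # one parameter set for both branches
        self.visual_self_attn = shared_attn                    # first self-attention layer
        self.text_self_attn = shared_attn                      # second self-attention layer, same weights
        self.visual_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.text_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
```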
In some alternative embodiments, the obtaining the image-text matching result of the visual feature and the text feature in the step S203 may be implemented by:
taking the image-text feature pair as the input of image-text matching, and acquiring an image-text matching prediction result of the image-text feature pair and a classification label prediction result to which the image-text belongs; the image-text feature pair comprises the visual feature and the text feature belonging to the same dynamic data.
That is, for each piece of dynamic data in the first dynamic data sample, the visual feature and the corresponding text feature form an image-text feature pair, and the image-text feature pair is used as the input of image-text matching. Image-text matching predicts whether the visual feature and the text feature in the input pair match and outputs the corresponding matching prediction result; the visual feature and text feature that are predicted to match are then treated as a whole for predicting the classification label to which they belong, and the corresponding classification label prediction result is output, so that the loss function is calculated according to the actual matching situation and the actual classification label result.
Or in some alternative embodiments, as shown in fig. 6, the obtaining the image-text matching result of the visual feature and the text feature in the foregoing step S203 may be further implemented by the following manner:
s601, dividing the image-text feature pairs corresponding to the first dynamic data sample into at least three groups;
Dividing the first dynamic data sample into at least three batches, wherein each batch comprises a certain number of dynamic data samples, and the image-text characteristic pairs corresponding to the dynamic data samples in the batches form a group.
S602, determining the first group as a positive sample;
Based on the image-text feature pairs including the visual feature and the text feature belonging to the same dynamic data, each image-text feature pair in any one group is corresponding matching of the visual feature and the text feature, and any group matched correspondingly is determined to be a positive sample.
S603, replacing text features of each graphic feature pair in the second group with corresponding pseudo text features, and determining the replaced second group as a first negative sample; the pseudo text feature refers to another text feature with the highest similarity with the text feature in the second group;
S604, replacing the visual features of each graphic feature pair in the third group with corresponding pseudo visual features, and determining the replaced third group as a second negative sample; the pseudo-visual feature refers to another visual feature in the third group that has the highest similarity to the visual feature;
For the rest groups except the first group, for each graphic feature pair included in one half of the rest groups, the matched text feature is replaced by another text feature with the highest similarity with the text feature, and for each graphic feature pair included in the other half of the rest groups, the matched image feature is replaced by another image feature with the highest similarity with the image feature, so that the negative sample group is obtained.
And S605, splicing the positive sample, the first negative sample and the second negative sample to be used as input of image-text matching.
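One way to realize steps S601-S605 is sketched below, using cosine similarity as an assumed similarity measure and equal group sizes for simplicity; the most similar other feature inside a group serves as the pseudo feature.

```python
# Sketch: build the positive group and the two pseudo-negative groups that are
# spliced together as the input of image-text matching.
import torch
import torch.nn.functional as F

def most_similar_other(features):
    """For each feature, return the index of the most similar *other* feature in its group."""
    sim = F.cosine_similarity(features.unsqueeze(1), features.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(-1.0)                 # a feature must not be matched with itself
    return sim.argmax(dim=1)

def build_matching_input(vis, txt):
    """vis, txt: aligned [3n, d] visual / text features of one batch, split into three groups."""
    n = vis.size(0) // 3
    g2_txt, g3_vis = txt[n:2 * n], vis[2 * n:3 * n]
    pseudo_txt = g2_txt[most_similar_other(g2_txt)]       # replace text features of the second group
    pseudo_vis = g3_vis[most_similar_other(g3_vis)]       # replace visual features of the third group
    all_vis = torch.cat([vis[:n], vis[n:2 * n], pseudo_vis], dim=0)
    all_txt = torch.cat([txt[:n], pseudo_txt, txt[2 * n:3 * n]], dim=0)
    match_labels = torch.cat([torch.ones(n), torch.zeros(2 * n)])   # 1 = matched pair, 0 = mismatched
    return all_vis, all_txt, match_labels
```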
After the image-text matching result is obtained in any mode, the cross entropy loss function calculation can be performed on the classification label prediction result output by image-text matching and the real classification label, and the network parameters of the feature extraction network are adjusted through a back propagation algorithm so as to minimize the loss function.
It will be appreciated that for the two embodiments of obtaining the image-text matching result, at least one of the two embodiments may be used to obtain the image-text matching result to complete training of the feature extraction network, i.e. only one of the two embodiments may be used or both may be combined.
In the embodiment of the invention, the image-text feature pairs corresponding to the first dynamic data sample are divided into different groups, and pseudo text features and pseudo visual features with high similarity are introduced as negative samples; the network parameters of the feature extraction network are adjusted based on the image-text matching results of these samples, which guides the feature extraction network to accurately learn the association between text features and visual features, enables the network to better capture the semantic relationship between text and images, and improves the expressive power of the features. Meanwhile, introducing different types of negative samples can effectively improve the generalization ability of the model and reduce the risk of overfitting.
In some alternative embodiments, for the foregoing step S204, in which the classifier is semi-supervised trained, according to the feature extraction network, using the first dynamic data sample and the second dynamic data sample without classification labels, the classifier may be initialized in advance before the semi-supervised training, either by random initialization or by preliminary training with the dynamic data samples having classification labels. After the initialization is completed, the semi-supervised training may be implemented by the following steps, as shown in fig. 7:
s701, inputting visual features and text features of the second dynamic data sample extracted by the feature extraction network into the classifier, and determining the classification output by the classifier as a pseudo classification label of the second dynamic data sample;
based on the second dynamic data sample not having a classification tag, a pseudo classification tag is generated using the classifier on the predicted result of the second dynamic data sample.
Referring to the pseudo classification label marking method shown in the following formula (5), the classification category with the highest probability output by the classifier can be determined as the pseudo classification label of the second dynamic data sample:

ŷ = argmax_i f_i(x)    (5)

where f_i(x) represents the probability, output by the classifier, that the sample is predicted to belong to the i-th class label.
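A minimal sketch of this labelling rule from formula (5), assuming the classifier outputs one logit per classification category:

```python
# Sketch of formula (5): take the highest-probability class as the pseudo classification label.
import torch

def pseudo_labels(logits):
    probs = torch.softmax(logits, dim=-1)    # f_i(x): predicted probability of each class
    return probs.argmax(dim=-1)              # argmax_i f_i(x)
```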
Alternatively, for the generation of pseudo labels, a self-ensembling method can be used, in which the predictions of the same model under different iteration periods and under different data augmentation and regularization conditions are ensembled to construct better pseudo labels.
S702, training the classifier by using the first dynamic data sample and a second dynamic data sample with a pseudo classification label, and adjusting network parameters of the classifier by minimizing a preset classifier loss function; the classifier loss function comprises a first loss function of a first dynamic data sample and a second loss function of a second dynamic data sample; the second loss function is correspondingly provided with a control weight; the control weights are used to indicate the weights of the second dynamic data samples for network parameter adjustment of the classifier.
After obtaining a second dynamic data sample with a pseudo-classification label, inputting the first dynamic data sample and the second dynamic data sample into the feature extraction network to obtain corresponding visual features and text features, splicing the visual features and the text features, and then sending the spliced visual features and text features into the classifier, respectively calculating a first loss function and a second loss function according to the output result of the classifier, and adjusting network parameters of the classifier by minimizing the loss function.
In this embodiment, the idea of cross-entropy regularization may be applied, in which the second dynamic data sample without a classification label is converted into a regularization term of the classifier loss function; that is, the data with pseudo classification labels is directly treated as data with classification labels, cross entropy is used to evaluate the error, and the errors of the two parts of data are combined through a weight-controlling parameter. For example, the classifier loss function may be as shown in formula (6):

L = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} L(y_i^c, f_i^c) + α(t) · (1/n′) Σ_{i=1}^{n′} Σ_{c=1}^{C} L(y′_i^c, f′_i^c)    (6)

wherein n is the batch size of the first dynamic data sample with classification labels, n′ is the batch size of the second dynamic data sample with pseudo classification labels, C is the number of output classes, and α(t) is the control weight corresponding to the data samples without classification labels. The formula can be divided into two parts: the left side of the plus sign is the first loss function over the labeled data, and the right side of the plus sign is the second loss function over the pseudo-labeled data. The specific loss function used is the cross-entropy loss function, as shown in formula (7):
L(y_i, f_i) = −y_i log f_i − (1 − y_i) log(1 − f_i)    (7)
The control weight determines how much the cost of the unlabeled data affects the network update. When the classifier is initially introduced, its classification accuracy is low and the generated pseudo classification labels are noisy, so the control weight needs to be kept at 0; as training proceeds, the classification accuracy continuously improves, and the control weight can be adjusted according to the classification accuracy. For example, as shown in the following formula (8), the control weight changes linearly with the classification accuracy:

α(t) = 0,                          if T < T1
α(t) = α_f · (T − T1) / (T2 − T1), if T1 ≤ T < T2
α(t) = α_f,                        if T ≥ T2        (8)

where the variable T represents the classification accuracy of the classifier, T1 and T2 are the first threshold and the second threshold of the classification accuracy, the first threshold being smaller than the second threshold, and α_f is a preset hyper-parameter whose value may be 0.5.

That is, when the classification accuracy is smaller than the first threshold, the control weight takes the value 0; when the classification accuracy is greater than or equal to the first threshold and smaller than the second threshold, the control weight increases linearly with the classification accuracy, with its maximum value being the preset hyper-parameter; and when the classification accuracy is greater than or equal to the second threshold, the control weight takes the value of the hyper-parameter.
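The combined loss of formulas (6)-(8) could be sketched as follows; the threshold values passed in are illustrative defaults, and the standard multi-class cross-entropy is used here in place of the per-class form of formula (7) for brevity.

```python
# Sketch of formulas (6)-(8): labelled loss plus pseudo-labelled loss weighted
# by the ramped control weight alpha(t).
import torch
import torch.nn.functional as F

def control_weight(accuracy, t1, t2, alpha_f=0.5):
    if accuracy < t1:                                     # noisy early stage: ignore pseudo labels
        return 0.0
    if accuracy < t2:                                     # linear ramp between T1 and T2
        return alpha_f * (accuracy - t1) / (t2 - t1)
    return alpha_f                                        # plateau at the hyper-parameter alpha_f

def classifier_loss(logits_labeled, y_labeled, logits_pseudo, y_pseudo, accuracy, t1=0.5, t2=0.8):
    loss_labeled = F.cross_entropy(logits_labeled, y_labeled)   # first loss: labelled samples
    loss_pseudo = F.cross_entropy(logits_pseudo, y_pseudo)      # second loss: pseudo-labelled samples
    return loss_labeled + control_weight(accuracy, t1, t2) * loss_pseudo
```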
In the embodiment of the invention, a second dynamic data sample without a classification label is converted into data with a pseudo classification label by utilizing a cross entropy regularization idea, so that the data quantity required to be marked manually is reduced, the labor cost and the time cost are saved, the scale of a training data set is enlarged, the data with the pseudo classification label is directly regarded as the data with the classification label, the cross entropy is used for evaluating the error size, the influence degree of the pseudo classification label data is adjusted by controlling a weight parameter, the loss proportion of the pseudo label data is continuously enhanced, the classifier is guided to improve the training accuracy, the noise influence is reduced, and a classifier with higher relative accuracy can be trained by less marking data; meanwhile, the training mode of combining the label data and the label-free data can reduce the risk of overfitting and improve the accuracy and the robustness of classification.
In some alternative embodiments, after the training in the step S702 is finished, after performing the performance evaluation of the classifier by using the evaluation set, if there is a performance evaluation index that does not meet the set index threshold, the method may further include the following additional training steps, where the steps include:
(1) Determining a target classification category of which the classification accuracy is lower than a set threshold value and any performance evaluation index does not meet the set index threshold value by using an evaluation set; (2) According to the data sources of the evaluation set, respectively determining the classification accuracy of each data source on the classifier; (3) Acquiring a third dynamic data sample which accords with the data source with the highest classification accuracy corresponding to the data source, and merging the third dynamic data sample into the first dynamic data sample; the third dynamic data sample has a classification tag; (4) And adjusting the loss function of the classifier according to the target classification category, and alternately training the classifier according to the first dynamic data sample and the second dynamic data sample.
That is, for each classification category whose classification accuracy is lower than the set threshold, the performance evaluation indexes of that category are calculated, where the performance evaluation indexes may include at least one of an accuracy rate, a recall rate and an F1 score; meanwhile, the data sources of the evaluation set are used as different dimensions to calculate the overall accuracy of each data source on the classifier, where the data sources may include in-station (the specified data storage system) supervised data (evaluation data with classification labels), off-station (externally acquired) supervised data, and combined in-station and off-station supervised data.
When the multi-dimensional performance evaluation indexes of the classifier do not reach the standard, the training set is resampled. A high classification accuracy for a data source indicates that the data from that source is better separated by class; therefore, the first dynamic data sample with classification labels is supplemented by introducing data of the source type with the highest overall accuracy. Meanwhile, the classification categories whose classification accuracy is lower than the set threshold are taken into account as part of the loss function in classifier training, and the classifier is further trained based on the supplemented first dynamic data sample and the second dynamic data sample, so that the classification accuracy is further improved.
In the embodiment of the invention, the classification effect of the classifier is evaluated by adjusting the classification category distribution and combining the multi-source multi-dimension index, the classifier is further trained by resampling according to the evaluation result, and the classification category with relatively poor performance is taken into consideration as a part of the loss function in the training process, so that the classification accuracy is higher and more accords with the real situation.
Based on the training method of the dynamic data classification network in the foregoing embodiment, the present invention further provides a dynamic data classification method, which is applicable to various classification scenarios of data to be classified including visual data and text data, as shown in fig. 8, and the classification method may include the following steps:
s801, acquiring dynamic data to be classified; the dynamic data includes visual data and text data;
S802, inputting the dynamic data to be classified into a dynamic data classification network, and obtaining a prediction classification result output by the dynamic data classification network; the dynamic data classification network is obtained by training the training method of the dynamic data classification network;
s803, determining a classification result of the dynamic data to be classified according to the prediction classification result.
In the embodiment of the invention, the classifying network trained based on the training method of the dynamic data classifying network can accurately extract and characterize the multi-mode characteristics of the dynamic data, the dynamic data to be classified is classified by using the classifying network, and the dynamic data can be rapidly and accurately classified, so that the requirements of information auditors and users are better met, the user experience and satisfaction are improved, and more effective support is provided for applications such as information audit.
The execution order of the steps in the above-described illustrated flow is not limited to the order in the flow chart. Furthermore, the descriptions of the individual steps may be implemented in the form of software, hardware, or a combination thereof, for example, those skilled in the art may implement them in the form of software code, or may be computer-executable instructions capable of implementing the logic functions corresponding to the steps. When implemented in software, the executable instructions may be stored in memory and executed by a processor in the system.
Exemplary apparatus
Having described the method of an exemplary embodiment of the present invention, an apparatus of an exemplary embodiment of the present invention is described next with reference to fig. 9.
The implementation process of the functions and roles of each module in the following device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein. For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments.
FIG. 9 schematically illustrates a training apparatus of a dynamic data classification network comprising visual data and text data, the dynamic data classification network comprising a feature extraction network and a classifier, according to an embodiment of the invention; the feature extraction network comprises a visual feature extraction network and a text feature extraction network; the device comprises:
a sample acquisition module 910, configured to acquire a first dynamic data sample having a classification tag;
The feature extraction module 920 is configured to perform feature extraction on the first dynamic data sample by using the visual feature extraction network and the text feature extraction network to obtain a corresponding visual feature and a corresponding text feature, where the visual feature extraction network and the text feature extraction network share network parameters of a self-attention layer;
The feature extraction network training module 930 is configured to obtain an image-text matching result of the visual feature and the text feature, and adjust a network parameter of the feature extraction network according to the image-text matching result and the classification label;
and the classifier training module 940 is configured to perform semi-supervised training on the classifier according to the feature extraction network by using the first dynamic data sample and a second dynamic data sample without a classification label.
In some alternative embodiments, the feature extraction module is specifically configured to:
converting the visual data into target data with a set coding format, and acquiring pre-extracted visual features of the target data by using a contrast language-image pre-training model; the set coding format is used for converting the visual data into data in the form of text characters;
And inputting the pre-extracted visual features into the visual feature extraction network to obtain the corresponding visual features.
In some alternative embodiments, each layer of the visual feature extraction network includes at least a first self-attention layer, a cross-attention layer, and a first feedforward neural network; the parameters of the first self-attention layer include a query index; the feature extraction module, when configured to input the pre-extracted visual feature to the visual feature extraction network to obtain the corresponding visual feature, may include:
inputting the output result of the first self-attention layer and the pre-extracted visual features to the cross-attention layer for each layer of the visual feature extraction network; the output result is obtained by processing the received query index by the first self-attention layer;
and inputting the output result of the cross attention layer into the first feedforward neural network, transmitting the output result of the first feedforward neural network to the next layer of the visual characteristic extraction network, and determining the output result of the first feedforward neural network in the last layer as the corresponding visual characteristic.
In some alternative embodiments, the output results of the first self-attention layer include an output tensor determined based on the query index; the output tensor is determined according to the query vector, the key vector and the value vector corresponding to the query index by using an attention weight calculation formula.
In some alternative embodiments, the first self-attention layer is provided with a plurality of attention heads; the output tensor is obtained by splicing the outputs of the plurality of attention heads corresponding to the query index according to the dimension of the attention heads and processing the splicing result based on a linear change matrix.
In some alternative embodiments, the feature extraction module, when used to obtain pre-extracted visual features of the target data using a comparative language-image pre-training model, may comprise:
Responding to the visual data in the form of images, inputting a first image into the contrast language-image pre-training model to obtain the pre-extracted visual features;
or responding to the visual data in the form of video, and acquiring a key frame image of the visual data; and inputting the key frame image into the contrast language-image pre-training model to obtain the pre-extraction visual characteristics.
In some alternative embodiments, the feature extraction network training module is specifically configured to:
taking the image-text feature pair as the input of image-text matching, and acquiring an image-text matching prediction result of the image-text feature pair and a classification label prediction result to which the image-text belongs; the image-text feature pair comprises the visual feature and the text feature belonging to the same dynamic data.
In some alternative embodiments, the feature extraction network training module is specifically configured to:
dividing the image-text feature pairs corresponding to the first dynamic data sample into at least three groups;
determining the first packet as a positive sample;
Replacing text features of each graphic feature pair in the second group with corresponding pseudo text features, and determining the replaced second group as a first negative sample; the pseudo text feature refers to another text feature with the highest similarity with the text feature in the second group;
Replacing the visual features of each graphic feature pair in the third group with corresponding pseudo visual features, and determining the replaced third group as a second negative sample; the pseudo-visual feature refers to another visual feature in the third group that has the highest similarity to the visual feature;
And splicing the positive sample, the first negative sample and the second negative sample to be used as input of image-text matching.
In some alternative embodiments, the classifier training module is specifically configured to:
inputting the visual features and the text features of the second dynamic data sample extracted by the feature extraction network into the classifier, and determining the classification output by the classifier as a pseudo classification label of the second dynamic data sample;
Training the classifier by using the first dynamic data sample and a second dynamic data sample with a pseudo classification label, and adjusting network parameters of the classifier by minimizing a preset classifier loss function; the classifier loss function comprises a first loss function of a first dynamic data sample and a second loss function of a second dynamic data sample; the second loss function is correspondingly provided with a control weight; the control weights are used to indicate the weights of the second dynamic data samples for network parameter adjustment of the classifier.
In some alternative embodiments, the control weights are adjusted according to the classification accuracy of the classifier; under the condition that the classification accuracy is smaller than a first threshold value, the control weight value is 0; under the condition that the classification accuracy is larger than or equal to a first threshold value and smaller than a second threshold value, the control weight value linearly increases along with the classification accuracy, and the maximum value is a preset super-parameter; and under the condition that the classification accuracy is greater than or equal to a second threshold value, the control weight takes the value as the super parameter.
In some alternative embodiments, the classifier training module further comprises:
Determining a target classification category of which the classification accuracy is lower than a set threshold value and any performance evaluation index does not meet the set index threshold value by using an evaluation set;
according to the data sources of the evaluation set, respectively determining the classification accuracy of each data source on the classifier;
Acquiring a third dynamic data sample which accords with the data source with the highest classification accuracy corresponding to the data source, and merging the third dynamic data sample into the first dynamic data sample; the third dynamic data sample has a classification tag;
and adjusting the loss function of the classifier according to the target classification category, and alternately training the classifier according to the first dynamic data sample and the second dynamic data sample.
In some alternative embodiments, the sample acquisition module is specifically configured to:
Acquiring first dynamic data with a first classification label from a designated data storage system; the first classification label indicates a preset classification category;
Obtaining second dynamic data with a second classification label from the outside, mapping the second classification label to the preset classification category, and replacing the second classification label with a third classification label corresponding to the mapped classification category;
And generating the first dynamic data sample according to the data distribution of each classification category in the first dynamic data and the second dynamic data.
Referring to fig. 10, the present invention further provides a dynamic data classification device, which includes:
a dynamic data acquisition module 1010, configured to acquire dynamic data to be classified; the dynamic data includes visual data and text data;
The classification module 1020 is configured to input the dynamic data to be classified into a dynamic data classification network, and obtain a prediction classification result output by the dynamic data classification network; the dynamic data classification network is obtained by training the training method of the dynamic data classification network;
And the classification result obtaining module 1030 is configured to determine a classification result of the dynamic data to be classified according to the prediction classification result.
Exemplary Medium
Having described the method and apparatus of an exemplary embodiment of the present invention, a readable storage medium of an exemplary embodiment of the present invention is described next with reference to fig. 11.
In the present exemplary embodiment, the above-described method may be implemented by a program product, such as a portable compact disc read only memory (CD-ROM) and including program code, and may be run on a device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the C programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Exemplary computing device
Having described the method, apparatus, and medium of the exemplary embodiments of the present invention, a computing device of the exemplary embodiments of the present invention is next described with reference to fig. 12.
The computing device 1200 shown in fig. 12 is merely an example, and should not be taken as limiting the functionality and scope of use of embodiments of the invention.
As shown in fig. 12, the computing device 1200 is in the form of a general purpose computing device. Components of computing device 1200 may include, but are not limited to: the at least one processing unit 1201, the at least one memory unit 1202, and a bus 1203 connecting the different system components (including the processing unit 1201 and the memory unit 1202).
Bus 1203 includes a data bus, a control bus, and an address bus.
The storage unit 1202 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 12021 and/or cache memory 12022, and may further include readable media in the form of nonvolatile memory, such as Read Only Memory (ROM) 12023.
The storage unit 1202 may also include a program/utility 12025 having a set (at least one) of program modules 12024, such program modules 12024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The computing device 1200 may also communicate with one or more external devices 1204 (e.g., keyboard, pointing device, etc.).
Such communication may occur through an input/output (I/O) interface 1205. Moreover, computing device 1200 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 1206. As shown in fig. 12, network adapter 1206 communicates with other modules of computing device 1200 via bus 1203. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computing device 1200, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that while in the above detailed description reference is made to several units/modules or sub-units/modules of a training and dynamic data classification device of a dynamic data classification network, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present invention. Conversely, the features and functions of one unit/module described above may be further divided into ones that are embodied by a plurality of units/modules.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this is not required or suggested that these operations must be performed in this particular order or that all of the illustrated operations must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features of these aspects cannot be used to advantage in combination; such division is merely for convenience of description. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

CN202410466751.6A | 2024-04-17 | 2024-04-17 | Training and classifying method, device, medium and equipment for dynamic data classifying network | Pending | CN118154987A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410466751.6A | 2024-04-17 | 2024-04-17 | CN118154987A (en): Training and classifying method, device, medium and equipment for dynamic data classifying network

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410466751.6A | 2024-04-17 | 2024-04-17 | CN118154987A (en): Training and classifying method, device, medium and equipment for dynamic data classifying network

Publications (1)

Publication Number | Publication Date
CN118154987A | 2024-06-07

Family

ID=91290323

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410466751.6A (Pending) | CN118154987A (en) | 2024-04-17 | 2024-04-17

Country Status (1)

Country | Link
CN (1) | CN118154987A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119311894A (en)* | 2024-12-18 | 2025-01-14 | 北京数科网维技术有限责任公司 | A method, device and equipment for processing archive file information


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
