Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
As shown in FIG. 1, the method for analyzing group behavior based on multi-modal information fusion and dynamic updating comprises the following steps:
S1, information integration and dynamic updating: multi-modal information is obtained and preprocessed, features are extracted, the features undergo representation conversion, alignment and fusion, and a stream processing framework is adopted for real-time processing and dynamic updating of the multi-modal information;
As shown in FIG. 2, in this embodiment, the multi-modal information is obtained and preprocessed, specifically:
Multi-modal data is collected from a plurality of data sources including text, images, audio and video, with text data denoted T_i, image data I_i, audio data A_i, and video data V_i;
The different types of data are cleaned and standardized, the text data being processed as follows:
The text is tokenized, stop words are removed, special tokens are added, and the tokens are converted into corresponding IDs; an attention mask is then generated, i.e. a vector of the same length as the token ID sequence indicating which positions are valid tokens and which are padding; finally a segment ID is generated, marking which tokens belong to the first sentence and which to the second, and three vectors are output: T_i^id = (id_1, id_2, …, id_sequence_length), the attention mask and the segment ID, where T_i^id is the generated token ID sequence and sequence_length is its length;
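By way of a non-limiting illustration, the text preprocessing step above may be sketched as follows in Python; the HuggingFace transformers package, the bert-base-uncased checkpoint and the maximum sequence length are assumptions of the example (stop-word removal is omitted for brevity), not features of the embodiment:

```python
# Illustrative sketch only; assumes the "transformers" package and the
# "bert-base-uncased" checkpoint, and omits stop-word removal.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def preprocess_text(sentence_a, sentence_b=None, max_len=128):
    # Tokenizes, adds special tokens ([CLS]/[SEP]), maps tokens to IDs,
    # pads/truncates to max_len, and returns the token ID sequence T_i^id,
    # the attention mask and the segment IDs.
    enc = tokenizer(sentence_a, sentence_b, padding="max_length",
                    truncation=True, max_length=max_len)
    return enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]

ids, mask, segments = preprocess_text("Group behavior analysis.", "Multi-modal fusion.")
```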
The processing procedure of the image data is as follows:
The size of the image is adjusted, giving an output of shape H × W × C, where H and W are the height and width of the target picture and C is the number of channels;
according to the formula:
I_norm = (I_resized − μ) / σ
the adjusted image is normalized, where I_resized is the resized image, μ is the mean of the image dataset and σ is the standard deviation; the final output is the normalized image matrix I_norm of shape H × W × C;
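As a minimal sketch of the image preprocessing step (the target size of 224×224 and the per-channel dataset statistics are assumed values, not prescribed by the embodiment):

```python
# Minimal sketch; Pillow and NumPy are assumed, and the constants below
# are illustrative rather than prescribed by the embodiment.
import numpy as np
from PIL import Image

TARGET_H, TARGET_W = 224, 224            # H, W of the target picture (assumed)
MU = np.array([0.485, 0.456, 0.406])     # assumed per-channel dataset mean
SIGMA = np.array([0.229, 0.224, 0.225])  # assumed per-channel standard deviation

def preprocess_image(path):
    img = Image.open(path).convert("RGB").resize((TARGET_W, TARGET_H))
    x = np.asarray(img, dtype=np.float32) / 255.0   # H x W x C, scaled to [0, 1]
    return (x - MU) / SIGMA                          # I_norm = (I_resized - mu) / sigma
```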
The processing of the audio data is as follows:
the audio spectrum is obtained using a fast Fourier transform according to the following formula:
A_i(f) = Σ_{k=0}^{N−1} a_k · e^(−j2πfk/N)
where A_i(f) is the audio signal in the frequency domain, a_k is the k-th time-domain sample, f is the frequency, and N is the number of sampling points;
Denoising is performed using spectral subtraction according to the following formula:
|A_i^denoised(f)| = max(|A_i(f)| − |N̂(f)|, 0)
where N̂(f) is the noise spectrum estimated in the mute section;
The denoised signal A_i^denoised is finally obtained in the form:
A_i^denoised = (a_1^denoised, a_2^denoised, …, a_n^denoised)
where each element a_k^denoised corresponds to the amplitude of the denoised audio signal at time k, and n is the total length of the denoised audio signal;
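A compact sketch of the FFT and spectral-subtraction denoising described above follows; estimating the noise spectrum from a leading segment assumed to be silent is an illustrative choice:

```python
# Illustrative sketch; the leading segment is assumed to be silence for
# the noise estimate, and the original phase is reused for reconstruction.
import numpy as np

def spectral_subtract(audio, noise_len=1024):
    noise_est = np.abs(np.fft.fft(audio[:noise_len], n=len(audio)))  # noise spectrum from the mute section
    spectrum = np.fft.fft(audio)                                     # A_i(f): frequency-domain signal
    mag = np.maximum(np.abs(spectrum) - noise_est, 0.0)              # subtract the estimated noise magnitude
    denoised = np.fft.ifft(mag * np.exp(1j * np.angle(spectrum)))    # back to the time domain
    return denoised.real                                             # A_i^denoised = (a_1, ..., a_n)
```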
The processing of the video data is as follows:
The video is decoded: the video stream is decoded into a number of independent frames, and a frame sequence V_i^decoded of shape N × H × W × C is output, where N is the number of frames, H and W are the height and width of the frames, and C is the number of color channels;
Frame sampling: key frames are sampled from the frame sequence, and the sampled frame sequence V_i^sampled is output;
Frame preprocessing: each frame is resized and normalized so that it is suitable for model input;
Finally, the preprocessed frame sequence V_i^preprocessed is output as a tensor whose dimensions are the number of sampled frames, the target frame height and width, and the number of channels;
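A non-limiting sketch of the video preprocessing step using OpenCV; the sampling stride (every tenth frame rather than true key-frame detection) and the target frame size are assumptions of the example:

```python
# Illustrative sketch; uniform-stride sampling stands in for key-frame
# selection, and the stride and target size are assumed values.
import cv2
import numpy as np

def preprocess_video(path, stride=10, size=(224, 224)):
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()          # decode into independent frames (V_i^decoded)
        if not ok:
            break
        if idx % stride == 0:           # frame sampling (V_i^sampled)
            frame = cv2.resize(frame, size)
            frames.append(frame.astype(np.float32) / 255.0)  # resize + normalize
        idx += 1
    cap.release()
    if not frames:
        return np.empty((0, size[1], size[0], 3), dtype=np.float32)
    return np.stack(frames)             # V_i^preprocessed: N' x H x W x C
```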
In this embodiment, the text feature extraction is specifically:
Semantic features are extracted from the text data using natural language processing (NLP): the processed text is input into a BERT model, whose input comprises the token IDs, the attention mask and the segment IDs; the input token ID sequence T_i^id is first converted by an embedding layer into word vectors of fixed dimension with shape E ∈ R^(sequence_length×hidden_size), where hidden_size is the hidden-layer dimension, and each input vector is then processed in the model by a multi-head self-attention mechanism that computes its dependency on the other positions in the sequence according to the following formula:
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
where Q, K, V are the query, key and value matrices respectively, obtained by different linear transformations, and d_k is a scaling factor; the outputs of the multiple attention heads are then concatenated:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W^O
where Concat(·) is the concatenation function and head_i = Attention(Q_i, K_i, V_i); finally the vector at each position passes through a feed-forward neural network to further process its semantic information:
FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2
where W_1 and W_2 are learnable weight matrices, b_1 and b_2 are bias terms, and x is the input from the previous step;
The output of each layer is combined with its input through a residual connection, and layer normalization is applied according to the formula:
LayerNorm(x+SubLayer(x))
Two outputs are finally produced: the first is a sequence feature matrix, a three-dimensional tensor providing a vector representation of each word in the sequence, and the second is a sentence feature vector, a two-dimensional tensor providing an overall vector representation of the whole sequence;
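By way of illustration, the two BERT outputs described above (the per-token sequence features and the sentence-level vector) can be obtained as follows; the checkpoint name is an assumption of the example:

```python
# Illustrative sketch; assumes the "transformers" package and the
# "bert-base-uncased" checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def extract_text_features(text):
    enc = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**enc)
    seq_features = out.last_hidden_state   # (batch, sequence_length, hidden_size)
    sent_features = out.pooler_output      # (batch, hidden_size)
    return seq_features, sent_features
```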
The extraction of image features is specifically as follows:
Visual features are extracted from the image data using a convolutional neural network (CNN) deep learning method: the preprocessed image is input and a convolution operation is first performed:
F_ij = Σ_{m=1}^{H_f} Σ_{n=1}^{W_f} Σ_{c=1}^{C} I(i+m, j+n, c) · W_mnc + b
where F_ij is a pixel value of the convolution feature map, W_mnc is a weight of the convolution kernel, b is a bias term, H_f and W_f are the height and width of the convolution kernel respectively, and C is the number of channels of the input image I;
A convolution feature map F of shape H' × W' × N_f is output, where N_f is the number of convolution kernels, representing the depth, i.e. the number of channels, of the output feature map;
The result is processed by a ReLU nonlinear activation function to introduce nonlinearity:
A_ij = ReLU(F_ij) = max(0, F_ij)
where A_ij is a pixel value of the activated feature map, and the shape of the activated feature map A is the same as that of the convolution feature map;
The feature map is downsampled by max pooling, reducing its spatial dimensions while retaining the important feature information:
P_ij = max(A(i, j), A(i+1, j), A(i, j+1), A(i+1, j+1))
the pooled feature map P has shape H'' × W'' × N_f, where H'' and W'' are the height and width after pooling;
After multiple layers of convolution, activation and pooling, the extracted feature map is flattened into a one-dimensional vector, i.e. P_flat = Flatten(P), and input to the fully connected layer, which further combines and extracts global features:
z = W_fc · P_flat + b_fc
where W_fc is the weight matrix of the fully connected layer and b_fc is the bias term;
Finally outputting the feature vector of the image;
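A minimal PyTorch sketch of the convolution, ReLU, max-pooling, flattening and fully connected steps described above; the layer sizes and the 224×224 input are illustrative assumptions:

```python
# Illustrative sketch; layer widths, kernel size and input resolution are
# assumed values and do not limit the embodiment.
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    def __init__(self, in_channels=3, feature_size=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)  # convolution: F
        self.relu = nn.ReLU()                                             # A = max(0, F)
        self.pool = nn.MaxPool2d(2)                                       # P: max pooling
        self.fc = nn.Linear(32 * 112 * 112, feature_size)                 # z = W_fc . P_flat + b_fc

    def forward(self, x):                        # x: (batch, C, H, W), H = W = 224 assumed
        p = self.pool(self.relu(self.conv(x)))
        return self.fc(torch.flatten(p, 1))      # image feature vector

features = ImageFeatureExtractor()(torch.randn(1, 3, 224, 224))
```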
the audio feature extraction is specifically as follows:
Sound features are extracted from the audio data using Mel-frequency cepstral coefficients (MFCC); a short-time Fourier transform is first applied to the processed audio:
A_i(m, f) = Σ_{k=0}^{N−1} a_(m·hop_size+k) · e^(−j2πfk/N)
where m is the frame index, f is the frequency, N is the number of samples per frame and hop_size is the frame shift; the output frequency-domain signal A_i(m, f) is a two-dimensional matrix containing the spectral information of the individual frames;
The Mel frequency scale is then applied:
f_mel = 2595 · log10(1 + f_0 / 700)
where f_0 is the ordinary frequency and f_mel is the Mel frequency; the Mel spectrum A_i^mel(m, f_mel) is output, representing the spectral information of the audio signal on the Mel scale;
The Mel-frequency cepstral coefficients are calculated by applying a discrete cosine transform to the logarithm of the Mel spectrum:
MFCC_i(m, c) = Σ_{k=1}^{K} log(A_i^mel(m, k)) · cos(π·c·(k − 0.5)/K)
where c is the MFCC coefficient index and K is the number of Mel filter bands; the first 13 coefficients are taken, i.e. c = 1, …, 13;
The finally extracted MFCC features are represented as a two-dimensional array;
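As an illustrative sketch, the STFT, Mel filtering and cepstral steps above are commonly obtained with the librosa library; the sample rate and hop length are assumptions of the example:

```python
# Illustrative sketch; librosa bundles the STFT, Mel filter bank and DCT
# steps, and the parameters shown are assumed values.
import numpy as np
import librosa

def extract_mfcc(audio, sr=16000, n_mfcc=13, hop_length=512):
    # Returns a two-dimensional array of shape (n_mfcc, number_of_frames),
    # i.e. the first 13 cepstral coefficients per frame.
    return librosa.feature.mfcc(y=audio.astype(np.float32), sr=sr,
                                n_mfcc=n_mfcc, hop_length=hop_length)
```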
The video feature extraction is specifically as follows:
Key-frame features and dynamic information are extracted from the video data by combining image feature extraction with action recognition techniques: first, the CNN is used to extract static features from the processed video, giving a static feature vector z for each frame with shape z ∈ R^feature_size; then an LSTM is used to extract dynamic features: for each time step, the LSTM processes the input feature z_t in sequence and combines the hidden state h_{t−1} and cell state C_{t−1} of the previous time step to generate the hidden state h_t and cell state C_t of the current time step, the processing of each time step being:
f_t = σ(W_f·[h_{t−1}, z_t] + b_f), i_t = σ(W_i·[h_{t−1}, z_t] + b_i), C̃_t = tanh(W_C·[h_{t−1}, z_t] + b_C),
C_t = f_t·C_{t−1} + i_t·C̃_t, o_t = σ(W_o·[h_{t−1}, z_t] + b_o),
h_t = o_t · tanh(C_t)
where σ(·) denotes the sigmoid function;
Finally, the static features and the dynamic features are obtained.
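A non-limiting PyTorch sketch of the per-frame CNN followed by an LSTM over time, as described above; the small frame CNN, feature_size and hidden_size are simplified assumptions of the example:

```python
# Illustrative sketch; the small frame CNN and the dimensions are assumed
# values standing in for the embodiment's CNN.
import torch
import torch.nn as nn

class VideoFeatureExtractor(nn.Module):
    def __init__(self, feature_size=256, hidden_size=128):
        super().__init__()
        self.frame_cnn = nn.Sequential(               # static feature z_t per frame
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feature_size),
        )
        self.lstm = nn.LSTM(feature_size, hidden_size, batch_first=True)  # dynamic features h_t, C_t

    def forward(self, frames):                         # frames: (batch, N, C, H, W)
        b, n = frames.shape[:2]
        z = self.frame_cnn(frames.flatten(0, 1)).view(b, n, -1)  # (batch, N, feature_size)
        h_seq, (h_n, c_n) = self.lstm(z)
        return z, h_n[-1]                              # static features and final dynamic feature

static, dynamic = VideoFeatureExtractor()(torch.randn(2, 8, 3, 64, 64))
```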
In this embodiment, the features are subjected to representation conversion, alignment and fusion, specifically:
The extracted features are converted into a unified vector representation to facilitate comparison and fusion between data of different modalities;
The data of the different modalities are synchronized in time and space through timestamps, event markers or other alignment means to ensure the relevance between the data: in the temporal dimension, the modalities that have a time axis, namely video and audio, are aligned according to the minimum time step, and in the spatial dimension the modalities are aligned based on spatial anchor points; zero-mean unit-variance normalization is then applied:
X_norm = (X − μ_X) / σ_X
where μ_X is the mean of the feature vector and σ_X is the standard deviation;
The normalized feature vectors T_norm, I_norm, A_norm and V_norm are finally output, i.e. the feature vectors of the four corresponding modalities after zero-mean unit-variance normalization;
The feature fusion specifically comprises the following steps:
The feature vectors of the different modalities are concatenated to form a comprehensive feature vector F_concat ∈ R^(batch_size×(dim_T+dim_I+dim_A+dim_V)), the dimension of the concatenated feature vector being the sum of the dimensions of all modalities, where batch_size is the number of data samples processed in each training or inference pass, its specific value being determined by the actual situation, and dim_T, dim_I, dim_A and dim_V are the dimensions of the feature vectors of the four modalities;
According to the importance of the different modalities, a weighted summation of the feature vectors is carried out:
F_weighted = w_T·T_norm + w_I·I_norm + w_A·A_norm + w_V·V_norm
the weighted comprehensive feature vector F_weighted ∈ R^(batch_size×(dim_T+dim_I+dim_A+dim_V)) forms the comprehensive feature, the weights being determined empirically and according to the usage scenario;
Finally, dimension reduction is performed: PCA is used to reduce the dimensionality of the high-dimensional features in order to lower computational complexity and noise. The input data is first standardized:
F_std = (F_weighted − μ) / σ
where μ is the mean of each feature and σ is the standard deviation; the standardized data is used to calculate a covariance matrix describing the linear correlation between features:
C = (1 / (batch_size − 1)) · F_std^T · F_std
The covariance matrix C is output, and eigenvalue decomposition is performed on it to obtain the eigenvalues and corresponding eigenvectors, the eigenvalues representing the variance of the data along each eigenvector direction:
C·v_i = λ_i·v_i
where λ_i is the i-th eigenvalue and v_i is the corresponding eigenvector;
The eigenvalue vector λ = [λ_1, λ_2, …, λ_dim] and the eigenvector matrix V ∈ R^(dim×dim) are output, and the eigenvectors corresponding to the n largest eigenvalues are selected as principal components according to the ordering of the eigenvalues:
V_pca = [v_1, v_2, …, v_n]
where n is the number of principal components selected, determined from the cumulative variance contribution;
projecting the standardized data onto the selected principal component space to obtain a feature representation after dimension reduction:
F_pca = F_std · V_pca
A matrix of low-dimensional feature representations is finally output; it contains the projections of the original high-dimensional data onto the principal component directions, each sample having one projection value per principal component, and these projection values form the new reduced features F_pca ∈ R^(batch_size×n), where n is the target dimension after reduction.
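By way of illustration, the concatenation, weighting and PCA steps may be sketched as follows; because the modality feature dimensions generally differ, this sketch applies the modality weights block-wise before concatenation rather than as a direct sum, and the weights and target dimension are assumed values:

```python
# Illustrative sketch; the weights, target dimension and block-wise
# weighting scheme are assumptions of the example.
import numpy as np
from sklearn.decomposition import PCA

def fuse_and_reduce(T, I, A, V, weights=(0.4, 0.3, 0.15, 0.15), n_components=64):
    # T, I, A, V: normalized feature matrices of shape (batch_size, dim_modality)
    f_concat = np.concatenate([T, I, A, V], axis=1)               # F_concat
    f_weighted = np.concatenate(
        [w * x for w, x in zip(weights, (T, I, A, V))], axis=1)   # weighted comprehensive feature
    f_std = (f_weighted - f_weighted.mean(axis=0)) / (f_weighted.std(axis=0) + 1e-8)
    pca = PCA(n_components=n_components)                          # eigen-decomposition of the covariance matrix
    return pca.fit_transform(f_std)                               # F_pca: (batch_size, n)
```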
In this embodiment, a stream processing framework is adopted for real-time processing and dynamic updating of the multi-modal information, specifically:
Real-time processing and dynamic updating of the multi-modal information are performed using the stream processing framework Apache Flink;
The incoming data stream is cleaned and features are extracted in real time through Flink; assuming the model weights are W(t), the dynamically updated model weights are expressed as:
W(t+1) = W(t) + ΔW(t)
where ΔW(t) is the weight adjustment computed from the newly arrived data;
The data streams are aggregated and analyzed in real time through Flink: a window operation aggregates the data stream within a specific time window, and the aggregated feature over the time window [t, t+Δt] is expressed as:
F_[t, t+Δt] = Agg({F(τ) | τ ∈ [t, t+Δt]})
where Agg(·) is the aggregation operation (for example a mean or sum) applied within the window;
generating real-time analysis results by classifying, clustering or other analysis operations on the aggregated features;
Flink iterative operations continuously optimize the model or the processing strategy: the fed-back analysis results are used to update the model parameters or adjust the feature extraction method so that the system performs better on the next batch of data, the iterative update process being expressed as:
F^(k+1) = F^(k) + α·ΔF^(k)
where F^(k) is the feature after the k-th iteration, ΔF^(k) is the adjustment derived from the fed-back analysis result, and α is the update step size.
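A framework-agnostic sketch of the window aggregation and the iterative update rule follows; in a deployment these operations would be expressed with Apache Flink's window and iteration operators, which are not reproduced here, and the mean aggregation and step size are assumptions of the example:

```python
# Illustrative, framework-agnostic sketch; not the Apache Flink API.
import numpy as np

def window_aggregate(stream, t, delta_t):
    """Aggregate (here: average) the feature vectors whose timestamp lies in [t, t + delta_t)."""
    window = [f for ts, f in stream if t <= ts < t + delta_t]
    return np.mean(window, axis=0) if window else None

def iterative_update(f_k, delta_f, alpha=0.1):
    """F^(k+1) = F^(k) + alpha * dF^(k), with alpha the update step size."""
    return f_k + alpha * delta_f
```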
S2, introducing a self-learning mechanism and a self-optimizing mechanism;
As shown in FIG. 3, in this embodiment, the introduction of the self-learning mechanism specifically includes:
New data is automatically annotated and inferred using the trained BERT, CNN and LSTM models, and labels or classification results are generated automatically;
The self-updating of the model is specifically as follows:
Model parameters are updated as new data arrives by online gradient descent, adapting to the dynamic changes of the data; when new text data arrives, the weights are updated using an online-learning extension of BERT: the new data X_new, the true label y_new and the current model parameters W are input, and the online gradient descent update formula is:
W_new = W − η·∇_W L(X_new, y_new; W)
where η is the learning rate and ∇_W L is the gradient of the loss function with respect to the model parameters; the updated model parameters W_new are output;
The learning feedback mechanism is specifically as follows:
The predicted label is compared with the true label, the adaptive learning rate and the adaptive loss-function weight are adjusted, and the adjusted learning rate, loss function or other hyperparameters are finally obtained;
The model's predicted label ŷ and the true label y_true are input, and the learning rate and the weight of the loss function are adjusted according to the prediction error, which is calculated as:
ε = |ŷ − y_true|
the learning rate adjustment formula is:
η_new = η_old · (1 − α·ε²)
wherein α is an adjustment coefficient;
The weight of the loss function is adjusted:
λ_new = λ_old + β·ε
Wherein β is an adjustment coefficient;
The adjusted learning rate η_new and the loss function weight λ_new are output.
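A minimal sketch of the online update and learning-feedback rules described above; a linear model with squared loss stands in for the BERT/CNN/LSTM models, and the adjustment coefficients are assumed values:

```python
# Illustrative sketch; a linear model with squared loss stands in for the
# embodiment's models, and alpha/beta are assumed adjustment coefficients.
import numpy as np

def online_update(W, x_new, y_new, eta):
    """One online gradient-descent step: W_new = W - eta * grad_W L."""
    y_pred = float(x_new @ W)
    grad = 2.0 * (y_pred - y_new) * x_new      # gradient of the squared loss
    return W - eta * grad

def adjust_hyperparameters(y_pred, y_true, eta_old, lam_old, alpha=0.1, beta=0.05):
    eps = abs(y_pred - y_true)                 # prediction error
    eta_new = eta_old * (1 - alpha * eps ** 2) # eta_new = eta_old * (1 - alpha * eps^2)
    lam_new = lam_old + beta * eps             # lambda_new = lambda_old + beta * eps
    return eta_new, lam_new
```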
In this embodiment, the introduction of the self-optimization mechanism specifically includes:
Hyperparameter optimization, which aims to find, within a given hyperparameter space, the combination of hyperparameters that optimizes model performance;
The hyperparameters to be optimized and their possible value ranges {θ_1, θ_2, ..., θ_n} are defined, a genetic algorithm simulating natural evolution is used to select the optimal parameters, and the processes of selection, crossover and mutation are repeated until a convergence condition or a preset maximum number of generations is reached;
Model architecture optimization, specifically:
The structural parameters of the neural network are defined, and neural network architectures A_i are selected from the search space S, randomly or by a strategy, and evaluated:
G(A_i) = (1/|D_val|) · Σ_{(x,y)∈D_val} L(f_{A_i}(x), y)
where f_{A_i} is the model built from architecture A_i, G(A_i) is its average loss on the validation set D_val, and L is the loss function; through repeated sampling and evaluation, the architecture A* with the best performance on the validation set is selected as the final optimal architecture;
the computing resource optimization is specifically as follows:
The current computing resource state and task demands are first input, including CPU and GPU utilization, memory utilization and task priority; a task scheduling algorithm together with resource monitoring and allocation is adopted to adjust the resource allocation, the resource utilization being monitored in real time and the allocation dynamically adjusted to cope with changes in the system and in task execution; finally, the optimized resource allocation and task scheduling strategy is output to improve the overall efficiency and response speed of the system.
S3, generating a group portrait: demographic characteristics, geographical position distribution, behavioral and emotional psychological characteristics, social relations, and social media and economic characteristics are extracted from the multi-modal data to generate the group portrait;
As shown in FIG. 4, in this embodiment, the demographic feature extraction is specifically:
Information including, but not limited to, age, gender and occupation is extracted from the text data using a NER model based on conditional random fields, with the formula:
P(y|x) = (1/Z(x)) · exp(Σ_k λ_k·f_k(y, x))
where y is the label sequence, i.e. the extracted demographic characteristics, x is the input sequence, i.e. the text data, f_k(y, x) is a feature function, λ_k is its weight, and Z(x) is the normalization factor ensuring that the probability distribution is normalized;
The extracted demographic characteristics form a demographic feature vector D_i containing the numerical values of the various items of information;
The geographic position distribution is extracted specifically as follows:
Positions described in the text are converted into geographic coordinates using geocoding, and K-means clustering is used, minimizing the within-cluster sum of squares (WCSS), to identify the main areas of activity;
The extraction of behavioral and emotional psychological characteristics is specifically as follows:
User behavior is analyzed using time-series analysis and frequent pattern mining techniques, and user behavior patterns are analyzed and predicted using an autoregressive integrated moving average (ARIMA) model:
x_t = c + φ_1·x_{t−1} + φ_2·x_{t−2} + ... + φ_p·x_{t−p} + ε_t + θ_1·ε_{t−1} + ... + θ_q·ε_{t−q}
where c is a constant term, φ_1, …, φ_p are the autoregressive coefficients, θ_1, …, θ_q are the moving-average coefficients, and ε_t is the error term;
Frequent patterns are mined from the behavior data using the Apriori algorithm to identify behavior combinations that occur frequently; the identified behavior patterns and frequent behavior combinations generate a behavior feature vector B_i representing the user's characteristics in terms of behavior patterns;
Emotional tendencies in the text data are analyzed using a BERT emotion analysis model, the main topics in the text data are extracted using LDA topic modeling to analyze the points of attention and attitudes of the group, and the psychological characteristics of users are evaluated by combining the text analysis with the OCEAN model.
The social relationship feature extraction is specifically as follows:
Constructing a graph structure, defining nodes and edges, converting social network data into the graph structure, wherein the nodes represent individuals (such as users and devices), and the edges represent relationships (such as friend relationships and communication records) among the individuals;
Groups of tightly connected nodes in the social network are identified by modularity-based community detection:
Q = (1/(2m)) · Σ_{ij} [A_ij − (k_i·k_j)/(2m)] · δ(c_i, c_j)
where A_ij is an element of the adjacency matrix indicating whether a connection exists between nodes i and j, k_i and k_j are the degrees of nodes i and j, m is the total number of edges in the graph, and δ(c_i, c_j) is an indicator function whose value is 1 when i and j belong to the same community and 0 otherwise;
the social media and economic characteristics are extracted specifically as follows:
Analyzing the media type and frequency of the user contact, analyzing the source and preference of the user acquired information, extracting the expenditure type and the amount from the consumption record, and analyzing the consumption trend;
After the demographic characteristics, geographical position distribution, behavioral and emotional psychological characteristics, social relations, and social media and economic characteristics are extracted, the group portrait is intuitively displayed using bar charts, pie charts, histograms and network graphs, the behavioral trends of the group are analyzed, potential risks and challenges are assessed, and an entry point for downloading analysis reports is provided.
S4, group behavior analysis, as shown in FIG. 4, in this embodiment, the method specifically includes:
Group clustering, specifically:
The number of clusters is determined using the elbow method: the within-cluster sum of squares
WCSS = Σ_{j=1}^{k} Σ_{x∈C_j} ||x − μ_j||²
is calculated for different cluster numbers k to find the k value, i.e. the elbow point, beyond which increasing k no longer significantly reduces WCSS, where C_j is the j-th cluster and μ_j is its centroid; clustering is then performed with the K-means algorithm and the result is used to analyze group behavior: the cluster centers provide a concise description of the overall characteristics of each cluster, the intra-cluster differences reflect the consistency and diversity of member features within a group, and the inter-cluster differences are measured by the distances between different cluster centers, larger inter-cluster differences meaning that the characteristics of different groups differ significantly;
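A short scikit-learn sketch of the elbow method and the K-means clustering above; the candidate range of k is an assumption of the example:

```python
# Illustrative sketch; the range of candidate k values is assumed.
from sklearn.cluster import KMeans

def elbow_wcss(features, k_values=range(2, 11)):
    # inertia_ is the within-cluster sum of squares (WCSS) for each k.
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
            for k in k_values}

def cluster_groups(features, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    return km.labels_, km.cluster_centers_     # cluster assignments and centroids mu_j
```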
time series and emotion feature analysis, specifically:
The time-series features are taken as input, the group is divided into different behavior-pattern clusters using the K-means clustering algorithm, the different clusters are compared to identify their characteristic differences, and the causal relationships between time-series features are analyzed using the Granger causality test, whose core formula is:
Y_t = α_0 + Σ_{i=1}^{p} α_i·Y_{t−i} + Σ_{j=1}^{q} β_j·X_{t−j} + ε_t
If the coefficients β_j differ significantly from zero, X_t is considered a Granger cause of Y_t; finally, an LSTM time-series model is used to predict the future behavior trend of the group, and the BERT model is used for emotion analysis to obtain emotion polarity, emotion intensity and emotion trend;
the behavior association analysis specifically comprises the following steps:
converting the text data into a form in which each line represents a transaction, each transaction containing a set of items;
The Apriori algorithm is applied: the occurrence frequency of each item across all transactions is counted and its support calculated, frequent itemsets are screened according to a minimum support threshold, new candidate itemsets are generated from the frequent itemsets and their support calculated, and the steps of generating candidate itemsets and calculating support are repeated until no new frequent itemsets can be generated;
For each frequent itemset, all possible association rules are generated, and for a rule A -> B the confidence is calculated:
Confidence(A->B)=Support(A∪B)/Support(A)
meaningful association rules are screened out according to the minimum confidence threshold;
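By way of illustration, the Apriori mining and rule screening can be sketched with the mlxtend library (one possible implementation choice); the support and confidence thresholds are assumed values:

```python
# Illustrative sketch; mlxtend is one possible implementation choice and
# the thresholds are assumed values.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

def mine_rules(transactions, min_support=0.05, min_confidence=0.6):
    # transactions: a list of item lists, one transaction per line of text data
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
    frequent = apriori(onehot, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=min_confidence)
    return rules[["antecedents", "consequents", "support", "confidence"]]
```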
abnormality detection, specifically:
The average distance between an individual's behavior and its nearest neighbors is calculated using the K-nearest-neighbor algorithm, the distance between two points being:
d(X_i, X_j) = √(Σ_{k=1}^{n} (X_ik − X_jk)²)
where X_ik and X_jk are the values of data points X_i and X_j on the k-th feature, and n is the dimension of the features;
If the distance exceeds a preset threshold, the individual is likely to be abnormal; each abnormal point is analyzed and explained to find the reason for the deviation from the normal state, and scatter plots and box plots are used to visualize and clearly highlight the abnormal behavior.
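A minimal sketch of the K-nearest-neighbor distance check for abnormality detection; the number of neighbors and the distance threshold are assumptions of the example:

```python
# Illustrative sketch; k and the threshold are assumed values.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def detect_abnormal(behavior_features, k=5, threshold=2.0):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(behavior_features)
    dist, _ = nn.kneighbors(behavior_features)         # Euclidean distances d(X_i, X_j)
    avg_dist = dist[:, 1:].mean(axis=1)                # drop the self-distance, average over k neighbors
    return np.where(avg_dist > threshold)[0], avg_dist  # indices of likely abnormal individuals
```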
S5, visual output: the group behavior prediction results are output in a visual form, including:
The behavior trend graph is used for displaying the temporal change trend of group behaviors, each line representing the change of a different behavior pattern, so that the dynamic changes of behaviors in different time periods can be conveniently observed;
A group clustering graph, which is to display a clustering result of a group by adopting a two-dimensional or three-dimensional scatter diagram, wherein different colors or shapes represent different group clusters, and a mark in the center of the cluster is used for representing representative characteristics of the clusters so as to be convenient for identifying the similarity and the difference among the groups;
the association rule diagram is used for displaying association relations between behaviors by using the network diagram, nodes represent behaviors or events, connecting lines between the nodes represent association rules of the nodes, and the thickness and the color of the lines represent the strength and the confidence of association;
The abnormal behavior detection diagram is used for displaying abnormal behaviors in the group through a scatter diagram and a box diagram, wherein the scatter diagram is used for intuitively displaying differences between abnormal points and normal behaviors, and the box diagram is used for displaying distribution conditions of data and positions of abnormal values.
It should also be noted that, in this specification, terms such as "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.