CN119719366A - Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis - Google Patents

Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis

Info

Publication number
CN119719366A
Authority
CN
China
Prior art keywords
text
confidentiality
data
vector
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411539107.3A
Other languages
Chinese (zh)
Inventor
张玉臣
胡浩
汪永伟
范钰丹
刘鹏程
周洪伟
纪然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University Of Chinese People's Liberation Army Cyberspace Force
Original Assignee
Information Engineering University Of Chinese People's Liberation Army Cyberspace Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University Of Chinese People's Liberation Army Cyberspace Force
Priority to CN202411539107.3A
Publication of CN119719366A
Legal status: Pending

Abstract

The invention relates to the technical field of natural language processing, and in particular to a method and system for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis. The method comprises: constructing a confidentiality-domain text corpus and preprocessing its data; constructing an emotion analysis and judgment model and training it with text word vectors as sample data, the model comprising a BERT layer that mines text vectors based on contextual semantics and the sequential relationship between sentences, a BiLSTM network that extracts bidirectional text features from the text vectors, and a fully-connected output layer that classifies the text features and outputs the result; and using the trained model to predict the confidentiality-consciousness emotion tendency category of the target group and displaying the prediction visually. By mining the emotion tendencies implicit in texts related to the confidentiality consciousness of target personnel, the method assesses their confidentiality consciousness and can provide effective assistance for information security management.

Description

Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis
Technical Field
The invention relates to the technical field of natural language processing, in particular to a target personnel confidentiality consciousness assessment method and system based on multi-element emotion analysis.
Background
The key to confidentiality work is raising the confidentiality consciousness of personnel. Traditional methods of assessing confidentiality consciousness remain at a qualitative stage: they typically rely on questionnaires, rating scales and subjective evaluation, lack a scientific and reasonable evaluation model and a quantitative evaluation system, and therefore struggle to judge the confidentiality consciousness of personnel objectively and specifically.
Disclosure of Invention
Therefore, the invention provides a method and system for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis. Using emotion analysis and text preprocessing, it mines the emotion tendencies implicit in texts related to the confidentiality consciousness of target personnel in order to assess that consciousness, providing effective assistance for information security management in enterprises and public institutions.
According to the design scheme provided by the invention, in one aspect, a method for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis is provided, comprising:
Constructing a text corpus in the security domain, preprocessing data of the text corpus in the security domain to optimize text data in the corpus and convert texts into text word vectors, wherein the text word vectors are obtained by accumulating text word static vectors, text word position vectors and text sentence vectors;
Constructing an emotion analysis and judgment model, and training the emotion analysis and judgment model by using a text word vector as sample data, wherein the emotion analysis and judgment model comprises a BERT layer for mining the text vector based on text context semantics and a text sentence front-back sequential relationship, a BiLSTM network for extracting bidirectional text features from the text vector, and a fully-connected output layer for classifying and outputting the text features;
And collecting relevant texts of the privacy topics of the target crowd, predicting the privacy consciousness emotion tendency type of the target crowd by using the trained emotion analysis and judgment model, and visually displaying the prediction result.
In the method for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis, constructing the confidentiality-domain text corpus further comprises:
collecting confidentiality-topic texts, which comprise social network comments related to confidentiality topics, news related to confidentiality topics, and a self-built confidentiality-topic data set;
labeling each confidentiality-topic text with its confidentiality-consciousness emotion tendency category, and identifying the entities in the text through analysis;
expanding the confidentiality-topic text data through entity, synonym and antonym replacement, and constructing the confidentiality-domain text corpus from the expanded data.
In the method for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis, expanding the confidentiality-topic text data through entity, synonym and antonym replacement further comprises:
searching the corresponding entity category for replacement entities for the entities in the confidentiality-topic text, and generating new text data;
performing part-of-speech analysis on the confidentiality-topic text, finding the words eligible for replacement, replacing them with their synonyms and/or antonyms, and generating the corresponding new labels, thereby generating new text data.
As the target personnel security consciousness assessment method based on the multi-element emotion analysis, the method further carries out data preprocessing on a security field text corpus, and comprises the following steps:
filtering texts in a corpus, filtering and removing invalid texts, and removing redundant characters in text data, wherein the invalid texts comprise texts with empty related fields and texts irrelevant to secret topics, and the redundant characters comprise missing values, repeated values and emoticons;
Cleaning the text data by using a regular expression matching method to remove meaningless characters in the text data;
Splitting, through data chunking, long texts and large files that exceed a threshold into a specified number of short texts and small files;
Performing word segmentation on the text data by utilizing Wordpiece, masking word segmentation results by utilizing masks, adding a classification token mark at the beginning of each word segmentation sequence, and utilizing the last layer output corresponding to the classification token to represent the whole word segmentation sequence information, inserting the segmentation token mark between sentences in the same word segmentation sequence, and adding embedded vector information indicating the position of each token for each token;
and converting the text word segmentation sequences with different lengths into word vectors with standard lengths.
In the method for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis, the fully-connected output layer further comprises a fully-connected layer that converts the multi-dimensional feature vector into a specified low-dimensional feature vector, and an output layer that classifies the low-dimensional feature vector by emotion degree and type to determine the confidentiality-consciousness emotion tendency category, wherein the output layer uses a softmax function as the classifier, computes the probability of each confidentiality-consciousness emotion tendency category with the softmax function, and selects the category with the highest probability as the final output emotion tendency category.
As the target personnel security consciousness assessment method based on the multi-element emotion analysis, the invention further uses the text word vector as sample data to train an emotion analysis and judgment model, and comprises the following steps:
Dividing the text word vector into a training sample set and a test sample set according to a specified proportion;
And performing iterative training on the emotion analysis and judgment model based on a preset model loss function by utilizing text word vectors in a training sample set, and performing performance evaluation, adjustment and optimization on the trained emotion analysis and judgment model by utilizing a test sample set so as to obtain the emotion analysis and judgment model with the model training effect meeting the expected requirement.
As the target personnel security consciousness assessment method based on the multi-element emotion analysis, the method further carries out visual display on the prediction result and comprises the following steps:
And carrying out statistical analysis and visual display on the security consciousness emotion tendency category prediction results of the plurality of text data reactions by using a visual display platform, wherein the visual display platform adopts a B/S architecture, and the statistical analysis comprises statistical analysis of text structures and statistical analysis of the security consciousness emotion tendency categories.
In still another aspect, the invention also provides a system for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis, which comprises a corpus construction module, a model training module and a target research and judgment module, wherein,
The corpus construction module is used for constructing a text corpus in the security field, preprocessing data of the text corpus in the security field to optimize text data in the corpus and convert texts into text word vectors, wherein the text word vectors are obtained by accumulating text word static vectors, text word position vectors and text sentence vectors;
The model training module is used for constructing an emotion analysis and judgment model and training the emotion analysis and judgment model by taking a text word vector as sample data, wherein the emotion analysis and judgment model comprises a BERT layer for mining the text vector based on text context semantics and a text sentence front-back sequence relationship, a BiLSTM network for extracting bidirectional text features of the text vector and a fully-connected output layer for classifying and outputting the text features;
The target research and judgment module is used for collecting the related texts of the privacy topics of the target crowd, predicting the privacy consciousness emotion tendency type of the target crowd by utilizing the trained emotion analysis and judgment model, and visually displaying the prediction result.
The invention has the beneficial effects that:
The invention collects text data, including comments, opinions and suggestions, from inside and outside an enterprise or organization, performs multivariate emotion analysis on the processed data, and uses a BERT-BiLSTM model to measure the emotion tendencies of personnel with respect to confidentiality consciousness. It thereby uncovers the confidentiality consciousness level of personnel along with possible risks and problems, assesses the overall confidentiality consciousness level of the relevant personnel, and supports targeted improvement of information security management. Furthermore, experimental data show that, compared with the traditional emotion analysis model FastText, the BERT-BiLSTM model in this scheme achieves higher accuracy and practicability in emotion analysis of confidentiality-topic texts, has certain advantages over other models, and can provide effective assistance for confidentiality work.
Drawings
FIG. 1 is a schematic diagram of a target personnel security consciousness assessment flow based on multiple emotion analysis in an embodiment;
FIG. 2 is a text data expansion flow diagram in an embodiment;
FIG. 3 is a schematic diagram of an emotion analysis and judgment model in an embodiment;
FIG. 4 is a schematic diagram of BERT layer structure in the embodiment;
FIG. 5 is a schematic diagram of a Transformer structure in an embodiment;
FIG. 6 is a diagram of a BiLSTM network architecture in an embodiment;
FIG. 7 is a schematic illustration of a secret awareness assessment algorithm in an embodiment;
FIG. 8 is a visual presentation of emotion analysis of a plurality of text data in an embodiment;
FIG. 9 is an example of experimental dataset fragments in an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and the technical scheme, in order to make the objects, technical schemes and advantages of the present invention more apparent.
The key to confidentiality work is raising the confidentiality consciousness of personnel. Traditional methods of assessing confidentiality consciousness remain at a qualitative stage: they typically rely on questionnaires and rating scales, lack a scientific evaluation model and a quantitative evaluation system, and struggle to judge the confidentiality consciousness of personnel objectively. Therefore, the embodiment of the invention provides a method for assessing the confidentiality consciousness of target personnel based on multivariate emotion analysis, which specifically comprises:
Constructing a text corpus in the security domain, preprocessing data of the text corpus in the security domain to optimize text data in the corpus and convert texts into text word vectors, wherein the text word vectors are obtained by accumulating text word static vectors, text word position vectors and text sentence vectors;
Constructing an emotion analysis and judgment model, and training the emotion analysis and judgment model by using a text word vector as sample data, wherein the emotion analysis and judgment model comprises a BERT layer for mining the text vector based on text context semantics and a text sentence front-back sequential relationship, a BiLSTM network for extracting bidirectional text features from the text vector, and a fully-connected output layer for classifying and outputting the text features;
And collecting relevant texts of the privacy topics of the target crowd, predicting the privacy consciousness emotion tendency type of the target crowd by using the trained emotion analysis and judgment model, and visually displaying the prediction result.
Referring to FIG. 1, the process runs from original text collection to visual analysis of the final result. In the text collection stage, social network comments on confidentiality topics, news related to confidentiality topics and self-built data set texts are obtained and consolidated into a confidentiality-domain text corpus, whose texts serve as input to the data preprocessing stage. The data preprocessing stage has two tasks: first, optimizing the text data in the corpus through invalid text rejection and data cleaning, which improves the efficiency and accuracy of subsequent processing; second, converting the text, through word segmentation, padding, word embedding and vectorization, into labeled word vectors that can be fed directly into the next layer. In the emotion analysis and judgment stage, a vector representation containing contextual semantic information is obtained from the BERT pre-trained model, and the [CLS] output is selected as the input from which the BiLSTM network model learns. After passing through the fully-connected layer, the result is passed to the output layer and on to the visualization stage, where interaction takes place and the classification result is displayed visually.
Wherein, the construction of the text corpus in the security domain can be designed to comprise:
Collecting a secret topic related text, wherein the secret topic related text comprises a secret topic related social network comment, a secret topic related news and a secret topic self-built data set;
labeling the privacy consciousness emotion tendency type of the text related to the privacy topic, and identifying the entity in the text related to the privacy topic through analysis;
expanding the confidentiality-topic text data through entity, synonym and antonym replacement, and constructing the confidentiality-domain text corpus from the expanded data.
In this embodiment, because little data about confidentiality is available online, part of the text data comes from social network comments and from news texts published by official accounts and the media, and the rest comes from self-built confidentiality-topic text data. To obtain more data, the data are expanded through synonym and antonym replacement and through entity replacement.
Specifically, expanding the confidentiality-topic text data through entity, synonym and antonym replacement can be designed to include:
searching the corresponding entity category for replacement entities for the entities in the confidentiality-topic text, and generating new text data;
performing part-of-speech analysis on the confidentiality-topic text, finding the words eligible for replacement, replacing them with their synonyms and/or antonyms, and generating the corresponding new labels, thereby generating new text data.
As shown in FIG. 2, entity recognition and part-of-speech analysis are performed on the word segmentation result of the original labeled text. Replacement entities are searched for in the corresponding entity category to generate new text; words of different parts of speech are replaced with their synonyms and antonyms, and a new text label is generated by reference to the original label. The newly labeled text data are then confirmed through manual checking. Table 1 gives an example of corpus replacement.
Table 1 Corpus replacement example
Original corpus | Derived corpus
I even though the company is slightly sensitive and does not want to see | I even though the company does not want to see with a little privacy
These prescriptions in the unit feel good | These prescribed sensations in the unit are also poor
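The replacement-based expansion described above can be sketched as follows. This is only a minimal illustration: the lookup tables `SYNONYMS`, `ANTONYMS` and `ENTITIES`, the label names, and the rule that an antonym replacement flips the label are all assumptions made for the example. The embodiment itself relies on entity recognition, part-of-speech analysis and manual checking of the new labels.

```python
# Hypothetical lookup tables; a real system would use a thesaurus and an NER model.
SYNONYMS = {"sensitive": "confidential"}
ANTONYMS = {"good": "poor"}
ENTITIES = {"ORG": ["company", "unit", "department"]}
# Assumed labeling rule: an antonym replacement flips the polarity of the label.
FLIP = {"positive": "negative", "negative": "positive", "neutral": "neutral"}


def augment(tokens, label):
    """Return (tokens, label) variants derived from one labeled, segmented sample."""
    variants = []
    for i, tok in enumerate(tokens):
        # Entity replacement: swap in the other entities of the same category.
        for members in ENTITIES.values():
            if tok in members:
                for other in members:
                    if other != tok:
                        variants.append((tokens[:i] + [other] + tokens[i + 1:], label))
        # Synonym replacement keeps the original label.
        if tok in SYNONYMS:
            variants.append((tokens[:i] + [SYNONYMS[tok]] + tokens[i + 1:], label))
        # Antonym replacement generates a new, flipped label.
        if tok in ANTONYMS:
            variants.append((tokens[:i] + [ANTONYMS[tok]] + tokens[i + 1:], FLIP[label]))
    return variants


samples = augment(["these", "rules", "in", "the", "unit", "feel", "good"], "positive")
```

Every derived sample would still need the manual check mentioned above before it enters the corpus.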
The data preprocessing of the text corpus in the security domain can be designed to include:
filtering texts in a corpus, filtering and removing invalid texts, and removing redundant characters in text data, wherein the invalid texts comprise texts with empty related fields and texts irrelevant to secret topics, and the redundant characters comprise missing values, repeated values and emoticons;
Cleaning the text data by using a regular expression matching method to remove meaningless characters in the text data;
Splitting, through data chunking, long texts and large files that exceed a threshold into a specified number of short texts and small files;
Performing word segmentation on the text data by utilizing Wordpiece, masking word segmentation results by utilizing masks, adding a classification token mark at the beginning of each word segmentation sequence, and utilizing the last layer output corresponding to the classification token to represent the whole word segmentation sequence information, inserting the segmentation token mark between sentences in the same word segmentation sequence, and adding embedded vector information indicating the position of each token for each token;
and converting the text word segmentation sequences with different lengths into word vectors with standard lengths.
The data preprocessing stage mainly comprises six parts: invalid text rejection, data cleaning, sentence segmentation, padding, word embedding and vectorization. Invalid text rejection screens the texts in the corpus extracted in the previous stage, filtering out texts whose relevant fields are empty and texts irrelevant to confidentiality topics, and removing missing values, duplicate values and emoticons. This avoids program errors that would affect the training and testing results of the model, optimizes the internal structure of the data, and improves the processing efficiency and accuracy of the model. Data cleaning mainly uses regular expression matching to remove HTML tag characters such as </SUB>, illegal Unicode characters and other meaningless characters from the crawled data, so as to normalize the data format. The sentence segmentation part begins with data chunking: long texts and large files are split into short texts and small files and divided into a certain number of parts, which benefits the subsequent generation of training data. The word embedding part splits a single sentence and converts it into a token representation that serves as the canonical input of the BERT model. Because most semantics in Chinese are expressed by words rather than single characters, a Whole Word Masking (WWM) mechanism, which brings in Chinese vocabulary information, is needed for the Chinese masked language model (MLM) task. The text is first segmented with WordPiece and the result is then masked; a special classification token ([CLS]) is added at the beginning of each sequence, and the last-layer output corresponding to this token represents the information of the whole sequence.
Within the same sequence, a separator token ([SEP]) is inserted between sentences, and an embedding vector indicating its position is added to each token, so that different sentences can be distinguished. An example of token representation is given in Table 2.
Table 2 BERT word embedding structure
The filling part and the vectorization part convert the data texts with different lengths into vector matrix formats with standard lengths, so that the BERT layer can process the data texts conveniently.
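The rejection, cleaning, chunking and padding steps above might look roughly like the following sketch. The regular expressions, the keyword-based relevance test, the 128-character chunk size and the function names are illustrative assumptions rather than details taken from the embodiment.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                                # HTML-style tags such as </SUB>
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # common emoji ranges (assumed)
SPACE_RE = re.compile(r"\s+")


def clean(text):
    """Strip tags and emoji, then collapse runs of whitespace."""
    text = TAG_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(texts, keywords, max_chars=128):
    """Reject invalid texts, clean the rest, and chunk texts longer than max_chars."""
    seen, chunks = set(), []
    for raw in texts:
        if not raw:                                 # missing value or empty field
            continue
        text = clean(raw)
        if not text or text in seen:                # empty after cleaning, or duplicate
            continue
        if not any(k in text for k in keywords):    # unrelated to the topic
            continue
        seen.add(text)
        chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
    return chunks


def pad(token_ids, length, pad_id=0):
    """Pad or truncate a token id sequence to a standard length."""
    return (token_ids + [pad_id] * length)[:length]
```

Chunking by character count is the simplest choice; splitting at sentence boundaries would keep each chunk more coherent.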
The main task of emotion analysis is to identify the emotion information in a text. The text emotion classification model constructed here is shown in FIG. 3 and has four layers: in order, a BERT layer, a BiLSTM layer, a fully-connected layer and an output layer. The BERT layer produces the feature representation of the text vector, which is then fed into the BiLSTM layer to extract the emotion features of the text. The BERT model compensates for the BiLSTM's limited ability to attend to textual context, and the classifier finally classifies the extracted features.
In this embodiment, a BERT pre-trained model is used to process the text and obtain a comprehensive word vector representation; the BERT structure is shown in FIG. 4. In the BERT model, the input vector is the sum of three different vectors. The first is the word static vector (token embedding), which can be obtained with the Word2vec technique. The second is the position vector (position embedding), which embeds and retains the relative or absolute position of the word in the corpus. The third is the sentence vector (segment embedding); because the input here is a single sentence, only one sentence vector is used. A [CLS] flag vector is added to each sentence to reflect the information of the whole sentence, and an end-of-sentence flag vector [SEP] is added to separate two sentences of a text in subsequent work.
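The way the three vectors are accumulated can be illustrated with a small sketch. The toy embedding tables, the two-dimensional vectors and the function names `build_inputs` and `embed` are assumptions for illustration only; in a real BERT model the tables are learned parameters with 768 dimensions.

```python
def build_inputs(sent_a, sent_b=None):
    """Add [CLS]/[SEP] markers and derive segment and position indices."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"]
    segments = [0] * len(tokens)
    if sent_b:
        tokens += sent_b + ["[SEP]"]
        segments += [1] * (len(sent_b) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions


def embed(tokens, segments, positions, tok_table, seg_table, pos_table):
    """Input vector = token vector + segment vector + position vector, element-wise."""
    return [
        [t + s + p for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[pos])]
        for tok, seg, pos in zip(tokens, segments, positions)
    ]


# Toy tables (assumed values) so the arithmetic is easy to follow.
tokens, segments, positions = build_inputs(["keep", "secrets"], ["always"])
tok_table = {tok: [1, 2] for tok in tokens}
seg_table = [[10, 10], [20, 20]]
pos_table = [[i, 0] for i in range(len(tokens))]
E = embed(tokens, segments, positions, tok_table, seg_table, pos_table)
```

With a single input sentence, as in the embodiment, every token shares segment index 0.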
The BERT model is built mainly from the encoder structure of the bidirectional Transformer, as shown in FIG. 5, and the self-attention mechanism is the core technique of the BERT encoder. BERT relies on matrix operations: the input vectors are concatenated into a vector matrix $E=\{e_1,e_2,\dots,e_n\}$ and passed to the Transformer, whose output is given by formulas (1) and (2).
$$Q=EW^{Q},\quad K=EW^{K},\quad V=EW^{V} \tag{1}$$
$$\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_K}}\right)V \tag{2}$$
Here the Softmax function normalizes each row vector of $QK^{T}/\sqrt{d_K}$, which yields the importance of each word to every other word. $Q$ is the query matrix, and $K$ and $V$ are word vector matrices. The penalty factor $\sqrt{d_K}$ prevents the inner product $QK^{T}$ from becoming too large, where $d_K$ is the vector dimension. $W^{Q}$, $W^{K}$ and $W^{V}$ are linear transformation matrices.
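A single self-attention head as described above can be sketched in plain Python. The helper names and the tiny 2x2 matrices are assumptions chosen so the computation can be followed by hand; real models use optimized tensor libraries and learned projection matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(E, WQ, WK, WV):
    """Project E to Q, K, V, then return Softmax(Q K^T / sqrt(d_K)) V and the weights."""
    Q, K, V = matmul(E, WQ), matmul(E, WK), matmul(E, WV)
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(identity, identity, identity, identity)
```

Each row of `weights` sums to 1 and expresses how strongly one token attends to every token in the sequence.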
After computing the output of the self-attention mechanism, a multi-headed attention mechanism output, denoted X, may be obtained.
The output vector matrix of the last layer of the BERT model is denoted $T=\{t_1,t_2,\dots,t_n\}$. $T$ has the same dimensions as the input matrix $E$; each $t_i$ is a deep representation vector of the corresponding token, and $T$ serves as the input of the BiLSTM network.
The bidirectional encoder structure means that when the model processes a given word, it can draw on the semantics of the other words in the context through their semantic relations. In addition, BERT masks words with the mask mechanism and, at the sentence level, learns the sequential relationship between sentences through next sentence prediction.
In this embodiment, the open-source Google BERT Chinese pre-trained model "BERT-base" may be used. It uses a 12-layer Transformer network structure with 12 attention heads, outputs 768-dimensional vectors, and takes sequences of at most 128 tokens, with shorter sequences padded to that length. The basic Transformer structure consists of a multi-head self-attention mechanism and a fully-connected feed-forward network. The data first pass through the multi-head attention layer to obtain weighted feature vectors, which are then sent to the fully-connected feed-forward layer; features are extracted by the bidirectional encoder, and the resulting word-level vector representation of the text is passed to the BiLSTM as input for the subsequent training task.
BiLSTM emotion feature extraction layer: an LSTM replaces the node of a conventional RNN model with a special structure (the cell). A BiLSTM superimposes a forward LSTM and a backward LSTM, that is, two LSTMs running in opposite directions, and is therefore better suited to capturing bidirectional semantics. Superimposing the two yields both the forward and the backward semantic information of the text, and the output is determined jointly by the states of the two LSTMs, as shown in FIG. 6.
After the vector is input by the input layer, the bidirectional LSTM model respectively performs forward and backward calculation, and as shown in the following formula, the updating formula of the LSTM from front to back is as follows:
the back-to-front formula is:
The output formula after superposition is:
Where y2 is the output of Bi-LSTM after n times, W is the weight of the network, H is the bias, and H is the number of hidden units.
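For orientation, one commonly used way of writing the forward pass, the backward pass and the combined output of a BiLSTM is given below. It is a standard textbook formulation and not necessarily the notation of this embodiment: the symbols $x_t$, $\overrightarrow{h}_t$, $\overleftarrow{h}_t$, $W_{\overrightarrow{h}}$, $W_{\overleftarrow{h}}$ and $b$ are assumptions introduced here.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}\left(x_t,\ \overrightarrow{h}_{t-1}\right) \\
\overleftarrow{h}_t  &= \mathrm{LSTM}\left(x_t,\ \overleftarrow{h}_{t+1}\right) \\
y_t &= W_{\overrightarrow{h}}\,\overrightarrow{h}_t + W_{\overleftarrow{h}}\,\overleftarrow{h}_t + b
\end{aligned}
```

In this form $x_t$ is the input vector at step $t$, the two hidden states carry the forward and backward context, the $W$ matrices are network weights and $b$ is a bias term.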
In the embodiment, the number of hidden units in the first layer is 128, the number of hidden units in the second layer is 96, the output dimension of the first layer of the full-connection layer is 32 dimensions, and the output dimension of the second layer is 2 dimensions. In training the model, the model parameters are updated using the back propagation mechanism using cross entropy as a loss function.
The BiLSTM network outputs a sequence containing bidirectional hidden states. The fully-connected layer transforms this multi-dimensional feature vector into a low-dimensional feature vector and passes it to the output layer. The main task of the output layer is to classify the text by degree and type of emotion, and thereby evaluate the confidentiality consciousness expressed in the text. The output layer receives the BiLSTM output after dimension reduction by the fully-connected layer, uses a Softmax function as the classifier to normalize the feature vector into a probability for each emotion category, and selects the category with the highest probability as the final output of the model. In this embodiment, texts are divided into three categories: higher, general and lower confidentiality consciousness. Specifically, the trained model first classifies a text as positive, general or negative, and the text is then further classified according to particular vocabulary (e.g., "confidential", "secret") that appears in it.
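The final classification step can be sketched as below. The category names follow the three categories named in the embodiment, while the example scores and the function names are assumptions for illustration.

```python
import math

# Category order is an assumption; the names follow the three categories in the text.
CATEGORIES = ["higher", "general", "lower"]


def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most probable category together with all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs


label, probs = classify([2.0, 0.5, -1.0])
```

Keeping the full probability vector alongside the label is useful later, when the visualization platform reports how confident each prediction was.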
The training of the emotion analysis and judgment model by using the text word vector as sample data can be designed to comprise:
Dividing the text word vector into a training sample set and a test sample set according to a specified proportion;
And performing iterative training on the emotion analysis and judgment model based on a preset model loss function by utilizing text word vectors in a training sample set, and performing performance evaluation, adjustment and optimization on the trained emotion analysis and judgment model by utilizing a test sample set so as to obtain the emotion analysis and judgment model with the model training effect meeting the expected requirement.
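A minimal sketch of the sample split and the evaluation quantities follows. The 80/20 ratio, the fixed random seed and the helper names are assumptions; the embodiment only states that the split uses a specified proportion and that cross entropy serves as the loss function.

```python
import math
import random


def split(samples, train_ratio=0.8, seed=42):
    """Shuffle the samples and split them into training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def cross_entropy(probs, target):
    """Cross-entropy loss of one sample, given predicted probabilities and the true class index."""
    return -math.log(max(probs[target], 1e-12))


def accuracy(predictions, targets):
    """Fraction of test samples whose predicted class matches the label."""
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


train_set, test_set = split(list(range(10)))
```

In an actual training loop the loss would be averaged over each batch and minimized by back propagation, with accuracy on the test set guiding further adjustment.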
Visual display of the prediction results can be designed to include: carrying out statistical analysis and visual display, on a visual display platform, of the confidentiality-consciousness emotion tendency categories predicted for a plurality of text data, wherein the visual display platform adopts a B/S architecture, and the statistical analysis comprises statistical analysis of text structure and statistical analysis of the confidentiality-consciousness emotion tendency categories.
The task of the visualization technology is to perform certain statistical analysis and visual display on the evaluation result so as to facilitate the user to view and understand. The visual display platform adopts a B/S architecture, the server is composed of emotion analysis servers, the front end is based on JavaScript design, data visualization is realized by adopting Echart, input text structures such as average sentence length distribution, keyword frequency and the like are mainly displayed, meanwhile, on the basis of further data analysis on security consciousness research and judgment classification results, information duty pie charts, bar charts and the like with different security consciousness degrees are provided, and the security consciousness level of personnel is intuitively displayed.
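The statistics fed to the charts can be sketched on the server side as below. The function names, the sample inputs and the use of characters per text as the length measure are assumptions for illustration; the list of `name`/`value` dictionaries follows the data shape commonly passed to an ECharts pie series.

```python
from collections import Counter


def category_share(predictions):
    """Build pie-chart data: one {"name", "value"} entry per predicted category."""
    counts = Counter(predictions)
    return [{"name": name, "value": counts[name]} for name in sorted(counts)]


def average_text_length(texts):
    """Mean number of characters per text; 0.0 when there are no texts."""
    return sum(len(t) for t in texts) / len(texts) if texts else 0.0


def keyword_frequency(texts, keywords):
    """Count how often each keyword occurs across all texts."""
    return {k: sum(t.count(k) for t in texts) for k in keywords}


# Hypothetical inputs standing in for model predictions and collected texts.
pie_data = category_share(["positive", "negative", "positive", "neutral"])
mean_len = average_text_length(["abcd", "ab"])
freq = keyword_frequency(["secret secret", "no secret"], ["secret", "leak"])
print(pie_data, mean_len, freq)
```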
Further, based on the above method, an embodiment of the invention also provides a target personnel security consciousness assessment system based on multivariate emotion analysis, which comprises a corpus construction module, a model training module and a target research and judgment module, wherein:
the corpus construction module is used for constructing a text corpus in the security field and preprocessing its data to optimize the text data in the corpus and convert the texts into text word vectors, wherein each text word vector is obtained by summing a static word vector, a word position vector and a sentence vector;
the model training module is used for constructing an emotion analysis and judgment model and training it with the text word vectors as sample data, wherein the model comprises a BERT layer for mining text vectors based on text context semantics and the sequential relationship between sentences, a BiLSTM network for extracting bidirectional text features from the text vectors, and a fully connected output layer for classifying and outputting the text features;
the target research and judgment module is used for collecting texts related to confidentiality topics from the target population, predicting the security consciousness emotion tendency type of the target population with the trained emotion analysis and judgment model, and visually displaying the prediction results.
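The text word vector handled by the corpus construction module is described as the sum of three vectors. A minimal sketch of that element-wise sum is given below; the function name and the two-dimensional example vectors are assumptions for illustration, and real embeddings would have far more dimensions.

```python
def sum_embeddings(word_vec, position_vec, sentence_vec):
    """Element-wise sum of the static word, position and sentence vectors."""
    if not (len(word_vec) == len(position_vec) == len(sentence_vec)):
        raise ValueError("all three vectors must have the same dimension")
    return [w + p + s for w, p, s in zip(word_vec, position_vec, sentence_vec)]


# Hypothetical 2-dimensional vectors for a single token.
token_input = sum_embeddings([1.0, 2.0], [0.5, 0.5], [0.0, 1.0])
print(token_input)
```

Because the three vectors are added rather than concatenated, they must share one dimension, and the result keeps that dimension.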
As shown in fig. 7, the overall algorithm flow is divided into four parts: text extraction, model training, analysis and judgment, and result display. Text on specific confidentiality topics is extracted by means of data crawling, data construction and the like from social network platforms such as microblogs and knowledgeable networks and from other network text sources. The preprocessed text is divided into a training set and a test set according to a certain proportion; the training set is used to train the BERT-BiLSTM emotion analysis and judgment model, and the test set is used for model performance evaluation, adjustment and optimization. The preprocessed training data are input into the emotion analysis and judgment model and pass through the BERT layer, the BiLSTM layer and the fully connected layer in sequence: data preprocessing is completed in the BERT layer, the BERT layer output is used as the BiLSTM network input to extract bidirectional text information, and the parameters are then iterated repeatedly according to the loss function to optimize the model until the training effect meets the requirement. Unlabeled text in the personnel confidentiality topic field, such as comments on leak incidents or evaluations of the confidentiality system, is input into the emotion analysis and judgment model for analysis; security consciousness is classified by emotion degree, and the personnel security consciousness level is output. The results are visualized with ECharts, and the security consciousness levels reflected by the data are analyzed further: first the structure of the text, then the proportion of each security consciousness level, and so on, so that the research and judgment results are presented intuitively to the user. If a single piece of data is predicted, the classification result for that text is obtained.
As shown in fig. 8, if there are multiple pieces of data, the data are further analyzed and processed, and the result is displayed.
To verify the validity of this scheme, it is further explained below in connection with experimental data:
The experiment uses Python to collect comments, news and other content on topics such as confidentiality, leaks and information security from websites such as Douyin, Kuaishou, learning, Weibo, serging, confidentiality and the like; on this basis, a confidentiality topic corpus is formed through data expansion and fusion with a self-built data set.
The experiment includes several preprocessing steps: removing invalid characters from the experimental data, labelling emotion polarity (label 0 represents neutral, label 1 represents negative, label 2 represents positive), deleting emoji characters, deleting stop words, normalizing words and splitting the data. From the preprocessed data, 1511 items are selected as the data set of the experiment: 535 labelled neutral, 426 labelled negative and 350 labelled positive. The data for each label are then divided into training, validation and test sets at a ratio of about 10:1:1, giving 1311 training items, 100 validation items and 100 test items; part of the data is shown in fig. 9. Finally, the test set is used to verify the function of the module. To measure and analyze model performance, several toolkits for building the workflow and several statistical evaluation metrics are used.
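Some of the preprocessing steps listed above (emoji removal, stop-word deletion, normalization) can be sketched with regular expressions. This is a simplified illustration, not the experiment's code: the emoji ranges are rough, the stop-word list is hypothetical, and whitespace tokenization with an English example is used only for readability, whereas the experimental texts are Chinese and would need a proper word segmenter.

```python
import re

# Rough emoji and symbol ranges; an assumption, not a complete Unicode emoji list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Hypothetical stop-word list for the English example below.
STOP_WORDS = {"the", "a", "of"}


def clean_text(text):
    """Remove emoji, collapse whitespace, lowercase and drop stop words."""
    text = EMOJI_RE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    tokens = [w for w in text.split(" ") if w and w not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_text("Keep  the secret \U0001F600 safe")
print(cleaned)
```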
All experiments were conducted in the same configuration, using a Legion Y7000,7000-1060 notebook computer with an i7 processor and 8 GB of memory; the specific software and hardware configuration is shown in Table 3:
TABLE 3 experiment operating Environment
Several mathematical and statistical tools were used for the experiments. After the mathematical model was built, it was implemented in the Python programming language; Python 3.6 is the coding tool used for this experiment. Scikit-learn is used to put the data into a form compatible with the algorithms. Keras is used to help build the BiLSTM model, and it also plays the most critical role in integrating BiLSTM with the artificial neural network. The experimental parameters are set in two parts, those of the BERT model and those of the BiLSTM model, as shown in Table 4.
TABLE 4 model parameters
Testing is performed on the self-built data set. By establishing suitable evaluation indexes and comparing against other models, the test results show that the scheme has higher accuracy and practicability.
The training results of the model are evaluated mainly with four indexes: accuracy A (Accuracy), precision P (Precision), recall R (Recall) and F1 score (F-score).
These indexes are computed from the confusion matrix (Confusion Matrix) of the classification:
TABLE 5 confusion matrix
In the confusion matrix, TP is the number of samples whose original label is positive and which are classified as positive, i.e. the correctly predicted positive samples; FN is the number of samples whose original label is positive but which are classified as negative, i.e. the misclassified positive samples; FP is the number of samples whose original label is negative but which are classified as positive, i.e. the misclassified negative samples; TN is the number of samples whose original label is negative and which are classified as negative, i.e. the correctly predicted negative samples.
Accuracy is the proportion of correctly classified samples, both positive and negative, in the total number of samples after model classification. It is calculated as:
$$A = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision and recall (Precision-Recall) are typically used together: precision is defined with respect to the predictions, and recall with respect to the actual samples. Precision is the proportion of truly positive samples among the samples predicted to be positive:
$$P = \frac{TP}{TP + FP}$$
Recall is the proportion of samples predicted to be positive among the samples that are truly positive:
$$R = \frac{TP}{TP + FN}$$
The F1 score combines precision and recall into a single index, namely their harmonic mean:
$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
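The four indexes can be computed directly from the confusion matrix counts using their standard definitions. The sketch below assumes the binary case; the function name and the example counts are hypothetical and are not taken from the experiment.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=40, fn=10, fp=10, tn=40)
print(metrics)
```

For a three-class task such as this one, a common approach is to compute these values per class, treating that class as positive and the others as negative, and then average them; the source does not state which averaging was used.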
In addition, other models are set up and trained for further analysis of the experimental results. Three different models, TextCNN, LSTM and BERT-BiLSTM, are trained on the same data set and compared on the emotion analysis task, in order to verify the strong text representation capability of BERT-BiLSTM in this scheme.
Second, the model constructed in this scheme is compared with other proposed emotion analysis models to verify whether it improves emotion classification accuracy over traditional deep learning models.
The experiment uses 1311 training items, and 100 items each for the test set and the validation set. During model training, the Iter parameter represents the number of iterations and the Epoch parameter represents the number of training rounds, which is set to 3. The loss value is reported every 10 iterations; the model improved 6 times, each marked with an asterisk, and training took 17 minutes in total. As a control experiment, the same data set was trained with the classical FastText model, in which the Embedding layer was randomly initialized, dropout was set to 0.5, the number of training rounds was 20, and training took 44 minutes in total.
The results of emotion analysis for the different pre-trained models are shown in table 6.
TABLE 6 statistics of experimental test results
Evaluation index | Accuracy (%) | Recall (%) | F1-score (%)
FastText | 80.63 | 79.44 | 79.84
BERT-BiLSTM | 95.06 | 95.56 | 95.19
In the comparative experiment, the FastText and BERT-BiLSTM models were trained and their results compared. The results show that, compared with FastText, BERT-BiLSTM improves emotion classification accuracy by 14.43 percentage points, recall by 16.12 percentage points and F1 score by 15.35 percentage points. In the optimization process, BERT-BiLSTM was trained for only epoch=3 rounds while FastText was trained for epoch=20 rounds; although FastText trains faster than the model in this scheme for the same number of rounds, its accuracy is relatively lower. BERT-BiLSTM selects features using the context information of each word and dynamically adjusts the word vectors as the context changes, which shows, on the combined indexes, that the BERT pre-trained model is more beneficial to the model's extraction of text information.
When new data are given as model input, the model predicts their emotion classification. A single piece of data, "must be kept secret, without leakage", is classified as 2 (positive). For the list of multiple pieces of data {"to announce the matter", "the matter must be kept secret, no word of it may leak out", "the matter must be kept secret, no other person may know", "the information must carry a risk of leakage", "be vigilant against leaks, guard against leaks and prevent leaks"}, the classification result is {1, 2, 2, 1, 0}, namely {negative, positive, positive, negative, neutral}. The result shows that the model in this scheme has good prediction capability for new texts.
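Turning the numeric predictions into category names only needs the label mapping used in the experiment (0 neutral, 1 negative, 2 positive). The sketch below applies it to the result list reported above; the function name `decode` is an assumption for illustration.

```python
# Label mapping as stated for the experiment's data set.
LABEL_NAMES = {0: "neutral", 1: "negative", 2: "positive"}


def decode(label_ids):
    """Map numeric class labels to their category names."""
    return [LABEL_NAMES[i] for i in label_ids]


decoded = decode([1, 2, 2, 1, 0])
print(decoded)
```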
The experimental data show that the BERT-BiLSTM pre-trained bidirectional neural network model in this scheme makes much better use of context than traditional text classification models, and that it achieves better results for emotion analysis in the security field.
The relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The elements and method steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or a combination thereof, and the elements and steps of the examples have been generally described in terms of functionality in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different methods for each particular application, but such implementation is not considered to be beyond the scope of the present invention.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the methods described above may be implemented by a program that instructs associated hardware, and the program may be stored on a computer readable storage medium such as a read-only memory, a magnetic or optical disk, etc. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits, and accordingly, each module/unit in the above embodiments may be implemented in hardware or may be implemented in a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
It should be noted that the foregoing embodiments are merely illustrative of the present invention and are not restrictive, and the scope of the invention is not limited to them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed by the invention, the technical solutions described in the embodiments may still be modified, variations may be readily conceived, or some of the technical features may be equivalently substituted; such modifications, variations or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

Translated from Chinese
1. A method for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis, characterized by comprising:
constructing a confidentiality-domain text corpus and performing data preprocessing on it to optimize the text data in the corpus and convert the text into text word vectors, wherein each text word vector is obtained by summing a static word vector, a word position vector and a sentence vector;
constructing a sentiment analysis model and training it using the text word vectors as sample data, wherein the sentiment analysis model comprises a BERT layer for mining text vectors based on text context semantics and the sequential relationship between sentences, a BiLSTM network for bidirectional text feature extraction from the text vectors, and a fully connected output layer for classifying and outputting the text features;
collecting texts related to confidentiality topics from the target population, using the trained sentiment analysis model to predict the target population's confidentiality awareness sentiment tendency type, and visually displaying the prediction results.

2. The method for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis according to claim 1, characterized in that constructing the confidentiality-domain text corpus comprises:
collecting texts related to confidentiality topics, wherein the texts include social network comments related to confidentiality topics, news related to confidentiality topics, and self-built datasets on confidentiality topics;
labeling the confidentiality awareness sentiment tendency type of the texts related to confidentiality topics, and identifying the entities in those texts through analysis;
expanding the text data related to confidentiality topics by replacing entities, synonyms and antonyms, and constructing the confidentiality topic-related text based on the expanded text data.

3. The method for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis according to claim 2, characterized in that expanding the text data related to confidentiality topics by replacing entities, synonyms and antonyms comprises:
finding relevant replacement entities within the corresponding entity category of the text related to confidentiality topics, and generating new text data;
performing part-of-speech analysis on the text related to confidentiality topics, finding words of replaceable parts of speech in it, replacing them with their synonyms and/or antonyms, and producing the corresponding new annotation labels, so as to generate new text data.

4. The method for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis according to claim 1, characterized in that performing data preprocessing on the confidentiality-domain text corpus comprises:
screening the texts in the corpus, filtering out invalid texts and removing redundant characters from the text data, wherein the invalid texts include texts whose relevant fields are empty and texts unrelated to confidentiality topics, and the redundant characters include missing values, repeated values and emoticons;
cleaning the text data by regular expression matching to remove meaningless characters;
splitting, by data chunking, long texts and large files that exceed a threshold into a specified number of short texts and small files;
segmenting the text data with WordPiece, masking the segmentation results, adding a classification token at the beginning of each token sequence and using the last-layer output corresponding to the classification token to represent the whole sequence, inserting a separator token between sentences within the same sequence, and adding to each token an embedding vector indicating its position;
converting token sequences of varying lengths into word vectors of a standard length.

5. The method for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis according to claim 1, characterized in that the fully connected output layer comprises: a fully connected layer for transforming a multi-dimensional feature vector into a specified low-dimensional feature vector, and an output layer for classifying the low-dimensional feature vector according to the degree and type of emotion and determining the confidentiality awareness sentiment tendency category; the output layer uses a softmax function as the classifier, computes with it the probability value corresponding to each confidentiality awareness sentiment tendency category, and selects the category with the largest probability value as the final output category.

6. The method for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis according to claim 1 or 5, characterized in that training the sentiment analysis model using the text word vectors as sample data comprises:
dividing the text word vectors into a training sample set and a test sample set according to a specified ratio;
iteratively training the sentiment analysis model on the text word vectors in the training sample set based on a preset model loss function, and using the test sample set to evaluate, adjust and optimize the trained model, so as to obtain a sentiment analysis model whose training effect meets the expected requirements.

7. The method for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis according to claim 1, characterized in that visually displaying the prediction results comprises:
using a visualization platform to statistically analyze and visually display the confidentiality awareness sentiment tendency category predictions reflected by multiple pieces of text data, wherein the visualization platform adopts a B/S architecture, and the statistical analysis includes statistical analysis of text structure and statistical analysis of the confidentiality awareness sentiment tendency categories.

8. A system for evaluating target personnel's confidentiality awareness based on multivariate sentiment analysis, characterized by comprising a corpus construction module, a model training module and a target analysis module, wherein:
the corpus construction module is used to construct a confidentiality-domain text corpus and perform data preprocessing on it to optimize the text data in the corpus and convert the text into text word vectors, wherein each text word vector is obtained by summing a static word vector, a word position vector and a sentence vector;
the model training module is used to construct a sentiment analysis model and train it using the text word vectors as sample data, wherein the sentiment analysis model comprises a BERT layer for mining text vectors based on text context semantics and the sequential relationship between sentences, a BiLSTM network for bidirectional text feature extraction from the text vectors, and a fully connected output layer for classifying and outputting the text features;
the target analysis module is used to collect texts related to confidentiality topics from the target population, use the trained sentiment analysis model to predict the target population's confidentiality awareness sentiment tendency type, and visually display the prediction results.

9. An electronic device, characterized by comprising:
at least one processor, and a memory coupled to the at least one processor;
wherein the memory stores a computer program that can be executed by the at least one processor to implement the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and when the computer program is executed, the method according to any one of claims 1 to 7 can be implemented.
CN202411539107.3A, priority date 2024-10-31, filing date 2024-10-31: Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis. Status: Pending. Publication: CN119719366A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411539107.3A (CN119719366A (en)) | 2024-10-31 | 2024-10-31 | Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411539107.3A (CN119719366A (en)) | 2024-10-31 | 2024-10-31 | Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis

Publications (1)

Publication Number | Publication Date
CN119719366A (en) | 2025-03-28

Family

ID=95079588

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411539107.3A, Pending, CN119719366A (en) | Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis | 2024-10-31 | 2024-10-31

Country Status (1)

Country | Link
CN (1) | CN119719366A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119941480A (en) * | 2025-04-07 | 2025-05-06 | 贵阳市金阳建设数据服务有限公司 | Bus full life cycle safety management method and system combined with big data

Similar Documents

Publication | Title
Da | The computational case against computational literary studies
Feuerriegel et al. | Using natural language processing to analyse text data in behavioural science
Salloum et al. | A survey of text mining in social media: Facebook and Twitter perspectives
US11188819B2 (en) | Entity model establishment
Srikanth et al. | [Retracted] Sentiment Analysis on COVID‐19 Twitter Data Streams Using Deep Belief Neural Networks
CN114265936A (en) | A Realization Method of Text Mining for Science and Technology Projects
Majeed et al. | Deep-EmoRU: mining emotions from roman urdu text using deep learning ensemble
US8140464B2 (en) | Hypothesis analysis methods, hypothesis analysis devices, and articles of manufacture
Suleiman et al. | Arabic sentiment analysis using Naïve Bayes and CNN-LSTM
CN119719366A (en) | Target personnel confidentiality consciousness assessment method and system based on multivariate emotion analysis
Wani et al. | CoDeS: A deep learning framework for identifying COVID-caused depression symptoms
Liu et al. | Age inference using a hierarchical attention neural network
Sweidan et al. | Autoregressive feature extraction with topic modeling for aspect-based sentiment analysis of arabic as a low-resource language
Javed et al. | BERT model adoption for sarcasm detection on Twitter data
Yang et al. | Evaluation and assessment of machine learning based user story grouping: A framework and empirical studies
Muslim et al. | Comparison of accuracy between long short-term memory-deep learning and multinomial logistic regression-machine learning in sentiment analysis on twitter
CN114896504B (en) | Recommendation method, recommendation device, electronic device and storage medium
Rybak et al. | Machine learning-enhanced text mining as a support tool for research on climate change: theoretical and technical considerations
Alruwais et al. | Modified arithmetic optimization algorithm with Deep Learning based data analytics for depression detection
CN113326348A (en) | Blog quality evaluation method and tool
Sangsavate et al. | Experiments of supervised learning and semi-supervised learning in Thai financial news sentiment: a comparative study
Gao et al. | Depression detection in social media using XLNet with topic distributions
Kamalam et al. | A Text-Based Approach for Diagnosing Depression Using Social Media Texts
Ayash et al. | Advancements in feature selection and extraction methods for text mining: a review
Zhang et al. | Text Sentiment Analysis with Event Information

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
