CN117131159A - Method, device, equipment and storage medium for extracting sensitive information - Google Patents

Method, device, equipment and storage medium for extracting sensitive information

Info

Publication number
CN117131159A
Authority
CN
China
Prior art keywords
sensitive information
information
text
extraction
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311108184.9A
Other languages
Chinese (zh)
Inventor
郭大勇
欧阳奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tongban Information Service Co ltd
Original Assignee
Shanghai Tongban Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tongban Information Service Co ltd
Priority to CN202311108184.9A
Publication of CN117131159A
Legal status: Withdrawn (current)

Abstract

The application discloses a method, a device, equipment and a storage medium for extracting sensitive information in the technical field of information processing, wherein the method comprises the following steps: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with the sensitive information it contains; training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information; inputting a text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result, so as to realize the extraction of the sensitive information. According to the method, after the model is trained with text data marked with sensitive information, the trained model can be used to extract sensitive information; the method is therefore applicable to different scenes or requirements, has universality, does not require a program to be written before each extraction, and can effectively improve the extraction efficiency of sensitive information.

Description

Method, device, equipment and storage medium for extracting sensitive information
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting sensitive information.
Background
With the rapid development of information technology, the files that related institutions need to inspect have become increasingly numerous, and a large amount of sensitive information accumulates continuously; with the development of society, data desensitization has become increasingly important.
For the extraction of sensitive information, programmers currently write scattered programs to process files for specific scenes or specific requirements; the variability is relatively large and no fixed standard has been formed. Moreover, every time sensitive information is to be extracted, a program must first be written and the files can only be processed based on the written program, so the extraction efficiency of sensitive information is low.
Disclosure of Invention
The application aims to provide a method, a device, equipment and a storage medium for extracting sensitive information, which can be suitable for different scenes or requirements, have universality and can effectively improve the extraction efficiency of the sensitive information.
In order to achieve the above object, the present application provides the following technical solutions:
a method of extracting sensitive information, comprising:
acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with sensitive information contained in the labeling text;
training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information;
inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
Preferably, before training the preset model by using the data set, the method further includes:
determining sensitive information marked in a marked text as appointed information;
determining other sensitive information with similarity reaching a similarity threshold value with the specified information, and replacing the specified information by the other sensitive information to obtain a new labeling text;
and adding the obtained new labeling text into the data set.
Preferably, before determining other sensitive information of which the similarity with the specified information reaches the similarity threshold, the method further comprises:
and respectively acquiring word vectors of the specified information and any other sensitive information, and calculating the similarity between the two word vectors as the similarity between the two corresponding information.
Preferably, before training the preset model by using the data set, the method further includes:
and replacing a random named entity in the random labeling text by using other words with the same properties for any labeling text in the data set, and adding the new labeling text obtained after replacement into the data set.
Preferably, before inputting the text to be extracted without marked sensitive information into the information extraction model, the method further comprises:
acquiring a target file and judging whether the target file is a document in a preset format;
if yes, reading the target file according to the line, and taking the text read each time as the text to be extracted respectively; and if not, filtering the target file.
Preferably, after determining the sensitive information in the text to be extracted based on the extraction result, the method further includes:
and counting the quantity, the number of rows and the proportion of the sensitive information in the target file, and outputting the counting result in a structuring way.
Preferably, the preset model is a BERT+BiLSTM+CRF model.
An apparatus for extracting sensitive information, comprising:
an acquisition module for: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with sensitive information contained in the labeling text;
training module for: training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information;
an extraction module for: inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
An apparatus for extracting sensitive information, the apparatus comprising a memory and a processor, wherein a program is stored on the memory which, when executed by the processor, performs the steps of the method of extracting sensitive information described above.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of extracting sensitive information described above.
The application provides a method, a device, equipment and a storage medium for extracting sensitive information, wherein the method comprises the following steps: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with the sensitive information it contains; training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information; inputting a text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result, so as to realize the extraction of the sensitive information. According to the technical scheme, a model is trained with text data marked with sensitive information to obtain an extraction model for extracting sensitive information; when needed, unlabeled text is input directly into the extraction model, and the sensitive information in the unlabeled text is extracted based on the result output by the extraction model, achieving effective extraction of the sensitive information. Therefore, after the model is trained with text data marked with sensitive information, the trained model can be used to extract sensitive information. Unlike the prior art, in which programmers write scattered programs for specific scenes or specific requirements, the method is suitable for different scenes or requirements and has universality; and unlike the prior art, in which a program must be written before each extraction, the method can be used directly to extract sensitive information, thereby effectively improving the extraction efficiency of sensitive information.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for extracting sensitive information according to an embodiment of the present application;
FIG. 2 is a diagram of a model structure in a method for extracting sensitive information according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating preprocessing in a method for extracting sensitive information according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a device for extracting sensitive information according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, a flowchart of a method for extracting sensitive information according to an embodiment of the present application may specifically include:
s11: and acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with the sensitive information contained in the labeling text.
The sensitive information may be sensitive words determined according to actual needs, such as position information, certificate information and the like. The text in the embodiment of the application may be Word text, and the corresponding data set is then a Word data set; other formats such as txt text may also be used according to actual needs. Word text is taken as the example in the following description.
The embodiment of the application can label the Chinese, English, Pinyin, sentences and the like of the sensitive information in the text by adopting BIO labeling, so that labeled text marked with the sensitive information it contains can be obtained. In particular, sentences or phrases for model training may be prepared, each containing a feature ner_label which represents the sensitive information entities in the sentence or phrase, i.e. the BIO-tagged entities; the sentences or phrases marked with the sensitive information they contain are the labeled texts, so that a data set containing a plurality of labeled texts is obtained. In the BIO notation, B represents the beginning (Begin) of a sensitive information entity, I represents the interior (Inside) of a sensitive information entity, and O represents a non-entity part (Outside). For example, for the sentence "我居住在朝阳区" ("I live in Chaoyang District"), where "朝阳区" (Chaoyang District) is the sensitive information, the BIO labeling is: O O O O B I I.
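As an illustration only, the character-level BIO labeling described above may be sketched as follows; the helper name and the (start, end) span format are assumptions for the example and are not defined by the application:

# Minimal sketch: character-level BIO tagging of one annotated sentence.
def bio_tags(sentence, spans):
    """Return one BIO tag per character; each span is (start, end) with end exclusive."""
    tags = ["O"] * len(sentence)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

sentence = "我居住在朝阳区"                      # "I live in Chaoyang District"
print("".join(bio_tags(sentence, [(4, 7)])))     # OOOOBII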
S12: training a preset model by using the data set to obtain an information extraction model for extracting sensitive information.
Training the preset model by using the data set in step S11, an information extraction model for extracting the sensitive information may be obtained. The preset model may be a model selected in advance according to actual needs, such as a neural network model, a self-encoder model, and the like, which is not limited herein.
S13: inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
After the information extraction model for extracting sensitive information is obtained, if the sensitive information in a certain text needs to be extracted, the text can be input into the information extraction model, which outputs a result marking the sensitive information in the text; the sensitive information in the text can then be located directly based on this result, so that the sensitive information is rapidly extracted.
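As an illustration only, the predicted tag sequence can be converted back into the extracted sensitive strings roughly as follows; the helper name is an assumption, not part of the application:

# Minimal sketch: collapse a predicted BIO tag sequence into the marked substrings.
def tags_to_spans(text, tags):
    """Return the sensitive substrings marked by a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append(text[start:i])
            start = i
        elif tag == "O" and start is not None:
            spans.append(text[start:i])
            start = None
    if start is not None:
        spans.append(text[start:])
    return spans

print(tags_to_spans("我居住在朝阳区", list("OOOOBII")))   # ['朝阳区']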
According to the technical scheme, the text data training model marked with the sensitive information is utilized to obtain the extraction model for extracting the sensitive information, unlabeled text is directly input into the extraction model when needed, the sensitive information in the unlabeled text is extracted based on the result output by the extraction model, and effective extraction of the sensitive information is achieved. Therefore, after the model is trained by using the text data marked with the sensitive information, the extraction of the sensitive information can be realized by using the model obtained by training, and the method is different from the prior art that programmers write scattered programs aiming at specific scenes or specific requirements, can be suitable for different scenes or requirements, has universality, is different from the prior art that the programs are written before each extraction, and can be directly used for extracting the sensitive information, thereby effectively improving the extraction efficiency of the sensitive information.
After the data set is acquired and before model training is performed by using the data set, the embodiment of the application can also perform data enhancement processing on the data set so as to increase the diversity of training data.
In a specific implementation manner, before training the preset model by using the data set, the method may further include:
determining sensitive information marked in a marked text as appointed information;
determining other sensitive information with similarity reaching a similarity threshold value with the specified information, and replacing the specified information by the other sensitive information to obtain a new labeling text;
and adding the obtained new labeling text into the data set.
Before determining the other sensitive information whose similarity with the specified information reaches the similarity threshold, the method may further include:
and respectively acquiring word vectors of the specified information and any other sensitive information, and calculating the similarity between the two word vectors as the similarity between the two corresponding information.
This embodiment is a data enhancement process based on word vectors. The similarity threshold may be set according to actual needs: if the similarity between two pieces of information reaches the similarity threshold, the two pieces of information are considered highly similar; otherwise, they are considered dissimilar. The BERT vector of a piece of sensitive information may be used as its word vector.
Specifically, all possible sensitive information may be collected in advance, and the similarity between the collected pieces of sensitive information may then be calculated using existing BERT vectors. Pieces of sensitive information with higher similarity are selected as a candidate word set, so that at least one candidate word set is obtained, where the sensitive information contained in a single candidate word set consists of mutually similar sensitive words. Sensitive information that appears both in a candidate word set and in a labeled text is used as the specified information, and the other sensitive information in the candidate word set where the specified information is located is used in turn to replace the specified information in the labeled text, so that new labeled texts are obtained, each containing one of the other sensitive words. After the candidate word set is obtained and the specified information is determined, only the n (for example 3 or 2) other sensitive words with the highest similarity to the specified information in its candidate word set may be used to replace the specified information in the labeled text, so that new labeled texts are obtained, each containing one of those n other sensitive words. In other words, the sensitive information in the labeled text to be expanded is determined as the specified information, the similarity between all possible sensitive information collected in advance and the specified information is calculated, the sensitive information with higher similarity is selected as the candidate word set, and the specified information in the labeled text is replaced in turn by the n most similar sensitive words in the manner described above, so as to obtain the corresponding new labeled texts.
Because the sensitive information used for expanding the data set is highly similar to the sensitive information in the labeled text, the diversity of the training data can be effectively increased without changing the semantics of the labeled text. For example, for the sensitive information "朝阳区" (Chaoyang District) in the sentence "我居住在朝阳区" ("I live in Chaoyang District"), similarity comparison with other words finds that "朝阳" (Chaoyang), "朝阳群众" (Chaoyang residents) and "东城区" (Dongcheng District) are words with higher similarity, so these three words can be used to expand the data set, increasing the diversity of the training data without changing the sentence semantics.
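As an illustration only, this word-vector-based augmentation can be sketched as follows; the embed() callable (for example a BERT embedding lookup), the threshold and top_n are assumptions of the sketch, not values defined by the application:

# Minimal sketch: replace a sensitive word with its most similar candidates.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def augment_by_similarity(text, target, candidates, embed, threshold=0.8, top_n=3):
    """Return new texts in which `target` is replaced by its most similar candidate words."""
    target_vec = embed(target)
    scored = [(word, cosine(target_vec, embed(word))) for word in candidates if word != target]
    similar = [w for w, score in sorted(scored, key=lambda x: -x[1]) if score >= threshold]
    return [text.replace(target, w) for w in similar[:top_n]]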
In another specific implementation manner, before training the preset model by using the data set, the method may further include:
and replacing a random named entity in the random labeling text by using other words with the same properties for any labeling text in the data set, and adding the new labeling text obtained after replacement into the data set.
This particular implementation is a data enhancement process based on random word replacement. Among the named entities in a labeling text, one entity is selected at random, a word is randomly selected from the vocabulary to replace it, the replaced labeling text is generated, and the replaced labeling text is added to the data set so as to expand it. In addition, in order to keep the semantics and logic of the replaced labeling text reasonable, a word with the same property as the selected named entity can be chosen for the replacement: if the selected entity is a number, the replacement word is also a number; if it is a food name, the replacement word is also a food name; and so on.
A replacement probability can also be set, for example a 50% probability of performing the replacement operation. For example, for the sentence "Zhang San bought an apple", the word segmentation result is "Zhang San", "bought", "one", "apple"; a word is randomly selected for replacement, for example "one" is selected and replaced by "two" or "three", so that the data set is expanded and the performance and robustness of the model are improved. Moreover, so that the replaced text remains semantically and logically reasonable, i.e. the replaced word keeps a certain contextual consistency and the text does not become disordered or semantically confused, numbers are mostly selected for replacement.
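As an illustration only, the random-replacement augmentation can be sketched as follows; the same_type_vocab mapping, which supplies replacement words of the same property, is an assumption of the sketch:

# Minimal sketch: with probability `prob`, swap one token for a word of the same type.
import random

def augment_by_random_replace(tokens, same_type_vocab, prob=0.5):
    """Randomly replace one token of a segmented sentence with a same-property word."""
    if random.random() >= prob:
        return list(tokens)
    candidates = [i for i, tok in enumerate(tokens) if same_type_vocab.get(tok)]
    if not candidates:
        return list(tokens)
    i = random.choice(candidates)
    out = list(tokens)
    out[i] = random.choice(same_type_vocab[tokens[i]])
    return out

# Example: ["Zhang San", "bought", "one", "apple"] with {"one": ["two", "three"]}
# may become ["Zhang San", "bought", "two", "apple"].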
The method for extracting sensitive information provided by the embodiment of the application may further include, before inputting the text to be extracted without marked sensitive information into the information extraction model:
acquiring a target file and judging whether the target file is a document in a preset format;
if yes, reading the target file according to the line, and taking the text read each time as the text to be extracted respectively; and if not, filtering the target file.
After determining the sensitive information in the text to be extracted based on the extraction result, the method may further include:
and counting the quantity, the number of rows and the proportion of the sensitive information in the target file, and outputting the counting result in a structuring way.
Before the information extraction model is used to extract sensitive information, a target file may first be acquired, the target file being a file that is not marked with sensitive information and from which sensitive information needs to be extracted; whether the target file is a document in the preset format is then judged. If it is, the target file is read line by line and each text that is read is taken as a text to be extracted; otherwise, the target file is filtered out. Specifically, the target file may be read line by line with python-docx, and the read texts are fed to the information extraction model in batches for prediction, so that the sensitive information contained in the currently input texts is predicted and the corresponding extraction results are output. This ensures that the text input into the information extraction model meets the corresponding format requirement, and thus that information extraction proceeds smoothly.
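As an illustration only, reading a .docx target file line by line can be sketched as follows, assuming the python-docx package; predict() stands in for the trained extraction model and is not an API defined by the application:

# Minimal sketch: yield the non-empty paragraphs of a .docx target file.
from docx import Document

def iter_texts(path):
    if not path.lower().endswith(".docx"):   # filter files that are not in the preset format
        return
    for paragraph in Document(path).paragraphs:
        text = paragraph.text.strip()
        if text:
            yield text

# texts = list(iter_texts("target.docx"))
# results = [predict(text) for text in texts]   # feed the read lines to the model in batches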
In addition, after the extraction result output by the information extraction model is obtained, the quantity of sensitive information in the target file, the lines on which the sensitive information is located and the proportion of the sensitive information in the target file can be counted, so that the sensitive information is processed into structured data and output, which makes it convenient to obtain the situation of the sensitive information in the target file. For example, the structured output may take the following JSON form:
{"checkResult":[{
"fileName":"XXX",
"SensitiveWord":"A",
"wordCount":20,
"wordWeight":0.0003,
"wordLines":[3,5,6]
}]}
wherein checkResult represents the extraction result, fileName represents the file name of the target file, SensitiveWord represents a piece of sensitive information in the target file, wordCount represents the counted number of occurrences of the sensitive information, wordWeight represents the proportion of the sensitive information in the whole text of the target file, and wordLines represents the line numbers on which the sensitive information appears in the text of the target file.
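As an illustration only, the statistics above can be assembled roughly as follows; the helper is assumed, and the wordWeight computation (share of characters) is one possible interpretation of the proportion described above:

# Minimal sketch: build the structured checkResult statistics for one target file.
import json

def build_check_result(file_name, lines, found):
    """`found` maps each sensitive word to the 1-based line numbers it occurs on."""
    total_chars = sum(len(line) for line in lines) or 1
    records = []
    for word, line_nos in found.items():
        count = sum(line.count(word) for line in lines)
        records.append({
            "fileName": file_name,
            "SensitiveWord": word,
            "wordCount": count,
            "wordWeight": round(count * len(word) / total_chars, 4),
            "wordLines": sorted(set(line_nos)),
        })
    return json.dumps({"checkResult": records}, ensure_ascii=False, indent=2)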
In the method for extracting sensitive information provided by the embodiment of the application, the preset model may specifically be a BERT+BiLSTM+CRF model.
In the embodiment of the application, the information extraction model adopts a BERT+BiLSTM+CRF model, which mainly comprises three modules. Firstly, the input text passes through the BERT pre-trained language model to obtain corresponding word vector representations; these word vectors contain context information and can capture the semantics of the words and the context of the sentence. Then, the word vectors obtained by BERT are used as input and further processed by the BiLSTM, which models the input word vectors with a forward LSTM network and a backward LSTM network and generates a hidden-state sequence rich in context information, so that the BiLSTM can capture long-distance dependencies in the input word vectors and provide a more accurate feature representation. Finally, the CRF decodes the output of the BiLSTM and generates the predicted labeling sequence, i.e. the extraction result; this labeling sequence contains a BIO label for each word, indicating whether the word belongs to a sensitive information entity. The CRF module considers the dependencies between labels and models the label sequence by defining global label transition probabilities, so as to better capture the context associations between labels. The whole process of sensitive information entity recognition is thereby completed. The BERT+BiLSTM+CRF model structure is shown in FIG. 2.
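As an illustration only, a BERT+BiLSTM+CRF network of this kind can be sketched as follows, assuming the Hugging Face transformers package and the third-party pytorch-crf package; the checkpoint name, hidden size and tag count are illustrative choices, not values fixed by the application:

# Minimal sketch of a BERT + BiLSTM + CRF tagger for B/I/O labels.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF   # from the third-party `pytorch-crf` package

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags=3, bert_name="bert-base-chinese", lstm_hidden=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)        # contextual word vectors
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)  # emission scores for B/I/O
        self.crf = CRF(num_tags, batch_first=True)              # label-transition modelling

    def forward(self, input_ids, attention_mask, tags=None):
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.bilstm(embeddings)
        emissions = self.classifier(features)
        mask = attention_mask.bool()
        if tags is not None:                                    # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)            # inference: best BIO tag path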
The preprocessing of the BERT data, as shown in fig. 3, may include: adding the [CLS] token at the beginning of the text and the [SEP] token at the end; w1 represents the first word in a single paragraph of the text, w2 represents the second word, and so on. The text is converted into the input sequence and passed through a Multi-Head Attention layer with a residual connection followed by layer normalization, then through a feed-forward neural network, again with a residual connection and layer normalization; finally two vectors are output, output 1 and output 2. The output 1 vector is used to judge whether the text contains sensitive information, and output 2 is fed into the BiLSTM model for named entity recognition, so as to determine the position of the sensitive information.
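As an illustration only, the [CLS]/[SEP] preprocessing can be observed with a BERT tokenizer, assuming the Hugging Face transformers package and the bert-base-chinese checkpoint:

# Minimal sketch: the tokenizer inserts [CLS] at the start and [SEP] at the end.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
ids = tokenizer("我居住在朝阳区")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', '我', '居', '住', '在', '朝', '阳', '区', '[SEP]']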
It should be noted that training the preset model to obtain the information extraction model may include: the input text is encoded by the BERT pre-trained model to generate context-dependent word vectors; the BiLSTM performs forward and backward sequence modelling on the BERT-encoded word vectors to obtain a hidden-state sequence rich in context information; the output of the BiLSTM is taken as the input of the CRF, which classifies the sequence into labels and generates the final named entity label sequence; the model is trained through back propagation and gradient descent, and the model parameters are optimized so that the model achieves good performance on the NER task. The F1-score can be calculated as the model evaluation index, using a multi-classification model (LeNet5) for the evaluation, according to the following formulas:
P = α / B,  R = α / A,  F1 = 2 × P × R / (P + R)
where P represents the precision, R represents the recall, F1 represents the F1-score, α is the number of correctly identified pieces of sensitive information, A is the total number of pieces of sensitive information, and B is the number of pieces of sensitive information identified by the model.
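As an illustration only, the formulas above can be evaluated as follows; the argument names mirror α, A and B:

# Minimal sketch: entity-level precision, recall and F1 as defined above.
def f1_score(num_correct, num_total, num_identified):
    precision = num_correct / num_identified if num_identified else 0.0
    recall = num_correct / num_total if num_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# e.g. 8 correct out of 10 identified, 12 in total: P = 0.8, R ≈ 0.667, F1 ≈ 0.727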
Therefore, the tasks of the information extraction model of this embodiment are fused in the same model: the model has one input and one output, finally completes the recognition of the sensitive information entities in each paragraph of the target file, calculates the F1 score, and finally outputs the structured result for each piece of sensitive information.
In a specific implementation manner, a method for extracting sensitive information provided by the embodiment of the application specifically may include the following steps:
1. Data set preparation: sentences or phrases containing sensitive information are prepared for model training and subjected to data enhancement processing, including word-vector-based data enhancement and random-word-replacement-based data enhancement.
2. The information extraction model is trained using the data set after data enhancement processing.
3. The sensitive information contained in each text is predicted using the information extraction model.
4. Structured output of the data related to the sensitive information is performed based on the extraction result.
The application can rapidly and automatically extract the sensitive information in files in batches, count the number of pieces of sensitive information and their proportion in the whole text, greatly shorten the time needed to detect sensitive words, and improve the detection flexibility.
The embodiment of the application also provides a device for extracting the sensitive information, as shown in fig. 4, which specifically may include:
an acquisition module 11 for: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with sensitive information contained in the labeling text;
training module 12 for: training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information;
an extraction module 13 for: inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a first enhancement module for: before training a preset model by utilizing the data set, determining sensitive information marked in a labeling text as specified information; determining other sensitive information with similarity to the specified information reaching a similarity threshold, and replacing the specified information with the other sensitive information to obtain a new labeling text; and adding the obtained new labeling text into the data set.
The first enhancement module may also be configured to: before determining the other sensitive information whose similarity with the specified information reaches the similarity threshold, respectively obtain the word vectors of the specified information and of any other sensitive information, and calculate the similarity between the two word vectors as the similarity between the two corresponding pieces of information.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a second enhancement module for: before training a preset model by using the data set, replacing, for any labeling text in the data set, a random named entity in the text with another word of the same property, and adding the new labeling text obtained after replacement into the data set.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a reading module for: before inputting a text to be extracted without marked sensitive information into the information extraction model, acquiring a target file, and judging whether the target file is a document in a preset format or not; if yes, reading the target file according to the line, and taking the text read each time as the text to be extracted respectively; and if not, filtering the target file.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a statistics module for: and after the sensitive information in the text to be extracted is determined based on the extraction result, counting the quantity, the number of rows and the proportion of the sensitive information in the target file, and structuring and outputting the counting result.
In the device for extracting sensitive information provided by the embodiment of the application, the preset model may be a BERT+BiLSTM+CRF model.
The embodiment of the application also provides a device for extracting sensitive information, which may comprise a memory and a processor, wherein a program stored on the memory, when run by the processor, can implement the steps of the method of extracting sensitive information described above.
Embodiments of the present application also provide a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method of extracting sensitive information as described in any of the above.
It should be noted that, for the description of the device, the apparatus and the relevant part in the storage medium for extracting the sensitive information provided in the embodiment of the present application, please refer to the detailed description of the corresponding part in the method for extracting the sensitive information provided in the embodiment of the present application, and the detailed description is omitted herein. In addition, the parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of the corresponding technical solutions in the prior art, are not described in detail, so that redundant descriptions are avoided.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

CN202311108184.9A | 2023-08-30 | 2023-08-30 | Method, device, equipment and storage medium for extracting sensitive information | Withdrawn | CN117131159A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311108184.9A | 2023-08-30 | 2023-08-30 | Method, device, equipment and storage medium for extracting sensitive information

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311108184.9A | 2023-08-30 | 2023-08-30 | Method, device, equipment and storage medium for extracting sensitive information

Publications (1)

Publication Number | Publication Date
CN117131159A | 2023-11-28

Family

ID=88858002

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311108184.9A (Withdrawn; published as CN117131159A) | Method, device, equipment and storage medium for extracting sensitive information | 2023-08-30 | 2023-08-30

Country Status (1)

Country | Link
CN (1) | CN117131159A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN118964950A (en) * | 2024-08-29 | 2024-11-15 | 沥泉科技(成都)有限公司 | A sensitive information extraction method and system based on natural language processing


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110008335A (en) * | 2018-12-12 | 2019-07-12 | 阿里巴巴集团控股有限公司 | The method and device of natural language processing
CN114491018A (en) * | 2021-12-23 | 2022-05-13 | 天翼云科技有限公司 | Construction method of sensitive information detection model, and sensitive information detection method and device
CN116150313A (en) * | 2022-08-26 | 2023-05-23 | 马上消费金融股份有限公司 | Data expansion processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑贤茹, et al.: "Recognition of network sensitive words and variant entities based on BERT-BiLSTM-CRF", Computer and Digital Engineering (《计算机与数字工程》), vol. 51, no. 7, 31 July 2023 (2023-07-31), pages 1585-1589 *


Similar Documents

Publication | Publication Date | Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
US10762296B2 (en) Risk address identification method and apparatus, and electronic device
CN108628828B (en) Combined extraction method based on self-attention viewpoint and holder thereof
CN109726274B (en) Question generation method, device and storage medium
Hernault et al. HILDA: A discourse parser using support vector machine classification
CN106095753B (en) A kind of financial field term recognition methods based on comentropy and term confidence level
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN104881458B (en) A kind of mask method and device of Web page subject
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
US10831993B2 (en) Method and apparatus for constructing binary feature dictionary
CN110334186B (en) Data query method and device, computer equipment and computer readable storage medium
CN119988588A (en) A large model-based multimodal document retrieval enhancement generation method
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
CN109857957B (en) Method for establishing label library, electronic equipment and computer storage medium
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN109684473A (en) A kind of automatic bulletin generation method and system
CN117131159A (en) Method, device, equipment and storage medium for extracting sensitive information
CN114330350B (en) Named entity recognition method and device, electronic equipment and storage medium
CN119810846A (en) Intelligent document review and traceability positioning method based on LLM natural language processing
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN118395987A (en) BERT-based landslide hazard assessment named entity identification method of multi-neural network
CN114332476A (en) Method, device, electronic equipment, storage medium and product for identifying dimensional language
CN113688233A (en) Text understanding method for semantic search of knowledge graph
CN119577139B (en) Event association analysis method and system based on large language model

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WW01 | Invention patent application withdrawn after publication
Application publication date: 2023-11-28

