CN117313721A - Document management method and device based on natural language processing technology - Google Patents

Document management method and device based on natural language processing technology

Info

Publication number
CN117313721A
CN117313721A (application CN202311320057.5A)
Authority
CN
China
Prior art keywords
document
target document
word
target
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311320057.5A
Other languages
Chinese (zh)
Inventor
郑诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202311320057.5A
Publication of CN117313721A
Legal status: Pending

Abstract

The application provides a document management method and device based on natural language processing technology, usable in the financial field and other technical fields. The method comprises the following steps: inputting a target document into a trained document classification model to obtain the document type of the target document output by the model; extracting content metadata from the target document using natural language processing technology; generating marking information for the target document according to the document type and the content metadata; and storing the marking information together with the target document in a document library. By applying automation in this way, the document management method and device greatly improve the efficiency and accuracy of document management, save time and cost, reduce manual errors and inconsistencies, and lower management cost.

Description

Document management method and device based on natural language processing technology
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a document management method and device based on a natural language processing technology.
Background
With the digital transformation and informatization of financial services, banks have become centers for managing and processing large numbers of documents. These documents include contracts, reports, approval records, and the like, and must be properly managed to support business decisions and daily operations. Manual management, however, cannot cope with large volumes of unstructured data: it consumes considerable time and effort, is highly subjective, is difficult to keep up to date, and handles complex scenarios poorly.
Disclosure of Invention
To address the problems in the prior art, the embodiments of the present application provide a document management method and a document management device based on natural language processing technology, which can at least partially solve these problems.
In one aspect, the present application proposes a document management method based on a natural language processing technology, including:
inputting a target document into a trained document classification model to obtain the document type of the target document output by the document classification model;
extracting content metadata from the target document by using a natural language processing technology;
generating marking information of the target document according to the document type and the content metadata;
and correspondingly storing the marking information and the target document into a document library.
In some embodiments, the extracting content-type metadata in the target document using natural language processing techniques includes:
inputting the target document into a text vectorization model to obtain vectorization representation of the target document output by the text vectorization model;
acquiring word frequency of each word in the target document;
according to the word frequency of each word, determining candidate words in the words;
obtaining a vectorized representation of each candidate word;
and determining content metadata of the target document according to the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document.
In some embodiments, said determining content-type metadata for said target document based on a similarity between said vectorized representation of each said candidate word and said vectorized representation of said target document comprises:
calculating the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document;
obtaining candidate words corresponding to the vectorized representations of which the similarity between the vectorized representations of the candidate words and the vectorized representations of the target document is greater than a first threshold value;
and determining content metadata of the target document according to the obtained candidate words.
In some embodiments, the determining content-type metadata of the target document according to the obtained candidate word includes:
determining the candidate word with highest similarity between the vectorized representation of the obtained candidate word and the vectorized representation of the target document as content metadata of the target document;
and traversing, in sequence, the remaining candidate words other than those already determined as content metadata; for each such candidate word, calculating the similarity between its vectorized representation and the vectorized representation of each determined item of content metadata of the target document, and determining the candidate word as content metadata of the target document if each of those similarities is smaller than a second threshold value.
In some embodiments, the inputting the target document into the trained document classification model includes:
preprocessing a target document, wherein the preprocessing comprises removing special characters, removing punctuation marks, removing stop words, stemming and/or lemmatization;
Inputting the preprocessed target document into a trained document classification model;
the determining the candidate words in the words according to the word frequency of each word comprises the following steps:
and for each word, if the word frequency of the word is greater than a third threshold value, determining the word as a candidate word.
In some embodiments, the document classification model is trained on a Bert model using classified documents.
In some embodiments, the method further comprises:
acquiring a document query request, wherein the document query request comprises a query field, and the query field comprises a document type and/or document content metadata;
searching a target document with a query field in the marking information in a document library according to the document query request;
and sending the target document.
In another aspect, the present application proposes a document management apparatus based on a natural language processing technique, including:
the input module is used for inputting a target document into the trained document classification model to obtain the document type of the target document output by the document classification model;
the extraction module is used for extracting content metadata from the target document by utilizing a natural language processing technology;
the generation module is used for generating the marking information of the target document according to the document type and the content metadata;
and the storage module is used for correspondingly storing the marking information and the target document into a document library.
In some embodiments, the extraction module is specifically configured to:
inputting the target document into a text vectorization model to obtain vectorization representation of the target document output by the text vectorization model;
acquiring word frequency of each word in the target document;
according to the word frequency of each word, determining candidate words in the words;
obtaining a vectorized representation of each candidate word;
and determining content metadata of the target document according to the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document.
In some embodiments, the extracting module determining content-type metadata of the target document based on a similarity between the vectorized representation of each of the candidate words and the vectorized representation of the target document comprises:
calculating the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document;
Obtaining candidate words corresponding to the vectorized representations of which the similarity between the vectorized representations of the candidate words and the vectorized representations of the target document is greater than a first threshold value;
and determining content metadata of the target document according to the obtained candidate words.
In some embodiments, the extracting module determining content-type metadata of the target document according to the obtained candidate word includes:
determining the candidate word with highest similarity between the vectorized representation of the obtained candidate word and the vectorized representation of the target document as content metadata of the target document;
and traversing, in sequence, the remaining candidate words other than those already determined as content metadata; for each such candidate word, calculating the similarity between its vectorized representation and the vectorized representation of each determined item of content metadata of the target document, and determining the candidate word as content metadata of the target document if each of those similarities is smaller than a second threshold value.
In some embodiments, the input module is specifically configured to:
preprocessing a target document, wherein the preprocessing comprises removing special characters, removing punctuation marks, removing stop words, stemming and/or lemmatization;
inputting the preprocessed target document into a trained document classification model;
the determining the candidate words in the words according to the word frequency of each word comprises the following steps:
and for each word, if the word frequency of the word is greater than a third threshold value, determining the word as a candidate word.
In some embodiments, the document classification model is trained on a Bert model using classified documents.
In some embodiments, the apparatus further comprises:
the acquisition module is used for acquiring a document query request, wherein the document query request comprises a query field, and the query field comprises a document type and/or document content metadata;
the searching module is used for searching the target document with the query field in the marking information in the document library according to the document query request;
and the sending module is used for sending the target document.
The embodiment of the application also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the document management method based on the natural language processing technology according to any one of the embodiments when executing the program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the document management method based on natural language processing technology described in any of the above embodiments.
According to the document management method and device based on natural language processing technology provided by the embodiments of the present application, a target document is input into a trained document classification model to obtain the document type of the target document output by the model; content metadata is extracted from the target document using natural language processing technology; marking information for the target document is generated according to the document type and the content metadata; and the marking information is stored together with the target document in a document library. By applying automation in this way, the efficiency and accuracy of document management can be greatly improved, time and cost are saved, manual errors and inconsistencies are reduced, and management cost is lowered.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art may derive other drawings from them without inventive effort. In the drawings:
Fig. 1 is a schematic flow chart of a document management method based on a natural language processing technology according to an embodiment of the present application.
Fig. 2 is a partial flow diagram of a document management method based on a natural language processing technology according to an embodiment of the present application.
Fig. 3 is a partial flow chart of a document management method based on a natural language processing technology according to an embodiment of the present application.
Fig. 4 is a partial flow diagram of a document management method based on a natural language processing technology according to an embodiment of the present application.
Fig. 5 is a partial flow diagram of a document management method based on a natural language processing technology according to an embodiment of the present application.
Fig. 6 is a partial flow diagram of a document management method based on natural language processing technology according to an embodiment of the present application.
Fig. 7 is a flowchart of a document management method based on a natural language processing technology according to an embodiment of the present application.
Fig. 8 is a schematic flowchart of an algorithm for keyword extraction according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a document management apparatus based on a natural language processing technology according to an embodiment of the present application.
Fig. 10 is a schematic physical structure of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application more apparent, the embodiments are described in further detail below with reference to the accompanying drawings. The illustrative embodiments and their description are presented to explain the application, not to limit it. It should be noted that, where no conflict arises, the embodiments and the features in the embodiments may be combined with each other arbitrarily.
The terms "first," "second," and the like, as used herein, do not denote any particular order or sequence, nor do they limit the application; they serve only to distinguish one element or operation from another described in the same technical terms.
As used herein, the terms "comprising," "including," "having," "containing," and the like are intended to be inclusive and mean an inclusion, but not limited to.
As used herein, "and/or" includes any and all combinations of one or more of the associated listed items.
For a better understanding of the present application, the following detailed description is given of the research background of the present application.
With the continuous expansion of banking scale, banking document management faces more and more problems and challenges. One of the main problems is how to quickly and accurately sort and extract key information in documents to support business decisions and daily operations of banks. To address these issues, banks need to employ more intelligent methods, such as automated document classification and keyword extraction using natural language processing techniques.
Natural language processing is an interdisciplinary field spanning computer science, artificial intelligence, and linguistics that helps banks automatically process and understand human language with computers. Using natural language processing techniques, banks can extract valuable information from documents faster, more accurately, and more intelligently; this information can then support business decisions and daily operations. Automated document classification and keyword extraction are two important application directions of natural language processing technology.
The document classification can automatically classify a large number of documents according to the attributes of the topics, the types and the like, thereby facilitating the management and the retrieval of the documents by banks. The key word extraction can automatically extract important key words and phrases from the document, help the bank to better understand the document content, and can be used for document retrieval and semantic analysis. By utilizing the automatic document classification and keyword extraction technology, the bank can greatly reduce the labor cost, improve the consistency and accuracy of data and provide powerful support for the digital transformation of the bank. Therefore, banks need to pay attention to the application of natural language processing technology, and increasingly explore more advanced document management methods to improve business efficiency and competitiveness.
The execution subject of the document management method based on the natural language processing technology provided in the embodiments of the present application includes, but is not limited to, a computer.
Fig. 1 is a flow chart of a document management method based on a natural language processing technology according to an embodiment of the present application, as shown in fig. 1, where the document management method based on a natural language processing technology according to an embodiment of the present application includes:
s101, inputting a target document into a trained document classification model to obtain the document type of the target document output by the document classification model;
in step S101, text classification is a key step in managing documents. The basic idea is to train a classification model with a machine learning algorithm so that the model can automatically sort documents into different categories. This requires training the classifier on a labeled dataset of documents that have already been assigned to their categories. Once training is complete, the model can be applied to unlabeled documents to classify them automatically. In some embodiments, the document classification model is obtained by training a BERT model on classified documents. Specifically, for text material in bank documents, such as contracts, invoices, and reports, a Chinese BERT model may be used for the classification task. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing model that can be used for text classification tasks. Based on the Transformer architecture, it captures contextual and semantic information in sentences and generates high-dimensional vector representations. In the banking field, BERT may be used to classify documents such as contracts, invoices, and reports. To use BERT for text classification, the labeled dataset must be converted into a format BERT can consume. One common approach is to represent each document as the average of its word embedding vectors, which is then input to a classifier for training. For banks, domain knowledge is needed to select a suitable pre-trained model and classifier to obtain good classification results. By classifying documents automatically, banks can manage and analyze large amounts of text data more effectively and accelerate business processes.
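The classification step can be sketched end to end. This is a minimal illustrative stand-in, not the patent's implementation: a real system would fine-tune a Chinese BERT model and classify from its [CLS] vector, while here `embed` is a toy bag-of-words encoder over a tiny assumed English vocabulary, and hand-built class centroids play the role of the trained classifier.

```python
# Toy sketch of step S101: classify a document by comparing its vector to
# per-class centroid vectors. `embed` stands in for a BERT [CLS] embedding.
from collections import Counter
import math

VOCAB = ["contract", "party", "sign", "invoice", "amount", "tax",
         "report", "quarter", "revenue"]  # illustrative vocabulary

def embed(text):
    # Stand-in for a BERT encoder: bag-of-words counts over VOCAB.
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Centroids would come from the labeled training documents; hard-coded here.
CLASS_CENTROIDS = {
    "contract": embed("contract party sign"),
    "invoice":  embed("invoice amount tax"),
    "report":   embed("report quarter revenue"),
}

def classify(document):
    """Return the document type whose centroid is most similar (S101)."""
    vec = embed(document)
    return max(CLASS_CENTROIDS, key=lambda c: cosine(vec, CLASS_CENTROIDS[c]))

print(classify("the party will sign this contract"))  # → contract
```

In production the centroid lookup would be replaced by the fine-tuned BERT classifier's forward pass, but the input/output contract is the same: document text in, document type out.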
S102, extracting content metadata from a target document by using a natural language processing technology;
in step S102, the content-type metadata is used to describe the content of the target document. Content-type metadata extraction is another key step in automated document management. The method can automatically extract keywords and phrases from a large amount of text data, and can help banks to quickly know document contents and topics. These keywords and phrases can be used as metadata of the documents, facilitating searching and classification of the documents. This step mainly involves identifying and extracting keywords and phrases in the document.
S103, generating marking information of the target document according to the document type and the content type metadata;
in step S103, for a bank, large numbers of bank documents, such as contracts, financial reports, and transaction records, need to be properly categorized, indexed, and archived for later retrieval and management. To achieve these functions, an index system may be built using a full-text search engine, such as Solr, which is based on Apache Lucene. Metadata (e.g., title, author, date) and tags are added to the index by associating them with the document, which improves the retrieval efficiency and accuracy of the document. For document marking, a tag may be added to each document according to its classification and the result of keyword extraction. These tags may also include information such as the document's category, importance level, and relevant departments. Through marking, documents can be better organized and managed.
S104, correspondingly storing the marking information and the target document into a document library.
In step S104, to archive the document, an appropriate storage medium is selected, such as a file system or database. A directory structure or file naming rules are established as required so that documents are archived in an organized way, ensuring secure storage and quick access. Finally, a user interface or API may be provided so that users can retrieve and manage documents by category, keyword, or other conditions: the search function lets a user quickly find the required document and then view, edit, or delete it. An access control mechanism may also be added so that only authorized personnel can access sensitive documents. Through document marking, indexing, archiving, retrieval, and management, the document output module achieves automated processing and efficient management of a bank's large document volumes.
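Steps S103 to S105 can be sketched with a toy document library. An in-memory list stands in for the file system or Solr index the text mentions; the `DocumentLibrary` class and its method names are illustrative, not from the patent.

```python
# Toy sketch of S103-S105: store marking information (document type plus
# content metadata) alongside each document, then answer queries on either.
class DocumentLibrary:
    def __init__(self):
        self._docs = []  # stand-in for a file system / search index

    def store(self, document, doc_type, metadata):
        # S103/S104: generate marking information and store it with the doc.
        self._docs.append({"text": document, "type": doc_type,
                           "metadata": set(metadata)})

    def query(self, doc_type=None, keyword=None):
        # S105: the query field may hold a document type and/or metadata.
        hits = []
        for d in self._docs:
            if doc_type is not None and d["type"] != doc_type:
                continue
            if keyword is not None and keyword not in d["metadata"]:
                continue
            hits.append(d["text"])
        return hits

lib = DocumentLibrary()
lib.store("Loan contract between A and B", "contract", ["loan", "interest"])
lib.store("Q3 revenue report", "report", ["revenue", "loan"])
print(lib.query(doc_type="contract"))        # → ['Loan contract between A and B']
print(lib.query(keyword="loan"))             # both documents match
```

Access control and persistence, which the text also calls for, would wrap this same store/query interface.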
According to the document management method based on natural language processing technology provided by the embodiments of the present application, a target document is input into a trained document classification model to obtain the document type of the target document output by the model; content metadata is extracted from the target document using natural language processing technology; marking information for the target document is generated according to the document type and the content metadata; and the marking information is stored together with the target document in a document library. By applying automation in this way, the efficiency and accuracy of document management can be greatly improved, time and cost are saved, manual errors and inconsistencies are reduced, and management cost is lowered.
As shown in fig. 2, in some embodiments, the extracting content-type metadata in the target document using natural language processing techniques includes:
s1021, inputting the target document into a text vectorization model to obtain vectorization representation of the target document output by the text vectorization model;
s1022, acquiring word frequency of each word in the target document;
s1023, determining candidate words in the words according to word frequency of each word;
s1024, obtaining vectorization representation of each candidate word;
s1025, determining content metadata of the target document according to the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document.
In particular, the BERT model may be used for text vectorization. The BERT model splits text into a sequence of words or subwords, which are then converted into embedding vectors. These embeddings include word embeddings (Word Embeddings) and position embeddings (Position Embeddings), representing the semantic information and position information of the words. In the BERT model, deeper semantic encodings are obtained by stacking multiple Transformer layers: the output of each Transformer layer serves as the input of the next, progressively extracting rich semantic features. In the output of the last Transformer layer, the BERT model typically uses the output at the special [CLS] (classification) token as the representation vector for the entire sentence. In this way, the BERT model translates the input text into a vector representation carrying rich semantic information.
Word frequency statistics: in short, the number of occurrences of each word in the document is counted, and words with higher word frequency are taken as candidate words. For example, in some embodiments, determining the candidate words according to the word frequency of each word comprises: for each word, if its word frequency is greater than a third threshold value, determining the word as a candidate word. The third threshold may be, for example, 10, 15, or 20.
Obtaining vectorized representations of the candidate words and the document through the BERT model: each candidate word is prepared in a format suitable for BERT input, and its vector representation is obtained from the model; the document is prepared and encoded in the same way to obtain its vector representation.
Similarity calculation and content metadata acquisition: by calculating the vector similarity between each candidate word and the document, the candidate words most relevant to the document can be selected more accurately as the extracted keywords. These output keywords are the content metadata of the target document.
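The pipeline of steps S1021 to S1025 can be sketched as follows. To keep the sketch runnable, `embed` is a toy bag-of-words encoder rather than BERT, and the two thresholds (the "third threshold" for word frequency and the "first threshold" for similarity) are illustrative values, not the patent's.

```python
# Runnable sketch of S1021-S1025: frequency-based candidate selection,
# then vector-similarity filtering against the whole-document vector.
from collections import Counter
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text, vocab):
    # Stand-in for the BERT text vectorization model.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def extract_metadata(document, freq_threshold=1, sim_threshold=0.4):
    words = document.lower().split()
    vocab = sorted(set(words))
    doc_vec = embed(document, vocab)                                # S1021
    freq = Counter(words)                                           # S1022
    candidates = [w for w in freq if freq[w] > freq_threshold]      # S1023
    sims = {w: cosine(embed(w, vocab), doc_vec) for w in candidates}  # S1024
    return sorted(w for w in sims if sims[w] > sim_threshold)       # S1025

print(extract_metadata("loan loan loan contract contract rate"))
# → ['contract', 'loan']
```

Swapping `embed` for real BERT sentence/word vectors leaves the control flow unchanged; only the vectors (and sensible threshold values) differ.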
As shown in fig. 3, in some embodiments, the determining the content-type metadata of the target document according to the similarity between the vectorized representation of each of the candidate words and the vectorized representation of the target document includes:
S10251, calculating the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document;
s10252, obtaining candidate words corresponding to the vectorized representations of which the similarity between the vectorized representations of the candidate words and the vectorized representations of the target document is greater than a first threshold;
s10253, determining content type metadata of the target document according to the obtained candidate words.
Specifically, candidate words with high similarity with the target document are selected from the candidate words, and part or all of the candidate words are selected as content metadata of the target document.
As shown in fig. 4, in some embodiments, the determining content-type metadata of the target document according to the obtained candidate word includes:
s102531, determining the candidate word with highest similarity between the vectorized representation of the obtained candidate words and the vectorized representation of the target document as content metadata of the target document;
s102532, traversing other candidate words except the content metadata in the obtained candidate words in sequence, calculating the similarity between the vectorized representation of each other candidate word and the vectorized representation of the determined content metadata of the target document, and determining the other candidate word as the content metadata of the target document if the similarity between the vectorized representation of the other candidate word and the vectorized representation of each content metadata is smaller than a second threshold value.
Specifically, the candidate word whose vectorized representation has the highest similarity to that of the target document is first determined as content metadata of the target document. Then, combining algorithms such as Maximal Marginal Relevance (MMR) and Max Sum Distance, candidate words with high similarity to the document but low similarity to the already selected keywords are chosen from the remaining candidates, and finally the keywords selected by both algorithms are taken as the content metadata of the target document. The MMR algorithm balances relevance and diversity in the keyword list by selecting candidate words that are similar to the document but dissimilar to the selected keywords, ensuring that the keywords are closely related to the document while retaining a degree of diversity. The Max Sum Distance algorithm maintains keyword diversity by selecting the words with the greatest distance between the candidate words and the selected keywords.
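The MMR-style selection described above can be sketched in a few lines. This is a generic illustrative MMR re-ranker, not the patent's exact procedure: similarities are passed in as plain dicts (in the method they would be cosines of BERT vectors), and the trade-off weight `lam` is an assumed parameter.

```python
# Illustrative MMR keyword selection: seed with the candidate most similar
# to the document (S102531), then greedily add candidates that are relevant
# to the document but not redundant with already-chosen keywords (S102532).
def mmr_select(doc_sim, word_sim, k, lam=0.7):
    """doc_sim: candidate -> similarity to the document.
    word_sim: frozenset({a, b}) -> similarity between two candidates.
    k: number of keywords to return; lam trades relevance vs. diversity."""
    selected = [max(doc_sim, key=doc_sim.get)]
    while len(selected) < k:
        remaining = [w for w in doc_sim if w not in selected]
        if not remaining:
            break
        def score(w):
            redundancy = max(word_sim[frozenset((w, s))] for s in selected)
            return lam * doc_sim[w] - (1 - lam) * redundancy
        selected.append(max(remaining, key=score))
    return selected

doc_sim = {"loan": 0.9, "credit": 0.85, "weather": 0.5}
word_sim = {frozenset(("loan", "credit")): 0.95,
            frozenset(("loan", "weather")): 0.1,
            frozenset(("credit", "weather")): 0.1}
print(mmr_select(doc_sim, word_sim, k=2, lam=0.5))  # → ['loan', 'weather']
```

Note how "credit", despite its high document similarity, loses to "weather" because it is nearly synonymous with the already-selected "loan"; this is exactly the diversity effect the text attributes to MMR.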
As shown in FIG. 5, in some embodiments, the inputting the target document into the trained document classification model includes:
s1011, preprocessing a target document, wherein the preprocessing comprises removing special characters, removing punctuation marks, removing stop words, stemming and/or lemmatization;
S1012, inputting the preprocessed target document into a trained document classification model.
In particular, text preprocessing may serve as the first step of automated document management. Text preprocessing is a critical step aimed at cleaning and normalizing raw text data to improve subsequent feature extraction and text analysis. First, special characters and punctuation are removed; documents may contain characters such as @, #, and $ that are meaningless for feature extraction and need to be removed from the text. Next, stop words are removed: these are common words that occur frequently in text but carry little actual meaning, contribute nothing to document management or key information extraction, and can therefore be dropped during preprocessing. Then, stemming and lemmatization are performed to convert words into their base forms, reducing the influence of morphological variation on feature extraction. Through these detailed preprocessing steps, the document yields cleaned and normalized text data; this helps eliminate noise in the text, reduces the interference of surface language forms with feature and information extraction, provides a more reliable and accurate basis for subsequent feature extraction and text analysis, and helps banks better manage and use the information in their documents.
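The preprocessing step (S1011) can be sketched as below. The stop-word list and the crude suffix-stripping stemmer are toy English stand-ins for illustration only; a production system for Chinese bank documents would use a proper tokenizer, stop-word list, and lemmatizer for the target language.

```python
# Minimal sketch of S1011: strip special characters and punctuation,
# drop stop words, and reduce words to a base form.
import re

STOP_WORDS = {"the", "is", "a", "of", "and", "to"}  # toy list

def stem(word):
    # Crude suffix stripping as a stand-in for real stemming/lemmatization.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"[@#$%^&*]", " ", text)   # special characters
    text = re.sub(r"[^\w\s]", " ", text)     # punctuation
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

print(preprocess("The manager signed the contracts, reports & invoices!"))
# → ['manager', 'sign', 'contract', 'report', 'invoice']
```

The cleaned token list is what would then be fed to word-frequency statistics and to the classification model's tokenizer.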
Through text preprocessing, the document yields clean, normalized text data, providing a more reliable basis for subsequent analysis and modeling. Feature extraction is the critical step of extracting useful information from the preprocessed text to support text classification, key-information recognition, and similar tasks. In document information management, feature extraction with a pre-trained model is a suitable choice: pre-trained models such as BERT perform well on natural language processing tasks and have strong semantic and contextual understanding capabilities.
Banking documents often contain a large number of technical terms, financial concepts, and domain-specific information that conventional bag-of-words or N-gram models may not capture well. A pre-trained model, by training on a large-scale corpus, learns rich semantic representations and can better understand and process text in these specific domains. By feeding text into a pre-trained model, the representations of its hidden or output layers can be taken as features. These representations retain the semantic information of the text and can be used for tasks such as text classification and key-information identification. In addition, the pre-trained model may be fine-tuned to further improve performance on a particular task. Feature extraction converts text into machine-understandable representations to support classification of banking documents, information extraction, and more advanced text analysis. Text preprocessing is thus a vital link in bank document information management: preprocessing and cleaning the text improves the accuracy and efficiency of text classification and keyword extraction, and thereby the quality and effect of document information management.
As shown in fig. 6, in some embodiments, the method further comprises:
S105, acquiring a document query request, wherein the document query request includes a query field, and the query field includes a document type and/or document content type metadata;
S106, searching the document library, according to the document query request, for a target document whose marking information contains the query field;
S107, sending the target document.
In particular, a user may query documents by document type, document content type metadata, or other criteria. After the query request is received, the document library is searched for documents that hit the query field in the request, and the matching documents are returned to the client.
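A minimal in-memory sketch of this query flow, with illustrative field names (`doc_type`, `keywords`) standing in for the marking information stored in the document library:

```python
# Toy document library: each entry holds a document's marking information
# (document type + content metadata keywords). Field names are illustrative.
document_library = [
    {"doc_id": 1, "doc_type": "contract", "keywords": ["loan", "interest"],
     "text": "..."},
    {"doc_id": 2, "doc_type": "invoice", "keywords": ["payment", "amount"],
     "text": "..."},
]

def query_documents(library, doc_type=None, keyword=None):
    """Return documents whose marking information hits the query fields."""
    hits = []
    for doc in library:
        if doc_type is not None and doc["doc_type"] != doc_type:
            continue
        if keyword is not None and keyword not in doc["keywords"]:
            continue
        hits.append(doc)
    return hits
```

A production system would delegate this filtering to an index (see the Solr discussion later in the document), but the matching semantics are the same.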
For a better understanding of the present application, a detailed description of a document management method based on natural language processing technology provided in the present application is provided below in a specific embodiment.
Fig. 7 is a flowchart of a document management method based on a natural language processing technology according to an embodiment of the present application, and as shown in fig. 7, the document management method based on a natural language processing technology according to an embodiment of the present application includes:
S201, text preprocessing is the first step of automated document classification and feature extraction. It is a critical step in bank document information management, aimed at cleaning and normalizing raw text data to improve the effectiveness of subsequent feature extraction and text analysis. First, special characters and punctuation are removed: a bank document may contain characters such as @, #, and $ that carry no meaning for feature extraction and need to be removed from the text. Next, stop words are removed. Stop words are common words that occur frequently in text but carry little meaning (for example, "the" and "of" in English); they contribute nothing to the information management of a bank document or the extraction of key information, and can therefore be removed during preprocessing. Then, stemming and lemmatization convert words into their base forms, reducing the influence of morphological variation on feature extraction. Through these preprocessing steps, the bank document yields cleaned and normalized text data; the steps help eliminate noise in the text, reduce the interference of surface linguistic forms with feature extraction and information extraction, and provide a more reliable and accurate basis for subsequent feature extraction and text analysis, helping the information in bank documents to be better managed and used.
Through text preprocessing, the bank document yields clean, normalized text data, providing a more reliable basis for subsequent analysis and modeling. Feature extraction is the critical step of extracting useful information from the preprocessed text to support text classification, key-information recognition, and similar tasks. In bank document information management, feature extraction with a pre-trained model is a suitable choice: pre-trained models such as BERT perform well on natural language processing tasks and have strong semantic and contextual understanding capabilities. Banking documents often contain a large number of technical terms, financial concepts, and domain-specific information that conventional bag-of-words or N-gram models may not capture well. A pre-trained model, by training on a large-scale corpus, learns rich semantic representations and can better understand and process text in these specific domains. By feeding text into a pre-trained model, the representations of its hidden or output layers can be taken as features; these representations retain the semantic information of the text and can be used for tasks such as text classification and key-information identification. In addition, the pre-trained model may be fine-tuned to further improve performance on a particular task. Feature extraction converts text into machine-understandable representations to support classification of banking documents, information extraction, and more advanced text analysis. The text processing module is a vital link in bank document information management: preprocessing and cleaning the text improves the accuracy and efficiency of text classification and keyword extraction, thereby improving the quality and effect of bank information management.
S202, text classification is the key step of automatically classifying documents. The basic idea is to train a classification model with a machine learning algorithm so that the model can automatically assign documents to different categories. This requires training the classifier on a labeled dataset of documents that have already been assigned to their categories; once training is complete, the model can be applied to unlabeled documents to classify them automatically. For the text material in bank documents, such as contracts, invoices, and reports, a Chinese BERT model may be used for the classification task. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing model that can be used for text classification. Based on the Transformer architecture, it captures contextual and semantic information in sentences and produces high-dimensional vector representations. In the banking field, BERT may be used to classify documents such as contracts, invoices, and reports. To use BERT for text classification, the labeled dataset must be converted into a format BERT can consume; one common approach is to convert each document into the average of its word-embedding vectors, which is then fed into a classifier for training. Domain knowledge of banking is also needed to select an appropriate pre-trained model and classifier for better classification results. By classifying documents automatically, a bank can manage and analyze large amounts of text data more effectively and speed up business processes.
S203, text keyword extraction is another key step of automated document management. It automatically extracts keywords and phrases from large amounts of text data and helps the bank quickly grasp document content and topics. These keywords and phrases can serve as document metadata, facilitating document search and classification. This module is primarily concerned with identifying and extracting keywords and phrases in documents. Pre-trained language models have been trained on large-scale text datasets and perform well across a variety of natural language processing tasks; the BERT model is particularly strong at keyword extraction from natural language text. Keywords are extracted using the Transformer architecture and the BERT pre-trained model. The Transformer is a neural network structure based on the self-attention mechanism; it excels at processing sequence data and effectively captures contextual information. BERT is a Transformer-based pre-trained model that produces rich text representations through pre-training on a large-scale text corpus. First, BERT is used to extract document embeddings, giving a document-level representation. Word embeddings of N-gram words/phrases are then extracted. Finally, cosine similarity is used to find the words/phrases most similar to the document; the most similar words can be taken as those that best describe the whole document. Keywords can be extracted with various Transformers and Sentence-Transformers models, as well as ZhKeyBERT, a KeyBERT variant fine-tuned for Chinese. By introducing the Maximal Marginal Relevance (MMR) information-retrieval technique, the aim is to select a set of results that are relevant to a given query while remaining diverse.
Max Sum Distance, a variant of MMR, combines relevance and diversity by maximizing the sum of distances between the selected items. The most relevant candidates are first selected according to a relevance scoring function. Then, taking into account the differences from the previously selected items, the candidate with the highest diversity score is selected from the remainder. This process iterates, with the goal of balancing relevance against diversity. By maximizing the sum of distances between the selected items, both the diversity and the relevance of keyword extraction are ensured.
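One common realization of Max Sum Distance (the variant used in KeyBERT, for instance) first restricts attention to a pool of the most relevant candidates and then enumerates subsets, keeping the subset with the smallest pairwise similarity (i.e. the greatest mutual distance). A sketch under those assumptions:

```python
import itertools
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_sum_distance(doc_vec, cand_vecs, cand_words, top_n=2, pool=3):
    """Take the `pool` candidates most similar to the document, then return
    the subset of size `top_n` whose pairwise similarity sum is smallest
    (i.e. whose mutual distance is greatest)."""
    relevance = [cosine(doc_vec, v) for v in cand_vecs]
    pool_idx = sorted(range(len(cand_words)), key=lambda i: -relevance[i])[:pool]
    best_combo, best_sim = None, float("inf")
    for combo in itertools.combinations(pool_idx, top_n):
        sim = sum(cosine(cand_vecs[i], cand_vecs[j])
                  for i, j in itertools.combinations(combo, 2))
        if sim < best_sim:
            best_combo, best_sim = combo, sim
    return [cand_words[i] for i in best_combo]
```

The `pool` size controls the relevance/diversity trade-off: a small pool keeps only highly relevant words, while the subset search within it spreads the keywords apart.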
Text vectorization using the BERT model: BERT splits text into a sequence of words or sub-words and converts them into embedding vectors. The embeddings include word embeddings (Word Embeddings) and position embeddings (Position Embeddings), representing the semantic and positional information of the tokens. In the BERT model, deeper semantic encodings are obtained by stacking multiple Transformer layers; the output of each Transformer layer serves as the input of the next, progressively extracting rich semantic features. In the output of the final Transformer layer, the BERT model typically uses the output of the special [CLS] (classification) token as the representation vector for the whole sentence. In this way, the BERT model translates the input text into a vector representation carrying rich semantic information.
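The [CLS]-pooling step can be illustrated on a toy matrix standing in for the final Transformer layer's output. Real hidden states would come from a library such as Hugging Face `transformers`; the array here is illustrative, with row 0 playing the [CLS] position per BERT's convention:

```python
import numpy as np

def sentence_vector(last_hidden_state, strategy="cls"):
    """Derive one sentence vector from a (seq_len, hidden) matrix of token
    vectors output by the final Transformer layer. Row 0 is assumed to be
    the [CLS] token, following BERT's convention."""
    if strategy == "cls":
        return last_hidden_state[0]            # [CLS] representation
    return last_hidden_state.mean(axis=0)      # mean pooling alternative

# Toy stand-in for a real BERT output (3 tokens, hidden size 4).
hidden = np.array([[0.1, 0.2, 0.3, 0.4],   # [CLS]
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
```

Mean pooling over all token vectors is a common alternative to [CLS] pooling; either yields one fixed-size vector per sentence for the downstream similarity computations.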
Word frequency vectorization and word frequency statistics of the document: word frequency vectorization represents a document as a vector in which each dimension corresponds to a word in the vocabulary and records how often that word occurs in the document. In short, the number of occurrences of each word in the document is counted, and the counts are used as the elements of the vector; word frequency statistics are computed over the text data with a CountVectorizer, which outputs the frequency of every word in every document.
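The word-frequency step can be sketched without the sklearn dependency; the following reproduces, in miniature, what `CountVectorizer` computes, assuming simple whitespace tokenization:

```python
from collections import Counter

def count_vectorize(documents):
    """Build a vocabulary over all documents and return, per document, a
    vector of raw term counts (what sklearn's CountVectorizer produces)."""
    vocab = sorted({w for doc in documents for w in doc.split()})
    counts_per_doc = []
    for doc in documents:
        counts = Counter(doc.split())
        counts_per_doc.append([counts.get(w, 0) for w in vocab])
    return vocab, counts_per_doc
```

High-count words from these vectors serve as the candidate-word pool that is then re-ranked by the BERT-based similarity steps.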
Obtaining candidate-word and document vectorization through the BERT model: to obtain a vector representation of each candidate word, the word is prepared in a format suitable for BERT input and its vector representation is obtained from the model. Likewise, the document is prepared in a format suitable for BERT input, and its vector representation is obtained through the model.
Similarity calculation and keyword acquisition: by calculating the vector similarity between the candidate words and the document, and combining algorithms such as Maximal Marginal Relevance (MMR) and Max Sum Distance, the most relevant candidate words can be selected more accurately as the keyword output. The MMR algorithm balances relevance and diversity in the keyword list by selecting candidates with high similarity to the document but low similarity to the keywords already selected; this ensures the keywords are closely related to the document while retaining a degree of diversity. The Max Sum Distance algorithm maintains keyword diversity by selecting the words with the greatest distance between the candidates and the selected keywords. The most similar candidate words are then output as keywords. The algorithm flow of keyword extraction is shown in fig. 8.
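Before MMR or Max Sum Distance re-rank for diversity, the baseline ranking is plain cosine similarity between each candidate vector and the document vector; a minimal sketch with toy vectors:

```python
import numpy as np

def top_keywords(doc_vec, cand_vecs, cand_words, k=2):
    """Rank candidate words by cosine similarity to the document vector
    and return the k most similar as the keyword output."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cosine(doc_vec, v) for v in cand_vecs]
    order = sorted(range(len(cand_words)), key=lambda i: -sims[i])
    return [cand_words[i] for i in order[:k]]
```

This pure-relevance ranking is what the diversity-aware algorithms refine: without them, near-synonymous candidates can crowd out distinct but relevant terms.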
S204, document output is the final step, applying the classification and keyword information to document management. The key functions implemented in this module include document tagging, indexing, archiving, retrieval, and management. In the banking field, document information management is very important because banks must process large numbers of documents, including contracts, financial reports, and transaction records. These documents need to be properly categorized, indexed, and archived for later retrieval and management. To implement these functions, an index system may be built with a full-text search engine, such as the Apache Lucene-based Solr. A document's metadata (e.g., title, author, date) and tags are associated with the document and added to the index, improving the efficiency and accuracy of document retrieval. For document tagging, a tag may be added to each document according to its classification and the result of keyword extraction; tags may include the document's category, importance level, relevant department, and so on, so that documents can be better organized and managed. For archiving, an appropriate storage medium is selected, such as a file system or database, and a directory structure or file-naming rule is established as required so that documents are archived in an organized way, ensuring secure storage and quick access. Finally, a user interface or API is created so that users can retrieve and manage documents by category, keyword, or other criteria; through the search function, a user can quickly find the required document and view, edit, or delete it.
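For the Solr-based index, a document together with its metadata and tags might be posted as JSON along the following lines. The field names and values here are illustrative, not a fixed schema; a real deployment would define its own schema fields:

```json
{
  "add": {
    "doc": {
      "id": "doc-001",
      "title": "Loan contract",
      "author": "legal-dept",
      "date": "2023-01-01T00:00:00Z",
      "doc_type": "contract",
      "keywords": ["loan", "interest rate"]
    }
  }
}
```

Indexing the type and keyword tags as separate fields is what allows the later query step to filter by document type and/or content metadata efficiently.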
At the same time, an access control mechanism may be added to ensure that only authorized personnel can access the sensitive document. The document output module realizes the automatic processing and the efficient management of a large number of documents of banks through the functions of document marking, indexing, archiving, retrieving and managing.
By using an automated document management system, the bank can greatly improve the efficiency and accuracy of document processing, thereby saving time and labor costs and reducing the risk of human error. In addition, the automated document management system can also improve the visibility and traceability of the document by the bank, so that the document is easier to audit and check compliance.
The automatic document classification and keyword extraction technology provided by the embodiments of the application can greatly improve the efficiency and accuracy of bank document management. Conventional document management methods generally require manual classification, indexing, and searching, consuming a great deal of time and effort, and they are prone to human error and inconsistency, affecting the bank's service quality and efficiency.
The technique may automatically identify and classify document types, such as contracts, invoices, reports, etc., and extract key information in the document. Through natural language processing and machine learning algorithms, the technology can carry out deep analysis on document contents, automatically extract keywords and phrases and generate corresponding labels and indexes so as to facilitate quick retrieval and management of documents. The technology can also help banks to realize intelligent management of documents. Through continuous learning and optimization, the system can continuously improve the accuracy and efficiency of document classification and keyword extraction, thereby improving the document management level of banks. The bank can greatly improve the efficiency and accuracy of document management by utilizing an automation technology, save time and cost, reduce manual errors and inconsistencies, improve the document retrieval efficiency, reduce the management cost, and further improve the competitiveness and the service quality of the bank.
Fig. 9 is a schematic structural diagram of a document management apparatus based on a natural language processing technology according to an embodiment of the present application, and as shown in fig. 9, the document management apparatus based on a natural language processing technology according to an embodiment of the present application includes:
the input module 31 is configured to input a target document into a trained document classification model, so as to obtain a document type of the target document output by the document classification model;
an extraction module 32 for extracting content-type metadata from the target document using natural language processing techniques;
a generating module 33, configured to generate tag information of the target document according to the document type and the content metadata;
and the storage module 34 is used for correspondingly storing the marking information and the target document into a document library.
According to the document management device based on the natural language processing technology, a target document is input into a trained document classification model, so that the document type of the target document output by the document classification model is obtained; extracting content metadata from the target document by using a natural language processing technology; generating marking information of the target document according to the document type and the content type metadata; and correspondingly storing the marking information and the target document into a document library. Therefore, the efficiency and accuracy of document management can be greatly improved by utilizing an automation technology, time and cost are saved, manual errors and inconsistencies are reduced, and management cost is reduced.
In some embodiments, the extraction module is specifically configured to:
inputting the target document into a text vectorization model to obtain vectorization representation of the target document output by the text vectorization model;
acquiring word frequency of each word in the target document;
according to the word frequency of each word, determining candidate words in the words;
obtaining a vectorized representation of each candidate word;
and determining content metadata of the target document according to the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document.
In some embodiments, the extracting module determining content-type metadata of the target document based on a similarity between the vectorized representation of each of the candidate words and the vectorized representation of the target document comprises:
calculating the similarity between the vectorized representation of each candidate word and the vectorized representation of the target document;
obtaining candidate words corresponding to the vectorized representations of which the similarity between the vectorized representations of the candidate words and the vectorized representations of the target document is greater than a first threshold value;
and determining content type metadata of the target document according to the obtained candidate words.
In some embodiments, the extracting module determining content-type metadata of the target document according to the obtained candidate word includes:
determining the candidate word with highest similarity between the vectorized representation of the obtained candidate word and the vectorized representation of the target document as content metadata of the target document;
and traversing, in turn, the other obtained candidate words apart from those already taken as content metadata, calculating the similarity between the vectorized representation of each such candidate word and the vectorized representations of the content metadata already determined for the target document, and, if the similarity between that candidate word's vectorized representation and the vectorized representation of every item of content metadata is smaller than a second threshold value, determining that candidate word as content metadata of the target document.
In some embodiments, the input module is specifically configured to:
preprocessing a target document, wherein the preprocessing includes removing special characters, removing punctuation marks, removing stop words, stemming and/or lemmatization;
inputting the preprocessed target document into a trained document classification model;
The determining of candidate words according to the word frequency of each word includes the following step:
and for each word, if the word frequency of the word is greater than a third threshold value, determining the word as a candidate word.
In some embodiments, the document classification model is trained on a Bert model using classified documents.
In some embodiments, the apparatus further comprises:
the acquisition module is used for acquiring a document query request, wherein the document query request comprises a query field, and the query field comprises a document type and/or document content type metadata;
the searching module is used for searching the target document with the query field in the marking information in the document library according to the document query request;
and the sending module is used for sending the target document.
The apparatus embodiments provided in the present application may be used to execute the processing flows of the method embodiments above; their functions are not repeated here, and reference may be made to the detailed description of the method embodiments.
It should be noted that, the document management method and device based on the natural language processing technology provided in the embodiments of the present application may be used in the financial field, and may also be used in any technical field other than the financial field.
Fig. 10 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device may include: a processor (processor) 401, a communication interface (Communications Interface) 402, a memory (memory) 403 and a communication bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with each other through the communication bus 404. The processor 401 may call logic instructions in the memory 403 to perform the method according to any of the embodiments described above, for example including: inputting a target document into a trained document classification model to obtain the document type of the target document output by the document classification model; extracting content metadata from the target document by using a natural language processing technique; generating marking information of the target document according to the document type and the content type metadata; and correspondingly storing the marking information and the target document into a document library.
Further, the logic instructions in the memory 403 may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the above-described method embodiments, for example including: inputting a target document into a trained document classification model to obtain the document type of the target document output by the document classification model; extracting content metadata from the target document by using a natural language processing technique; generating marking information of the target document according to the document type and the content type metadata; and correspondingly storing the marking information and the target document into a document library.
The present embodiment provides a computer-readable storage medium storing a computer program that causes the computer to execute the methods provided by the above-described method embodiments, for example including: inputting a target document into a trained document classification model to obtain the document type of the target document output by the document classification model; extracting content metadata from the target document by using a natural language processing technique; generating marking information of the target document according to the document type and the content type metadata; and correspondingly storing the marking information and the target document into a document library.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In the description of the present specification, reference to the terms "one embodiment," "one particular embodiment," "some embodiments," "for example," "an example," "a particular example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing embodiments are provided to illustrate the general principles of the present application and are not intended to limit the scope of the invention, which is defined by the claims.

Claims (10)

CN202311320057.5A | 2023-10-12 | 2023-10-12 | Document management method and device based on natural language processing technology | Pending | CN117313721A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311320057.5A | 2023-10-12 | 2023-10-12 | Document management method and device based on natural language processing technology

Publications (1)

Publication Number | Publication Date
CN117313721A (en) | 2023-12-29

Family

ID=89246081

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202311320057.5A | Pending | CN117313721A (en) | 2023-10-12 | 2023-10-12 | Document management method and device based on natural language processing technology

Country Status (1)

Country | Link
CN (1) | CN117313721A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119904205A (en) * | 2025-04-02 | 2025-04-29 | 中国水利水电第五工程局有限公司 | Process management methods, systems, equipment and media for enterprise patent application and approval

Similar Documents

Publication | Title
AU2019263758B2 (en) | Systems and methods for generating a contextually and conversationally correct response to a query
CN112836509B (en) | Expert system knowledge base construction method and system
CN113569050B (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning
CN111767716B (en) | Method and device for determining enterprise multi-level industry information and computer equipment
CN112818093A (en) | Evidence document retrieval method, system and storage medium based on semantic matching
CN110633365A (en) | A hierarchical multi-label text classification method and system based on word vectors
US20220358379A1 (en) | System, apparatus and method of managing knowledge generated from technical data
CN113934909A (en) | Financial event extraction method based on pre-training language and deep learning model
CN113971210B (en) | Data dictionary generation method and device, electronic equipment and storage medium
CN118278365A (en) | Automatic generation method and device for scientific literature review
AU2021444983A1 (en) | System and method of automatic topic detection in text
WO2021190662A1 (en) | Medical text sorting method and apparatus, electronic device, and storage medium
CN111291168A (en) | Book retrieval method, device and readable storage medium
CN118245564B (en) | Method and device for constructing feature comparison library supporting semantic review and repayment
CN119025672A (en) | A labeling system construction method based on large language model
CN117574858A (en) | Automatic generation method of class case retrieval report based on large language model
CN109190112B (en) | Patent classification method, system and storage medium based on dual-channel feature fusion
CN114491079A (en) | Knowledge graph construction and query method, device, equipment and medium
Dawar et al. | Comparing topic modeling and named entity recognition techniques for the semantic indexing of a landscape architecture textbook
WO2019246252A1 (en) | Systems and methods for identifying and linking events in structured proceedings
CN119988562A (en) | A question-answering method, device, equipment and medium based on a large model
CN117313721A (en) | Document management method and device based on natural language processing technology
CN118503454B (en) | Data query method, device, storage medium and computer program product
CN114254622A (en) | Intention identification method and device
Zhang et al. | Word embedding-based web service representations for classification and clustering

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
