CN119047437B - Text correction method and device - Google Patents

Text correction method and device

Info

Publication number: CN119047437B
Application number: CN202411546772.5A
Authority: CN (China)
Prior art keywords: word, vector, idf, document, word2vec
Priority date / Filing date: 2024-11-01
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN119047437A (en)
Inventors: 舒忠磊, 黄艳艳, 江焘, 程晖, 黎钢, 唐湘茜, 刘媛, 江文, 李晗, 高小雲, 郭甜甜, 张爽, 马莹
Current Assignee: Changjiang Water Resources Commission Network And Information Center (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Changjiang Water Resources Commission Network And Information Center
Application filed by Changjiang Water Resources Commission Network And Information Center
Priority to CN202411546772.5A
Publication of CN119047437A
Application granted
Publication of CN119047437B
Anticipated expiration

Abstract

The invention provides a text correction method and device. The method comprises: preprocessing a current document to construct its vocabulary; inputting the vocabulary into a pre-trained Word2Vec model and extracting the Word2Vec word vector of each word in the vocabulary; under a preset document set, calculating the TF-IDF value of each word in each document with the TF-IDF algorithm and constructing a TF-IDF feature vector for each word; fusing the TF-IDF feature vector and the Word2Vec word vector of each word to form a vector representation of the word; matching the corresponding target error word in a preset database according to each word's vector representation; and replacing each matched word in the current document with the standard word corresponding to the target error word, thereby correcting the text of the current document. The invention greatly improves text editing efficiency and effectively reduces the error rate in editing quality.

Description

Text correction method and device
Technical Field
The invention relates to the technical field of automatic detection and editing of computers, in particular to a text correction method and device.
Background
Editing and processing of scientific papers is an important part of editorial work and is mainly divided into three aspects: content processing, technical processing, and language processing. Conventional editing software on the market offers error checking and correction, but it mainly addresses language logic, wrongly written characters, symbols, and the like. It is of little help on the professional, technical side, its underlying database is relatively fixed, and users can hardly adapt it to their own needs, which restricts editing and correction work to a certain extent.
For scientific papers, technical processing is often the key and the difficulty: academic and technical issues are numerous, the required knowledge is broad, and editors can hardly master it in a short time. They generally rely on experience accumulated over long-term journal operation (including review-meeting conclusions, manuscript-handling conventions, industry standards, expert advice, and the like, hereinafter referred to as "experience rules").
Therefore, using information technology to apply these experience rules to scientific-paper editing, so as to detect and correct papers intelligently and improve editors' working efficiency, has clear practical value.
Disclosure of Invention
The invention provides a text correction method and a text correction device to address the low working efficiency of editors in the prior art.
The invention provides a text correction method comprising: preprocessing a current document to construct its vocabulary; inputting the vocabulary into a pre-trained Word2Vec model and extracting the Word2Vec word vector of each word in the vocabulary; under a preset document set, calculating the TF-IDF value of each word in each document with the TF-IDF algorithm and constructing a TF-IDF feature vector for each word; fusing the TF-IDF feature vector and the Word2Vec word vector of each word to form a vector representation of the word; matching the corresponding target error word in a preset database according to each word's vector representation, where the preset database contains error words and standard words in one-to-one correspondence, an error word is a word that does not meet a preset standard, a standard word is a word that meets the preset standard, and the preset standard is determined according to industry standards and expert rules; and replacing each matched word in the current document with the standard word corresponding to the target error word to realize text correction of the current document.
The text correction method provided by the invention further comprises adjusting the Word2Vec model and/or the vector feature fusion manner according to the matching result and/or the replacement result.
According to the text correction method provided by the invention, the current document is preprocessed to construct the vocabulary of the current document, and the method comprises the steps of text cleaning and word segmentation of the current document to construct the vocabulary of the current document.
According to the text correction method provided by the invention, under a preset document set, a TF-IDF value corresponding to each vocabulary in each document is calculated by using a TF-IDF algorithm, and a TF-IDF feature vector of each vocabulary is constructed, wherein the method comprises the steps of calculating word frequency TF of each vocabulary in a vocabulary of each document, and calculating inverse document frequency IDF of each vocabulary in the preset document set; according to the word frequency TF of each word in the vocabulary of each document and the inverse document frequency IDF in the preset document set, calculating a TF-IDF value corresponding to each word in each document; and constructing TF-IDF characteristic vectors of each vocabulary according to the TF-IDF values corresponding to each vocabulary in each document.
According to the text correction method, vector feature fusion is conducted on the TF-IDF feature vector and the Word2Vec Word vector of each Word to form vector representation of each Word, wherein the vector feature fusion comprises the steps of conducting dimension unification on the TF-IDF feature vector and the Word2Vec Word vector of each Word, conducting weighted feature fusion on the TF-IDF feature vector and the Word2Vec Word vector to form a weighted fusion vector, and conducting normalization processing on the weighted fusion vector to generate vector representation of each Word.
According to the text correction method, corresponding target error words are matched in a preset database according to the vector representation of each word, the method comprises the steps of calculating the similarity between the vector representation of each word and the vector representation of the error word in the preset database, wherein the vector representation of the error word is built in advance, and the error word with the largest similarity is used as the target error word when the error word with the similarity larger than a preset similarity threshold value exists in the preset database.
According to the text correction method provided by the invention, the operation of preprocessing the current document further comprises at least one of the following operations of removing stop words, extracting word stems, labeling parts of speech and converting into lower case operations.
In a second aspect, the present invention also provides a text correction apparatus, including:
The document processing module is used for preprocessing the current document and constructing a vocabulary of the current document;
The feature extraction module is used for inputting a vocabulary of a current document into a Word2Vec model which is trained in advance, extracting Word2Vec Word vectors of each Word in the vocabulary, calculating TF-IDF values corresponding to each Word in each document by using a TF-IDF algorithm under a preset document set, and constructing TF-IDF feature vectors of each Word;
The feature fusion module is used for carrying out vector feature fusion on the TF-IDF feature vector and Word2Vec Word vector of each Word to form vector representation of each Word;
the matching module is used for matching corresponding target error words in a preset database according to the vector representation of each word, wherein the preset database comprises a plurality of error words and standard words which are in one-to-one correspondence, the error words are words which do not accord with a preset standard, the standard words are words which accord with the preset standard, and the preset standard is determined according to an industry standard and expert rules;
and the text replacement correction module is used for replacing each word by using the standard word corresponding to the target error word so as to realize the text correction of the current document.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any one of the text correction methods described above when the program is executed.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a text correction method as described in any of the above.
The text correction method and device provided by the invention integrate experience rules into the text editing process and pre-establish a high-quality error-word/normative-word database, turning work that traditionally relied mainly on manual editing (and was therefore heavily influenced by individual learning and memory) into an integrated computing system that completes the correction automatically. By integrating originally scattered, diverse, and complex text-editing work into a computer program, the experience rules are used to the fullest, work results are not lost, editing beginners are guided to grasp the key points quickly, the difficulty of editing scientific papers is reduced, text editing efficiency is greatly improved, and the error rate in editing and correction quality is reduced.
The invention not only considers the semantic information of the Word (through Word2Vec model), but also considers the importance of the Word in specific context (through TF-IDF value), so that the correction result is more accurate and reasonable.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, a brief description will be given below of the drawings used in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the invention and that other drawings can be obtained from them without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a text correction method provided by the present invention;
FIG. 2 is a second flow chart of the text correction method according to the present invention;
FIG. 3 is a block diagram of a scientific paper normalization intelligent detection and correction system according to the invention;
FIG. 4 is a schematic diagram of a text correction apparatus according to the present invention;
Fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the description of embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The text correction method and apparatus provided by the embodiment of the present invention are described below with reference to fig. 1 to 5.
FIG. 1 is a schematic flow chart of a text correction method according to the present invention, as shown in FIG. 1, including but not limited to the following steps:
step 101, preprocessing the current document to construct a vocabulary of the current document.
The current document may be a Word document (in the present invention, generally a scientific paper). Optionally, text cleaning and word segmentation are performed on the current document to construct its vocabulary; other operations may of course be applied as well to obtain a clean vocabulary. The following steps illustrate how a vocabulary of the current document can be built (a minimal code sketch follows the list):
(1) Text cleaning, namely removing irrelevant characters in the current document, such as special symbols, punctuation marks and the like, and only retaining text information.
(2) Word segmentation, namely dividing continuous text character strings into separate words or terms.
(3) Stop-word removal: stop words are words that appear frequently in the text but contribute little to its meaning, such as "is" and "in"; removing them from the text reduces noise in subsequent processing.
(4) Stemming: for some languages, such as English, it may be desirable to reduce words to a base form, e.g., converting "running" to "run".
(5) Part-of-speech tagging: marking the part of speech of each word in the text, identifying whether it is a noun, a verb, an adjective, and so on.
(6) Conversion to lowercase: to avoid "Apple" and "apple" being treated as two different words, English text is converted to lowercase.
(7) Building a vocabulary, namely building a unified vocabulary for subsequent text representation and vectorization after all texts are processed.
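The preprocessing pipeline above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the use of the jieba segmenter, the tiny stop-word list, and the cleaning regular expression are assumptions chosen for the example.

```python
# Minimal preprocessing sketch (assumed tooling: jieba for Chinese word segmentation).
import re
import jieba  # pip install jieba

STOP_WORDS = {"的", "是", "在", "了", "和"}  # illustrative stop-word list, not the patent's

def preprocess(text: str) -> list[str]:
    """Clean, segment and filter one document, returning its token list."""
    # (1) Text cleaning: keep Chinese characters, letters and digits only.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]+", " ", text)
    # (6) Lowercasing for English tokens so "Apple" and "apple" are one word.
    text = text.lower()
    # (2) Word segmentation.
    tokens = [tok.strip() for tok in jieba.lcut(text) if tok.strip()]
    # (3) Stop-word removal.
    return [tok for tok in tokens if tok not in STOP_WORDS]

def build_vocabulary(documents: list[str]) -> list[str]:
    """(7) Build a unified vocabulary over all preprocessed documents."""
    vocab = set()
    for doc in documents:
        vocab.update(preprocess(doc))
    return sorted(vocab)
```

The resulting token lists and unified vocabulary feed the feature extraction in step 102.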
Step 102, inputting a vocabulary of a current document into a Word2Vec model which is trained in advance, extracting Word2Vec Word vectors of each Word in the vocabulary, and calculating TF-IDF values corresponding to each Word in each document by using a TF-IDF algorithm under a preset document set to construct TF-IDF feature vectors of each Word.
The preset document set may include all documents that have been processed, as well as documents prepared in advance for training the Word2Vec model.
And 103, carrying out vector feature fusion on the TF-IDF feature vector and Word2Vec Word vector of each Word to form vector representation of each Word.
Step 104, matching the corresponding target error word in a preset database according to the vector representation of each word.
The preset database comprises a plurality of error words and standard words which are in one-to-one correspondence, wherein the error words refer to words which do not accord with preset standards, the standard words are words which accord with the preset standards, and the preset standards are determined according to industry standards and expert rules.
And 105, replacing each word in the current document by using the normative word corresponding to the target error word to realize text correction of the current document.
The text correction method provided by the invention integrates experience rules into the text editing process and establishes a high-quality error-word/normative-word database in advance, turning work that traditionally relied mainly on manual editing (and was therefore heavily influenced by individual learning and memory) into an integrated computing system that completes the correction automatically. By integrating originally scattered, diverse, and complex scientific-paper editing work into a computer program, the experience rules are used to the fullest, work results are not lost, editing beginners are guided to grasp the key points quickly, the difficulty of editing is reduced, text editing efficiency is greatly improved, and the error rate in editing and correction quality is reduced.
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention calculates, using a TF-IDF algorithm, a TF-IDF value corresponding to each word in each document, and constructs a TF-IDF feature vector of each word, including the following steps:
(1) And calculating word frequency TF of each word in a vocabulary of each document, and calculating inverse document frequency IDF of each word in a preset document set.
Word frequency (TF) refers to the number of times a word appears in a document divided by the total number of words in that document. For the term t in document d, TF can be expressed as:
TF(t, d) = n(t, d) / N(d)
where n(t, d) is the number of occurrences of t in d and N(d) is the total number of words in d.
The inverse document frequency (IDF) reflects the importance of a term in the entire document collection. For the word t, IDF can be expressed as:
IDF(t) = log( |D| / |{d ∈ D : t ∈ d}| )
where |D| is the number of documents in the preset document set D and |{d ∈ D : t ∈ d}| is the number of documents containing t (a smoothing constant of 1 is commonly added to the denominator).
(2) According to the word frequency TF of each word in the vocabulary of each document and the inverse document frequency IDF in the preset document set, calculating a TF-IDF value corresponding to each word in each document;
The TF-IDF value is the product of TF and IDF:
TF-IDF(t, d) = TF(t, d) × IDF(t)
(3) And constructing TF-IDF characteristic vectors of each vocabulary according to the TF-IDF values corresponding to each vocabulary in each document.
For a word t, its TF-IDF feature vector v(t) can be expressed as:
v(t) = [ TF-IDF(t, d1), TF-IDF(t, d2), ..., TF-IDF(t, dn) ]
where d1, d2, ..., dn are the individual documents in the preset document set D.
It should be noted that if the word t appears in document d, the position of v(t) corresponding to d is TF-IDF(t, d); if it does not appear, that position is generally set to 0.
According to the method, the TF-IDF feature vector can be constructed for each vocabulary, so that subsequent feature fusion is facilitated.
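A small sketch of this TF-IDF computation is given below. It uses the standard formulation with +1 smoothing in the IDF denominator, which is an assumption; the exact parameters of the patented implementation are not specified here.

```python
# TF-IDF feature vector sketch: one vector entry per document in the preset set.
import math
from collections import Counter

def tf(term: str, doc_tokens: list[str]) -> float:
    """Term frequency: occurrences of `term` divided by the document length."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens) if doc_tokens else 0.0

def idf(term: str, corpus: list[list[str]]) -> float:
    """Inverse document frequency over the preset document set (+1 smoothing assumed)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tfidf_feature_vector(term: str, corpus: list[list[str]]) -> list[float]:
    """One TF-IDF value per document; 0.0 where the term does not appear."""
    term_idf = idf(term, corpus)
    return [tf(term, doc) * term_idf if term in doc else 0.0 for doc in corpus]
```

For a preset set of n documents this yields an n-dimensional feature vector per word, matching the construction described above.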
Based on the foregoing embodiments, as an alternative embodiment, the text correction method provided by the present invention, the Word2Vec model is usually obtained by training a large amount of text data, so as to learn Word vectors of each Word, and these vectors can reflect the semantic relationship between the words in a multidimensional space.
The training method of the Word2Vec model comprises the steps of constructing a corresponding vocabulary based on a pre-prepared document set to form a training data set, selecting a Skip-Gram model, setting super-parameters, and training the model until the model converges based on the training data set to generate Word vectors with excellent performance.
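Training the Skip-Gram Word2Vec model can be done, for instance, with gensim; the hyperparameter values below (vector size, window, minimum count, epochs) are placeholders for illustration, not values claimed in the patent.

```python
# Skip-Gram Word2Vec training sketch with gensim (sg=1 selects Skip-Gram).
from gensim.models import Word2Vec

def train_word2vec(tokenized_docs: list[list[str]]) -> Word2Vec:
    return Word2Vec(
        sentences=tokenized_docs,
        vector_size=100,   # word-vector dimensionality (hyperparameter)
        window=5,          # context window size
        min_count=1,       # minimum word-frequency threshold
        sg=1,              # 1 = Skip-Gram, 0 = CBOW
        epochs=10,
    )

# model = train_word2vec(corpus_tokens)
# vec = model.wv["无量纲"]   # Word2Vec word vector of a vocabulary entry
```

Training continues until convergence on the prepared document set, after which `model.wv` provides the Word2Vec word vector of each vocabulary entry.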
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention performs vector feature fusion on the TF-IDF feature vector and the Word2Vec Word vector of each Word to form a vector representation of each Word, and includes the following steps:
(1) And carrying out dimension unification on the TF-IDF characteristic vector and the Word2Vec Word vector of each Word.
There is often a problem of dimension inconsistency between the TF-IDF feature vector and the Word2Vec Word vector. Alternatively, the present invention maps the TF-IDF feature vector to the same dimensional space as the Word2Vec Word vector. As another alternative, the present invention may also directly perform the dimension-increasing processing on the vector with the low dimension by means of zero setting, so that the dimensions of the two vectors are consistent.
(2) And carrying out weighted feature fusion on the TF-IDF feature vector and the Word2Vec Word vector to form a weighted fusion vector.
Weighted fusion refers to assigning different weights to TF-IDF feature vectors and Word2Vec Word vectors to emphasize or deemphasize the importance of certain features. The choice of weights may be determined based on the needs of the task or by cross-validation.
The invention determines a weight coefficient α (which can be obtained through experiments) for weighting the TF-IDF feature vector and the Word2Vec word vector: each element of the TF-IDF vector is multiplied by α, and each element of the Word2Vec vector is multiplied by (1 − α), that is:
weighted TF-IDF feature vector = α · TF-IDF feature vector;
weighted Word2Vec word vector = (1 − α) · Word2Vec word vector.
Adding the weighted TF-IDF vector and the weighted Word2Vec vector gives the weighted fusion vector v_fused:
v_fused = weighted TF-IDF feature vector + weighted Word2Vec word vector.
(3) The weighted fusion vector is normalized to generate the vector representation of each word:
v_norm = v_fused / ||v_fused||
where v_norm is the normalized vector representation and ||v_fused|| is the norm of the weighted fusion vector.
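The dimension unification, weighted fusion, and normalization steps can be sketched with numpy as follows. Zero-padding for dimension unification follows the alternative described above; the use of the L2 norm for normalization is an assumption.

```python
# Weighted feature fusion sketch: pad to a common dimension, weight by alpha, then normalize.
import numpy as np

def fuse(tfidf_vec: np.ndarray, w2v_vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Dimension unification by zero-padding the shorter vector.
    dim = max(len(tfidf_vec), len(w2v_vec))
    tfidf_vec = np.pad(tfidf_vec, (0, dim - len(tfidf_vec)))
    w2v_vec = np.pad(w2v_vec, (0, dim - len(w2v_vec)))
    # Weighted fusion: alpha * TF-IDF + (1 - alpha) * Word2Vec.
    fused = alpha * tfidf_vec + (1.0 - alpha) * w2v_vec
    # Normalization to unit length (L2 norm assumed).
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused
```

The returned vector is the normalized vector representation v_norm used for matching in step 104.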
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention further includes adjusting a Word2Vec model and/or a vector feature fusion manner according to the matching result and/or the replacement result.
Specifically, the present embodiment is to further optimize the effect of text correction. The following are optional adjustment methods:
(1) The Word2Vec model may be tuned by:
Retraining the model: if some erroneous words are not correctly identified, more domain-related training data can be added to the training set and the Word2Vec model retrained. This allows the model to better capture the semantic information of these particular words.
Fine-tuning the model: if a pre-trained model is already available, it can be fine-tuned for the specific task. This means continuing to train the model on the data in the preset document set, making it better suited to the particular domain.
And adjusting super parameters, namely adjusting some super parameters of the Word2Vec model, such as window size, dimension number, minimum Word frequency threshold value and the like, according to the matching effect so as to obtain more proper Word vectors.
(2) The vector feature fusion approach may be adjusted by:
Weighted fusion: different weighting schemes may be tried for combining the TF-IDF feature vector and the Word2Vec word vector. For example, the weight ratio between the TF-IDF value and the word vector may be adjusted based on the final correction result to find the optimal fusion ratio.
Dynamic adjustment, namely dynamically adjusting the fusion mode according to different document types or fields. For example, for documents that are highly focused on lexical accuracy, the weight of the TF-IDF feature vector may be increased, while for documents that are more focused on semantic consistency, the weight of the Word2Vec Word vector may be increased.
Through the above methods, the invention can continuously optimize both the Word2Vec model and the way the vector features are combined, thereby improving the accuracy and efficiency of text correction.
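One possible way to adjust the weight coefficient α according to the correction results, as suggested above, is a simple sweep over a small validation set of known error-word pairs. The helper functions `vectorize` and `match` and the candidate values below are hypothetical and stand in for the fusion and matching steps described elsewhere.

```python
# Hypothetical alpha sweep: pick the fusion weight that maximizes correction accuracy
# on a held-out set of (word, expected_error_word) pairs.
def tune_alpha(validation_pairs, vectorize, match, candidates=(0.2, 0.35, 0.5, 0.65, 0.8)):
    """vectorize(word, alpha) -> fused vector; match(vec) -> matched error word or None."""
    best_alpha, best_acc = candidates[0], -1.0
    for alpha in candidates:
        hits = sum(
            1 for word, expected in validation_pairs
            if match(vectorize(word, alpha)) == expected
        )
        acc = hits / len(validation_pairs)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha
```

Cross-validation over such candidates is one concrete way to realize the weight selection mentioned in the weighted-fusion step.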
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention matches, according to the vector representation of each vocabulary, a corresponding target error word in a preset database, and includes the following steps:
(1) And calculating the similarity between the vector representation of each word and the vector representation of the error word in the preset database.
Commonly used similarity measures include cosine similarity and Euclidean distance; cosine similarity can be selected to calculate the similarity between two vectors.
The vector representations of the error words are constructed in advance with the same procedure used to obtain the vector representation of any word in the invention, so the details are not repeated here.
(2) And under the condition that error words with similarity larger than a preset similarity threshold value exist in a preset database, taking the error word with the maximum similarity as the target error word.
The invention can define a preset similarity threshold through historical experience, and only when the calculated similarity is larger than the threshold, the vocabulary is considered to be matched with a certain wrong word in the database.
Further, for each word, if there are a plurality of error words with their similarity greater than a threshold, the error word with the greatest similarity is selected as the target error word.
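The matching step, cosine similarity against the pre-built vector representations of the error words with a similarity threshold, can be sketched as follows; the threshold value of 0.85 is an assumption for illustration.

```python
# Cosine-similarity matching sketch against the pre-built error-word vectors.
import numpy as np

def match_error_word(word_vec: np.ndarray,
                     error_word_vecs: dict[str, np.ndarray],
                     threshold: float = 0.85):
    """Return the error word with the highest cosine similarity above the threshold, else None."""
    best_word, best_sim = None, threshold
    for err_word, err_vec in error_word_vecs.items():
        denom = np.linalg.norm(word_vec) * np.linalg.norm(err_vec)
        if denom == 0:
            continue
        sim = float(np.dot(word_vec, err_vec) / denom)
        if sim > best_sim:
            best_word, best_sim = err_word, sim
    return best_word
```

The returned target error word is then looked up in the error-word/normative-word database and the word in the document is replaced with the corresponding normative word.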
In summary, by establishing an experience-rule database and combining the strengths of the word vectors generated by Word2Vec with the feature vectors extracted by the TF-IDF algorithm, the invention achieves more accurate matching and replacement. Comparing and correcting the text against the pre-entered experience rules helps editing beginners grasp the key points quickly, greatly improves the editing efficiency of scientific papers, and effectively reduces the error rate in the editing and correction quality of scientific journals.
Further, FIG. 2 is a second flow chart of the text correction method provided by the present invention, and FIG. 3 is a framework diagram of the scientific-paper normalization intelligent detection and correction system related to the present invention. To describe the technical solution more clearly, a complete implementation process is described below with reference to FIG. 2 and FIG. 3.
(1) Software page design: the software interface presents four functions: data entry, document opening, editing and modification, and querying modification records.
(2) Data entry: editors collect the national and industry standards and the experience rules (review-meeting conclusions, manuscript-handling conventions, industry standards, expert advice, and the like) accumulated in the process of publishing scientific papers, enter the error words and the corresponding normative words into the database one by one, and complete the information storage module.
Specifically, empirical rules include, but are not limited to, the following two broad categories:
The first category refers to national and industry standards (hereinafter referred to as "standards"), such as the Rules for Writing Academic Papers (GB/T 7713.2-2022), The International System of Units and Its Application (GB/T 3100-93), and Information and Documentation—Rules for Bibliographic References (GB/T 7714-2015), and mainly covers paper format, wording, and the like.
The second category refers to rules repeatedly checked and summarized by scientific-journal editors over long-term journal operation (including review-meeting conclusions, manuscript-handling conventions, industry standards, expert advice, and so on), including but not limited to the use of special nouns and special verbs (including English abbreviations) in different contexts, conventional fixed expressions in engineering, formula symbols, commonly used standardized quantity names, unit symbols, error-prone words, and the like.
Alternatively, the "wrong word" in the present invention is defined as follows:
The first category covers statement errors, wrongly written characters, improper collocations, and the like that the standards make explicit, for example (the correct form is given in brackets): the amplitude (radiation), the multiple (preparation) are concerned, the body (life) is taken out of a certain place, the grid division (division), the winding (scratching) degree, the elimination (cutting) is weak, the runoff size is simulated (imitated), the rule (normalization) is carried out, the formula is entered in the tape (generation), the wedge (wedge) is not broken, the construction period is wrong;
The second category covers cases where the sentence (or word) is not wrong per se but is improperly or non-standardly used in scientific writing, together with obsolete nouns and terms, which are often difficult to identify by conventional detection means. For example: noise (noise), conductivity (conductivity), probability (probability), threshold (threshold), specific heat (specific heat capacity), heat conduction factor (heat conduction coefficient), specific gravity (density or relative density), mechanism (mechanism), concrete (concrete), mechanical properties (mechanical properties), test days/d (test time/d), weight 100 g (mass 100 g), X is a dimensionless proportionality constant (X is a proportionality constant with dimension one), and the like. Such error words are determined according to expert advice; for example, in rock mass classification, road and railway tunnels classify the rock mass into classes I-V according to the Standard for Engineering Classification of Rock Mass (GB/T 50218), while hydraulic and hydroelectric engineering classifies the rock mass into classes I-V according to the Code for Engineering Geological Investigation of Water Resources and Hydropower (GB 50487); the use of such technical terms differs under different backgrounds.
Alternatively, the "normative word" in the present invention is defined as follows:
Normative words are terms in scientific writing that conform to the subject terminologies published by the China National Committee for Terminology in Science and Technology and to the standard words and usages specified in the Chinese Academic Journal (CD Edition) Retrieval and Evaluation Data Standard.
For ease of understanding, the common erroneous phrase in scientific papers, "X is a dimensionless proportionality constant," is taken as an example. "Dimensionless" is entered as the "error word," "dimension one" is entered as the "normative word," and the system records and stores the pair.
(3) Extracting text in the Word document, and preprocessing the text in the Word document, wherein the preprocessing comprises the steps of Word segmentation, stop Word removal and the like, so as to obtain key words in the text.
Specifically, the text is segmented and stop words are removed, so that [ "X", "dimensionless", "proportional", "constant" ] is obtained.
(4) Generating TF-IDF feature vectors, training Word2Vec models, and combining the TF-IDF feature vectors with Word2Vec Word vectors through weighted average to form a new vector representation.
By the TF-IDF feature vector solving method described in the above embodiment, the TF-IDF vector of "wrong word" i.e. "dimensionless" in this example is obtained as [0.5,0.2,0.3].
The Word2Vec Word vector of "dimensionless" found using the Word2Vec model is [0.5, -0.1,0.2].
In this embodiment, α is taken as 0.5, i.e., the TF-IDF feature vector and the Word2Vec word vector contribute equally, and the weighted fusion vector is:
v_fused = 0.5 · [0.5, 0.2, 0.3] + 0.5 · [0.5, −0.1, 0.2] = [0.5, 0.05, 0.25]
Further, normalization is performed:
v_norm = v_fused / ||v_fused|| ≈ [0.89, 0.09, 0.45]
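The arithmetic of this worked example can be reproduced with the fusion sketch given earlier (α = 0.5); the rounded values match the reconstruction above.

```python
import numpy as np

tfidf_vec = np.array([0.5, 0.2, 0.3])    # TF-IDF feature vector of "dimensionless"
w2v_vec = np.array([0.5, -0.1, 0.2])     # Word2Vec word vector of "dimensionless"

fused = 0.5 * tfidf_vec + 0.5 * w2v_vec  # -> [0.5, 0.05, 0.25]
normalized = fused / np.linalg.norm(fused)
print(fused, normalized.round(2))        # [0.5 0.05 0.25] [0.89 0.09 0.45]
```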
(5) And performing matching operation by using the fused vector representation. When the matched 'error word' is searched, the corresponding 'standard word' is found in the database for replacement.
In this embodiment, the "dimensionless" is replaced with "dimension one".
It should be noted that whether a term in the original text is replaced depends on the database query result: if a corresponding error word is found, the replacement is executed; if not, the original word is kept as is or another preset strategy is applied.
(6) According to the actual matching and replacing effects, model parameters, feature fusion methods and the like are adjusted and optimized, so that the accuracy and efficiency of operation are improved.
Specifically, if "dimensionless" is found to be replaced by error, the model can be retrained by adjusting the value of the weight coefficient alpha, or reselecting other parameters of the TF-IDF and Word2Vec models.
(7) The editors confirm the words to be replaced by the system one by one, and the editors finish modification and save after confirmation.
Specifically, a professional reviews and confirms the results of the automatic replacement to ensure their accuracy. In this example, the reviewer finds during inspection that "dimensionless" in "X is a dimensionless proportionality constant" has been correctly replaced with "dimension one" and confirms the replacement. If a replacement is incorrect, the reviewer corrects it and provides feedback to the system to optimize the replacement algorithm.
As an optional embodiment, the invention also provides a specific process for text editing by using the scientific paper standardization intelligent detection and correction system, which is as follows:
(1) Data entry: the "error words" and "normative words" in the experience rules are entered into the preset database by category (specialized terms, common conventional expressions, standardized quantity names, error-prone words, and the like);
(2) Opening the document, importing the document to be edited.
(3) Information modification: the system extracts the text from the Word document, preprocesses it, and compares the extracted information with the "error words" in the database; if no "error word" is found the word is skipped, and if an "error word" is found it is replaced with the corresponding "normative word".
(4) And confirming modification, namely confirming modification content, and automatically adding the confirmed modification record into a data information base for storage and backup by a program.
(5) And (5) checking the modification record, namely storing and inquiring the confirmed modification information.
It should be noted that the above system is only one feasible text-editing system designed on the basis of the text correction method provided by the present invention; developers may divide the system into different functional modules according to actual needs and design a more elegant human-machine interface, but any system that adopts the text correction method provided by the present invention is, in essence, within the protection scope of the present invention.
FIG. 4 is a schematic diagram of a text correction apparatus according to the present invention, as shown in FIG. 4, the apparatus includes a document processing module 410, a feature extraction module 420, a feature fusion module 430, a matching module 440, and a text replacement correction module 450;
A document processing module 410, configured to pre-process a current document to construct a vocabulary of the current document;
The feature extraction module 420 is configured to input a vocabulary of a current document into a Word2Vec model that is trained in advance, extract Word2Vec Word vectors of each Word in the vocabulary, and calculate TF-IDF values corresponding to each Word in each document by using a TF-IDF algorithm under a preset document set, so as to construct TF-IDF feature vectors of each Word;
The feature fusion module 430 is configured to perform vector feature fusion on the TF-IDF feature vector and Word2Vec Word vector of each Word to form a vector representation of each Word;
The matching module 440 is configured to match the corresponding target error word in a preset database according to the vector representation of each word, where the preset database includes multiple error words and standard words that are in one-to-one correspondence, the error words are words that do not meet a preset standard, the standard words are words that meet the preset standard, and the preset standard is determined according to an industry standard and expert rules;
And the text replacement correction module 450 is configured to replace each word with a canonical word corresponding to the target error word, so as to correct the text of the current document.
It should be noted that, when the text correction apparatus provided in the embodiment of the present invention is specifically executed, the text correction method described in any one of the above embodiments may be executed, which is not described in detail in this embodiment.
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention, as shown in fig. 5, the electronic device may include a processor (processor) 510, a communication interface (communications interface) 520, a memory (memory) 530, and a communication bus 540, where the processor 510, the communication interface 520, and the memory 530 complete communication with each other through the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform the text correction method.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the text correction method provided by the above embodiments.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the text correction method provided by the above embodiments.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims (9)

Translated from Chinese
1. A text correction method, comprising:
preprocessing a current document and building a vocabulary of the current document;
inputting the vocabulary of the current document into a pre-trained Word2Vec model and extracting the Word2Vec word vector of each word in the vocabulary; and, under a preset document set, calculating the TF-IDF value corresponding to each word in each document using the TF-IDF algorithm, and constructing the TF-IDF feature vector of each word;
performing vector feature fusion on the TF-IDF feature vector and the Word2Vec word vector of each word to form a vector representation of each word;
matching a corresponding target error word in a preset database according to the vector representation of each word, wherein the preset database includes a plurality of error words and standard words in one-to-one correspondence, an error word is a word that does not meet a preset standard, a standard word is a word that meets the preset standard, and the preset standard is determined according to industry standards and expert rules;
replacing each word in the current document with the standard word corresponding to the target error word, so as to correct the text of the current document;
wherein performing vector feature fusion on the TF-IDF feature vector and the Word2Vec word vector of each word to form the vector representation of each word comprises:
unifying the dimensions of the TF-IDF feature vector and the Word2Vec word vector of each word;
performing weighted feature fusion on the TF-IDF feature vector and the Word2Vec word vector to form a weighted fusion vector;
normalizing the weighted fusion vector to generate the vector representation of each word;
and further comprising: for documents that emphasize lexical accuracy, increasing the weight of the TF-IDF feature vector; for documents that emphasize semantic coherence, increasing the weight of the Word2Vec word vector.
2. The text correction method according to claim 1, further comprising:
adjusting the Word2Vec model and/or the vector feature fusion manner according to the matching result and/or the replacement result.
3. The text correction method according to claim 1, wherein preprocessing the current document and building the vocabulary of the current document comprises:
performing text cleaning and word segmentation on the current document to build the vocabulary of the current document.
4. The text correction method according to claim 1, wherein, under the preset document set, calculating the TF-IDF value corresponding to each word in each document using the TF-IDF algorithm and constructing the TF-IDF feature vector of each word comprises:
calculating the word frequency TF of each word in the vocabulary of each document, and calculating the inverse document frequency IDF of each word in the preset document set;
calculating the TF-IDF value corresponding to each word in each document according to the word frequency TF of each word in the vocabulary of each document and the inverse document frequency IDF in the preset document set;
constructing the TF-IDF feature vector of each word according to the TF-IDF value corresponding to each word in each document.
5. The text correction method according to claim 1, wherein matching the corresponding target error word in the preset database according to the vector representation of each word comprises:
calculating the similarity between the vector representation of each word and the vector representations of the error words in the preset database, wherein the vector representations of the error words are constructed in advance;
when there is an error word in the preset database whose similarity with the word is greater than a preset similarity threshold, taking the error word with the greatest similarity as the target error word.
6. The text correction method according to claim 1, wherein preprocessing the current document further comprises at least one of the following operations:
removing stop words, stemming, part-of-speech tagging, and conversion to lowercase.
7. A text correction device, comprising:
a document processing module, configured to preprocess a current document and build a vocabulary of the current document;
a feature extraction module, configured to input the vocabulary of the current document into a pre-trained Word2Vec model and extract the Word2Vec word vector of each word in the vocabulary, and, under a preset document set, calculate the TF-IDF value corresponding to each word in each document using the TF-IDF algorithm and construct the TF-IDF feature vector of each word;
a feature fusion module, configured to perform vector feature fusion on the TF-IDF feature vector and the Word2Vec word vector of each word to form a vector representation of each word, wherein the fusion comprises: unifying the dimensions of the TF-IDF feature vector and the Word2Vec word vector of each word; performing weighted feature fusion on the TF-IDF feature vector and the Word2Vec word vector to form a weighted fusion vector; and normalizing the weighted fusion vector to generate the vector representation of each word;
a matching module, configured to match a corresponding target error word in a preset database according to the vector representation of each word, wherein the preset database includes a plurality of error words and standard words in one-to-one correspondence, an error word is a word that does not meet a preset standard, a standard word is a word that meets the preset standard, and the preset standard is determined according to industry standards and expert rules;
a text replacement correction module, configured to replace each word with the standard word corresponding to the target error word, so as to correct the text of the current document;
the device is further configured to increase the weight of the TF-IDF feature vector for documents that emphasize lexical accuracy, and to increase the weight of the Word2Vec word vector for documents that emphasize semantic coherence.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the text correction method according to any one of claims 1 to 6 when executing the computer program.
9. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the text correction method according to any one of claims 1 to 6.
CN202411546772.5A | Priority date 2024-11-01 | Filing date 2024-11-01 | Text correction method and device | Active | CN119047437B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411546772.5A (CN119047437B) | 2024-11-01 | 2024-11-01 | Text correction method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411546772.5A (CN119047437B) | 2024-11-01 | 2024-11-01 | Text correction method and device

Publications (2)

Publication Number | Publication Date
CN119047437A (en) | 2024-11-29
CN119047437B (en) | 2025-02-18

Family

ID=93586373

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411546772.5A (Active, CN119047437B) | Text correction method and device | 2024-11-01 | 2024-11-01

Country Status (1)

Country | Link
CN | CN119047437B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115455944A (en) * | 2022-09-14 | 2022-12-09 | 中国工商银行股份有限公司 | Text processing method and device and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10762293B2 (en) * | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction
US9245015B2 (en) * | 2013-03-08 | 2016-01-26 | Accenture Global Services Limited | Entity disambiguation in natural language text
CN112364633B (en) * | 2021-01-13 | 2021-04-13 | 浙江一意智能科技有限公司 | Character error acquisition and correction method, device and storage medium
CN113297410A (en) * | 2021-07-26 | 2021-08-24 | 广东众聚人工智能科技有限公司 | Image retrieval method and device, computer equipment and storage medium
KR102596190B1 (en) * | 2023-04-12 | 2023-10-31 | (주)액션파워 | Method for editing text information
CN116306600B (en) * | 2023-05-25 | 2023-08-11 | 山东齐鲁壹点传媒有限公司 | MacBert-based Chinese text error correction method


Also Published As

Publication number | Publication date
CN119047437A (en) | 2024-11-29

Similar Documents

Publication | Publication Date | Title
CN113239210B (en) Water conservancy literature recommendation method and system based on automatic completion of knowledge graph
CN109885683B (en) A Method of Generating Text Summary Based on K-means Model and Neural Network Model
CN109918666B (en)Chinese punctuation mark adding method based on neural network
CN113011533A (en)Text classification method and device, computer equipment and storage medium
Berg-Kirkpatrick et al.Unsupervised transcription of historical documents
CN116167362A (en) Model training method, Chinese text error correction method, electronic device and storage medium
CN108804428A (en)Correcting method, system and the relevant apparatus of term mistranslation in a kind of translation
CN109992775B (en)Text abstract generation method based on high-level semantics
CN103116578A (en)Translation method integrating syntactic tree and statistical machine translation technology and translation device
CN108287911B (en)Relation extraction method based on constrained remote supervision
CN111814477B (en) Dispute focus discovery method, device and terminal based on dispute focus entity
CN112069826A (en) Vertical Domain Entity Disambiguation Method Fusing Topic Models and Convolutional Neural Networks
CN117251524A (en)Short text classification method based on multi-strategy fusion
CN118152520A (en)Automatic rapid knowledge base construction method, system and device based on large language model technology
CN117034327B (en)E-book content encryption protection method
CN119003712A (en)Nuclear power knowledge question-answering method and system based on large language model
CN111898337B (en)Automatic generation method of single sentence abstract defect report title based on deep learning
CN120338395A (en) A method and system for distinguishing R&D investment projects based on large language model
CN119203997B (en)Text data enhancement method based on government text unhappy choice of words error correction
CN119337820B (en)Automatic typesetting method and device for articles
CN114756617A (en) A method, system, device and storage medium for extracting structured data of engineering archives
CN119047437B (en) Text correction method and device
CN118467733A (en) A text analysis method, device, equipment and storage medium
CN115658956B (en)Hot topic mining method and system based on conference audio data
CN117131863A (en)Text generation method, device, equipment and medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
