Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that in the description of embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The text correction method and apparatus provided by the embodiments of the present invention are described below with reference to FIGS. 1 to 5.
FIG. 1 is a schematic flow chart of the text correction method according to the present invention; as shown in FIG. 1, the method includes, but is not limited to, the following steps:
Step 101, preprocessing the current document to construct a vocabulary of the current document.
The current document may be a Word document (in the present invention, this generally refers to a scientific paper). Optionally, text cleaning and word segmentation are performed on the current document to construct its vocabulary; other operations may of course also be performed to obtain a clean vocabulary. The following is an example of building a vocabulary for the current document:
(1) Text cleaning: removing irrelevant characters from the current document, such as special symbols and punctuation marks, and retaining only the text information.
(2) Word segmentation: dividing continuous text strings into separate words or terms.
(3) Stop-word removal: stop words are words that appear frequently in text but contribute little to its meaning, such as "the" and "in"; removing them from the text reduces noise in subsequent processing.
(4) Stemming: for some languages, such as English, it may be desirable to reduce words to a base form, e.g., converting "running" to "run".
(5) Part-of-speech tagging: marking the part of speech of each word in the text, identifying whether it is a noun, verb, adjective, and so on.
(6) Conversion to lowercase: to avoid "Apple" and "apple" being treated as two different words, English text is converted to lowercase.
(7) Building the vocabulary: after all texts are processed, a unified vocabulary is built for subsequent text representation and vectorization.
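The seven steps above can be sketched in Python. This is a rough illustration only, not the invention's implementation: the stop-word list, the regular expression, and the function name are assumptions, and stemming (step (4)) and part-of-speech tagging (step (5)) are omitted for brevity.

```python
import re

# Hypothetical minimal stop-word list; a real system would use a fuller,
# language-appropriate list.
STOP_WORDS = {"the", "in", "a", "an", "of", "is"}

def build_vocabulary(text: str) -> list[str]:
    """Clean, segment, lowercase, and de-duplicate the words of a document."""
    # (1) Text cleaning: replace special symbols and punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # (2) Word segmentation + (6) conversion to lowercase.
    tokens = cleaned.lower().split()
    # (3) Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # (7) Build the vocabulary: unique words in first-seen order.
    seen, vocab = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            vocab.append(t)
    return vocab

print(build_vocabulary("X is a dimensionless, dimensionless proportionality constant."))
# → ['x', 'dimensionless', 'proportionality', 'constant']
```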
Step 102, inputting the vocabulary of the current document into a pre-trained Word2Vec model, extracting the Word2Vec word vector of each word in the vocabulary, and calculating, over a preset document set, the TF-IDF value corresponding to each word in each document using the TF-IDF algorithm, so as to construct the TF-IDF feature vector of each word.
The preset document set may include all documents that have been processed, as well as documents prepared in advance for training the Word2Vec model.
Step 103, performing vector feature fusion on the TF-IDF feature vector and the Word2Vec word vector of each word to form a vector representation of each word.
Step 104, matching the corresponding target error word in a preset database according to the vector representation of each word.
The preset database contains a plurality of error words and standard words in one-to-one correspondence. An error word is a word that does not meet a preset standard; a standard word is a word that meets the preset standard; and the preset standard is determined according to industry standards and expert rules.
Step 105, replacing each matched word in the current document with the standard word corresponding to its target error word, so as to realize text correction of the current document.
The text correction method provided by the invention integrates empirical rules into the text-editing process and establishes a high-quality error word-standard word database in advance. It turns a traditional workflow that relied mainly on manual editing (and was therefore strongly affected by an individual's learning and memory capacity) into an integrated computing system that completes the task automatically. Through data integration, the originally scattered, numerous, and complex editing tasks of scientific papers are consolidated into a computer program, making maximal use of empirical rules and effectively avoiding the loss of accumulated results. The method helps guide beginning editors to quickly grasp the key points, reduces the difficulty of text editing, greatly improves text-editing efficiency, and reduces the error rate in editing and correction quality.
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention calculates, using a TF-IDF algorithm, a TF-IDF value corresponding to each word in each document, and constructs a TF-IDF feature vector of each word, including the following steps:
(1) Calculating the term frequency TF of each word in the vocabulary of each document, and calculating the inverse document frequency IDF of each word over the preset document set.
Term frequency (TF) is the number of times a word appears in a document divided by the total number of words in that document. For the word t in document d, TF can be expressed as:

TF(t, d) = n(t, d) / N(d),

where n(t, d) is the number of occurrences of t in d and N(d) is the total number of words in d.
The inverse document frequency (IDF) reflects the importance of a word in the entire document collection. For the word t, IDF can be expressed as:

IDF(t) = log(|D| / (1 + df(t))),

where |D| is the number of documents in the preset document set D and df(t) is the number of documents in D that contain t (the added 1 avoids division by zero).
(2) According to the word frequency TF of each word in the vocabulary of each document and the inverse document frequency IDF in the preset document set, calculating a TF-IDF value corresponding to each word in each document;
The TF-IDF value is the product of TF and IDF:

TF-IDF(t, d) = TF(t, d) × IDF(t).
(3) Constructing the TF-IDF feature vector of each word according to the TF-IDF values corresponding to that word in each document.
For a word t, its feature vector V(t) can be expressed as:

V(t) = [TF-IDF(t, d_1), TF-IDF(t, d_2), …, TF-IDF(t, d_n)],

where d_1, d_2, …, d_n are the individual documents in the preset document set D.
It should be noted that if the word t appears in document d_i, the position of V(t) corresponding to d_i holds the value TF-IDF(t, d_i); if t does not appear in d_i, that position is generally set to 0.
According to the method, the TF-IDF feature vector can be constructed for each vocabulary, so that subsequent feature fusion is facilitated.
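As a minimal sketch of the TF, IDF, and feature-vector construction above (assuming documents are already segmented into word lists; the +1 smoothing in the IDF denominator is a common convention and an assumption here, as are the function names):

```python
import math

def tf(term: str, doc: list[str]) -> float:
    """Term frequency: occurrences of `term` in `doc` over the total word count of `doc`."""
    return doc.count(term) / len(doc)

def idf(term: str, docs: list[list[str]]) -> float:
    """Inverse document frequency over the preset document set (with +1 smoothing)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + containing))

def tfidf_vector(term: str, docs: list[list[str]]) -> list[float]:
    """One TF-IDF value per document; positions where the term is absent are set to 0."""
    return [tf(term, d) * idf(term, docs) if term in d else 0.0 for d in docs]
```

Each word's vector thus has one component per document in the preset document set D, matching the construction described above.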
Based on the foregoing embodiments, as an alternative embodiment of the text correction method provided by the present invention, the Word2Vec model is usually obtained by training on a large amount of text data, so that it learns a word vector for each word; these vectors reflect the semantic relationships between words in a multidimensional space.
The training method of the Word2Vec model comprises: constructing a corresponding vocabulary based on a pre-prepared document set to form a training data set; selecting the Skip-Gram model; setting the hyperparameters; and training the model on the training data set until it converges, thereby generating word vectors of good quality.
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention performs vector feature fusion on the TF-IDF feature vector and the Word2Vec Word vector of each Word to form a vector representation of each Word, and includes the following steps:
(1) Unifying the dimensions of the TF-IDF feature vector and the Word2Vec word vector of each word.
There is often a dimension mismatch between the TF-IDF feature vector and the Word2Vec word vector. Optionally, the present invention maps the TF-IDF feature vector into the same dimensional space as the Word2Vec word vector. As another alternative, the lower-dimensional vector may be padded with zeros so that the dimensions of the two vectors are consistent.
(2) Performing weighted feature fusion on the TF-IDF feature vector and the Word2Vec word vector to form a weighted fusion vector.
Weighted fusion refers to assigning different weights to TF-IDF feature vectors and Word2Vec Word vectors to emphasize or deemphasize the importance of certain features. The choice of weights may be determined based on the needs of the task or by cross-validation.
The invention determines a weight coefficient α (which can be obtained experimentally) for weighting the TF-IDF feature vector and the Word2Vec word vector: each element of the TF-IDF vector is multiplied by α, and each element of the Word2Vec vector is multiplied by (1 − α). Then:

weighted TF-IDF feature vector = α · TF-IDF feature vector;
weighted Word2Vec word vector = (1 − α) · Word2Vec word vector.

Adding the weighted TF-IDF vector and the weighted Word2Vec vector yields the weighted fusion vector v_fused:

v_fused = weighted TF-IDF feature vector + weighted Word2Vec word vector.
(3) Normalizing the weighted fusion vector to generate the vector representation of each word:

v_norm = v_fused / ‖v_fused‖,

where v_norm is the normalized vector representation.
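A minimal sketch of the weighted fusion and normalization, assuming the two vectors have already been unified to the same dimension (the function names and example values are illustrative):

```python
import math

def fuse(tfidf_vec: list[float], w2v_vec: list[float], alpha: float = 0.5) -> list[float]:
    """Weighted fusion: alpha * TF-IDF vector + (1 - alpha) * Word2Vec vector."""
    assert len(tfidf_vec) == len(w2v_vec), "dimensions must be unified first"
    return [alpha * a + (1 - alpha) * b for a, b in zip(tfidf_vec, w2v_vec)]

def normalize(vec: list[float]) -> list[float]:
    """L2-normalize the fused vector to obtain the final word representation."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

fused = fuse([0.5, 0.2, 0.3], [0.5, -0.1, 0.2], alpha=0.5)
print(fused)  # ≈ [0.5, 0.05, 0.25]
```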
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention further includes adjusting a Word2Vec model and/or a vector feature fusion manner according to the matching result and/or the replacement result.
Specifically, the present embodiment is to further optimize the effect of text correction. The following are optional adjustment methods:
(1) The Word2Vec model may be tuned by:
Retraining the model: if some error words are found not to be correctly identified, more domain-related training data can be added to the training set and the Word2Vec model retrained. This allows the model to better capture the semantic information of these particular words.
Fine-tuning the model: if a pre-trained model is already available, it can be fine-tuned for the specific task. This means continuing to train the model using data from the preset document set, making it better suited to the particular domain.
Adjusting hyperparameters: some hyperparameters of the Word2Vec model, such as the window size, vector dimensionality, and minimum word-frequency threshold, can be adjusted according to the matching effect to obtain more suitable word vectors.
(2) The vector feature fusion approach may be adjusted by:
Weighted fusion: different weighting schemes may be tried to combine the TF-IDF feature vector and the Word2Vec word vector. For example, the weight ratio of the TF-IDF value to the word vector may be adjusted based on the final correction results to find the optimal fusion ratio.
Dynamic adjustment: the fusion scheme may be adjusted dynamically according to the document type or field. For example, for documents that demand high lexical accuracy, the weight of the TF-IDF feature vector may be increased, while for documents that emphasize semantic consistency, the weight of the Word2Vec word vector may be increased.
Through the above methods, the Word2Vec model and the vector feature fusion scheme can be continuously optimized, thereby improving the accuracy and efficiency of text correction.
Based on the foregoing embodiment, as an optional embodiment, the text correction method provided by the present invention matches, according to the vector representation of each vocabulary, a corresponding target error word in a preset database, and includes the following steps:
(1) Calculating the similarity between the vector representation of each word and the vector representations of the error words in the preset database.
Commonly used similarity measures include cosine similarity and Euclidean distance; cosine similarity may be chosen to calculate the similarity between two vectors.
The vector representations of the error words in the preset database are constructed in the same way as the vector representation of any word described above, and the details are not repeated here.
(2) When error words whose similarity is greater than a preset similarity threshold exist in the preset database, taking the error word with the greatest similarity as the target error word.
The invention can define a preset similarity threshold based on historical experience; only when the calculated similarity is greater than this threshold is the word considered to match a certain error word in the database.
Further, for each word, if there are multiple error words whose similarity exceeds the threshold, the error word with the greatest similarity is selected as the target error word.
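The matching step can be sketched as follows (a simplified illustration; the threshold value, function names, and two-dimensional example vectors are assumptions, not values from the invention):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_error_word(word_vec: list[float],
                     error_word_vecs: dict[str, list[float]],
                     threshold: float = 0.9):
    """Return the error word with the greatest similarity strictly above the
    threshold, or None when no error word exceeds it."""
    best_word, best_sim = None, threshold
    for err_word, err_vec in error_word_vecs.items():
        sim = cosine(word_vec, err_vec)
        if sim > best_sim:
            best_word, best_sim = err_word, sim
    return best_word
```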
In summary, by establishing an empirical-rule database and combining the advantages of the word vectors generated by Word2Vec with the feature vectors extracted by the TF-IDF algorithm, the invention achieves more accurate matching and replacement. Comparing and correcting texts against the pre-entered empirical rules helps beginning editors quickly grasp the key points, greatly improves the editing efficiency of scientific papers, and effectively reduces the error rate in the editing and correction quality of scientific journals.
Further, FIG. 2 is a second schematic flow chart of the text correction method provided by the present invention, and FIG. 3 is a framework diagram of the scientific-paper standardization intelligent detection and correction system related to the present invention. To describe the technical solution of the present invention more clearly, a complete implementation process is described below with reference to FIGS. 2 and 3.
(1) Designing the software page: the software interface presents four functions: data entry, document opening, editing and modification, and querying of modification records.
(2) Collecting the empirical rules accumulated by editors in implementing national and industry standards and in publishing scientific papers (including review meetings, editorial-meeting conventions, industry practice, expert advice, etc.), entering the error words and standard words into the database one by one, and completing the information storage module.
Specifically, empirical rules include, but are not limited to, the following two broad categories:
The first category refers to national and industry standards (hereinafter referred to as "standards"), such as the rules for writing academic papers (GB/T 7713.2-2022), the International System of Units and its applications (GB/T 3100-93), and the rules for bibliographic references (GB/T 7714-2015), mainly covering paper formats, sentences, and the like.
The second category refers to rules repeatedly identified and summarized by technical-journal editors in long-term journal operation (including review meetings, editorial-meeting conventions, industry practice, expert suggestions, etc.), including but not limited to: the use of specialized nouns and verbs (including English abbreviations) in different contexts, customary fixed expressions in engineering, formula symbols, common standardized quantity names, unit symbols, error-prone words, and the like.
Alternatively, the "wrong word" in the present invention is defined as follows:
The first category comprises statement errors, wrong characters, and improper collocations that are explicitly addressed in the standards, for example (correct form in brackets; these pairs are distinct characters in the original Chinese): amplitude (radiation), multiple (preparation), body (life) taken out of a certain place, grid division (division), winding (scratching) degree, elimination (cutting) weak, simulated (imitated) runoff size, rule (normalization), tape (generation) entered into the formula, wedge (wedge) not broken, and wrong construction period;
The second category comprises sentences and words that are not wrong per se but are improperly used or non-standard in scientific and technical writing, together with obsolete nouns and terms that are often difficult to detect by conventional means. Examples (correct form in brackets; some error/standard pairs are distinct terms in the original Chinese that share an English rendering): noise (noise), conductivity (conductivity), probability (probability), threshold (threshold), specific heat (specific heat capacity), heat conduction factor (heat conduction coefficient), specific gravity (density or relative density), mechanism (mechanism), concrete (concrete), mechanical properties (mechanical properties), test days/d (test time/d), weight 100 g (mass 100 g), X is a dimensionless proportionality constant (X is a proportionality constant of dimension one), and the like. Some error words are determined according to expert advice: for example, in rock mass classification, road and railway tunnel practice classifies rock mass into classes I-V per the engineering rock mass classification standard (GB/T 50218), while hydraulic and hydroelectric engineering classifies rock mass into classes I-V per the hydraulic and hydroelectric engineering geological survey standard (GB 50487); the usage of such technical terms differs under different backgrounds.
Alternatively, the "normative word" in the present invention is defined as follows:
In scientific and technical writing, standard words are words that conform to the discipline-specific terminology published by the national scientific and technical terms approval committee, and to the standard words and usages specified in the Chinese Academic Journal (CD Edition) retrieval and evaluation data standard.
For ease of understanding, the common erroneous phrase in scientific papers, "X is a dimensionless proportionality constant," is taken as an example. "Dimensionless" is entered as the "error word," "dimension one" is entered as the "standard word," and the system records and stores the pair.
(3) Extracting the text from the Word document and preprocessing it, the preprocessing including word segmentation, stop-word removal, and the like, so as to obtain the key words in the text.
Specifically, the text is segmented and stop words are removed, so that [ "X", "dimensionless", "proportional", "constant" ] is obtained.
(4) Generating TF-IDF feature vectors, training Word2Vec models, and combining the TF-IDF feature vectors with Word2Vec Word vectors through weighted average to form a new vector representation.
Using the TF-IDF feature-vector construction described in the above embodiment, the TF-IDF vector of the "error word" in this example, "dimensionless," is obtained as [0.5, 0.2, 0.3].
The Word2Vec word vector of "dimensionless" obtained using the Word2Vec model is [0.5, -0.1, 0.2].
In this embodiment, α is taken as 0.5, i.e., the two vectors contribute equally, and the weighted fusion vector is:

v_fused = 0.5 · [0.5, 0.2, 0.3] + 0.5 · [0.5, -0.1, 0.2] = [0.5, 0.05, 0.25].

Further, normalization is performed:

v_norm = v_fused / ‖v_fused‖ ≈ [0.891, 0.089, 0.445].
(5) Performing the matching operation using the fused vector representation. When a matching "error word" is found, the corresponding "standard word" is looked up in the database and used for replacement.
In this embodiment, the "dimensionless" is replaced with "dimension one".
It should be noted that whether a word in the original text is replaced is determined by the database query result: if a corresponding error word is found, the replacement is executed; if not, the word is kept as-is or another preset strategy is adopted.
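A simplified sketch of the replacement step (a direct dictionary lookup stands in here for the vector-based matching described above; all names are illustrative):

```python
def correct_text(tokens: list[str], corrections: dict[str, str]) -> list[str]:
    """Replace each matched error word with its standard word; leave everything else as-is."""
    return [corrections.get(tok, tok) for tok in tokens]

tokens = ["X", "is", "a", "dimensionless", "proportionality", "constant"]
corrections = {"dimensionless": "dimension one"}  # error word -> standard word
print(correct_text(tokens, corrections))
# → ['X', 'is', 'a', 'dimension one', 'proportionality', 'constant']
```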
(6) According to the actual matching and replacing effects, model parameters, feature fusion methods and the like are adjusted and optimized, so that the accuracy and efficiency of operation are improved.
Specifically, if "dimensionless" is found to be replaced incorrectly, the value of the weight coefficient α can be adjusted, or other parameters of the TF-IDF and Word2Vec models can be reselected and the model retrained.
(7) The editors confirm the words to be replaced by the system one by one; after confirmation, the editors complete the modification and save the document.
Specifically, a professional reviews and confirms the results of the automatic replacement to ensure their accuracy. If, during review, the staff finds that "dimensionless" in "X is a dimensionless proportionality constant" has been correctly replaced with "dimension one," the replacement is confirmed. If a replacement is incorrect, the staff corrects it and provides feedback to the system to optimize the replacement algorithm.
As an optional embodiment, the invention also provides a specific process for text editing by using the scientific paper standardization intelligent detection and correction system, which is as follows:
(1) Data entry: the "error words" and "standard words" in the empirical rules are entered into the preset database by category (specialized-term class, common conventional-expression class, standardized quantity-name class, error-prone word class, etc.);
(2) Opening the document: importing the document to be edited;
(3) Information modification: the system extracts the text information from the Word document and preprocesses it, then compares the extracted data with the "error words" in the database; words with no matching "error word" are skipped directly, and any "error word" that occurs is replaced with its "standard word";
(4) Confirming the modification: the modification content is confirmed, and the program automatically adds the confirmed modification records to the data information base for storage and backup;
(5) Viewing modification records: the confirmed modification information is stored and can be queried.
It should be noted that the above system is only one feasible text-editing system designed on the basis of the text correction method provided by the present invention. Developers may divide the system into different functional modules according to actual needs and design a more elegant human-machine interface; however, any system that adopts the text correction method provided by the present invention falls substantially within the protection scope of the present invention.
FIG. 4 is a schematic diagram of a text correction apparatus according to the present invention, as shown in FIG. 4, the apparatus includes a document processing module 410, a feature extraction module 420, a feature fusion module 430, a matching module 440, and a text replacement correction module 450;
A document processing module 410, configured to pre-process a current document to construct a vocabulary of the current document;
The feature extraction module 420 is configured to input a vocabulary of a current document into a Word2Vec model that is trained in advance, extract Word2Vec Word vectors of each Word in the vocabulary, and calculate TF-IDF values corresponding to each Word in each document by using a TF-IDF algorithm under a preset document set, so as to construct TF-IDF feature vectors of each Word;
The feature fusion module 430 is configured to perform vector feature fusion on the TF-IDF feature vector and Word2Vec Word vector of each Word to form a vector representation of each Word;
The matching module 440 is configured to match the corresponding target error word in a preset database according to the vector representation of each word, where the preset database includes multiple error words and standard words that are in one-to-one correspondence, the error words are words that do not meet a preset standard, the standard words are words that meet the preset standard, and the preset standard is determined according to an industry standard and expert rules;
And the text replacement correction module 450 is configured to replace each matched word with the standard word corresponding to its target error word, so as to realize text correction of the current document.
It should be noted that, when the text correction apparatus provided in the embodiment of the present invention is specifically executed, the text correction method described in any one of the above embodiments may be executed, which is not described in detail in this embodiment.
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention, as shown in fig. 5, the electronic device may include a processor (processor) 510, a communication interface (communications interface) 520, a memory (memory) 530, and a communication bus 540, where the processor 510, the communication interface 520, and the memory 530 complete communication with each other through the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform the text correction method.
Further, the logic instructions in the memory 530 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the text correction method provided by the above embodiments.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the text correction method provided by the above embodiments.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.