CN112528654A - Natural language processing method and device and electronic equipment - Google Patents

Natural language processing method and device and electronic equipment

Info

Publication number
CN112528654A
CN112528654A (application CN202011479380.3A)
Authority
CN
China
Prior art keywords
text data
vector
word
weight
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011479380.3A
Other languages
Chinese (zh)
Other versions
CN112528654B (en)
Inventor
吕少科
蒋宏飞
宋旸
邵睿
张振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zuoyebang Education Technology Beijing Co Ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202011479380.3A
Publication of CN112528654A
Application granted
Publication of CN112528654B
Status: Active
Anticipated expiration


Abstract


The present invention belongs to the technical field of computer information processing and provides a natural language processing method, apparatus, electronic device, and computer-readable medium. The method includes: performing word segmentation on the characters in text data to obtain characters and/or words; inputting the text data and its corresponding domain attribute into a character vector model to obtain a character vector; inputting the text data and its corresponding domain attribute into a vocabulary vector model to obtain a word vector; determining, based on the text data, a first weight corresponding to the characters and/or a second weight corresponding to the words; determining a sentence semantic vector of the text data from the character vector, the first weight and/or the word vector, the second weight; and performing natural language processing on real-time text data based on the sentence semantic vector. The present invention can effectively improve the semantic expressiveness of sentences.

Figure 202011479380

Description

Natural language processing method and device and electronic equipment
Technical Field
The invention belongs to the technical field of computer information processing, is particularly suitable for machine semantic recognition, and relates to a natural language processing method, a natural language processing apparatus, an electronic device, and a computer-readable medium.
Background
Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers in natural language, drawing on linguistics, computer science, and mathematics. NLP is not the general study of natural language; rather, it aims to develop computer systems, and particularly software systems, that can effectively carry out natural language communication, and it is therefore part of computer science. In practice, implementing human-machine natural language communication, that is, natural language understanding and natural language generation, is very difficult. A Chinese text or a string of Chinese characters (including punctuation, etc.) may carry multiple meanings, which is a major difficulty and obstacle in natural language understanding. Conversely, the same or a similar meaning can be expressed by many different Chinese texts or strings.
Modern NLP algorithms are based on machine learning, in particular statistical machine learning, a paradigm distinct from earlier approaches to language processing, which typically implemented tasks with large sets of directly hand-written rules. The common approach today is to train a machine learning model on a general corpus, perform word segmentation on a piece of natural-language text, feed the segmentation result into the trained model, and then perform semantic recognition based on the resulting word vectors.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to resolve two shortcomings of prior-art natural language processing. First, prior-art processing is based purely on word segmentation, whereas in actual Chinese a single character can also express multiple meanings. Second, prior-art models are trained on a broad corpus in an attempt to obtain one natural language processing model suitable for all scenarios. These two shortcomings make prior-art natural language processing models slow to compute, and their analysis results are insufficiently accurate in some scenarios.
In order to solve the above technical problem, an aspect of the present invention provides a natural language processing method, including: performing word segmentation on the characters in the text data to obtain characters and/or words; inputting the text data and the corresponding domain attribute into a character vector model to obtain a character vector; inputting the text data and the corresponding domain attribute into a vocabulary vector model to obtain a word vector; determining a first weight corresponding to the characters and/or a second weight corresponding to the words based on the text data; determining a sentence semantic vector of the text data from the character vector, the first weight and/or the word vector, the second weight; and performing natural language processing on the real-time text data based on the sentence semantic vector.
According to a preferred embodiment of the present invention, the method further comprises: extracting sentence semantic vectors of a plurality of preset text data in a database; comparing the similarity of the text data with the preset text data based on the sentence semantic vectors; and determining target text data from the plurality of preset text data according to the similarity comparison result.
According to a preferred embodiment of the present invention, the method further comprises: training a deep neural network model based on a plurality of corpora with domain attributes to generate the character vector model; and training a shallow neural network model based on a plurality of corpora with domain attributes to generate the vocabulary vector model.
According to a preferred embodiment of the present invention, performing word segmentation on the characters in the text data to obtain characters and/or words includes: acquiring a word segmentation dictionary; performing word segmentation on the text data based on the word segmentation dictionary to generate a vocabulary network, wherein the vocabulary network is a directed acyclic graph; and determining the words based on the vocabulary network.
According to a preferred embodiment of the present invention, determining the words based on the vocabulary network comprises: determining a maximum-probability path in the vocabulary network based on a dynamic programming algorithm; and determining the words based on the maximum-probability path.
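As an illustration of the dictionary-plus-DAG segmentation with a dynamic-programming maximum-probability path described above, the following Python sketch uses a tiny hypothetical frequency dictionary (the `FREQ` table and its counts are assumptions for illustration; production segmenters such as jieba apply the same idea at scale):

```python
import math

# Hypothetical word-frequency dictionary (an assumption for illustration).
FREQ = {"今": 5, "天": 8, "今天": 50, "是": 40, "阴": 3, "阴天": 30}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, list the end indices of all dictionary words."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown single char falls back to itself
    return dag

def max_prob_path(sentence):
    """Dynamic programming over the DAG: route[i] = (best log-prob from i, next index)."""
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    # Walk the maximum-probability route to emit the segmented words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(max_prob_path("今天是阴天"))  # → ['今天', '是', '阴天']
```

With these toy frequencies, the path 今天 / 是 / 阴天 dominates the character-by-character path, which matches the intent of selecting the maximum-probability path through the vocabulary network.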
According to a preferred embodiment of the present invention, after performing word segmentation on the characters in the text data to obtain characters and/or words, the method further includes: determining a domain attribute of the text data based on the content of the text data; and/or determining a domain attribute of the text data based on a tag of the text data.
According to a preferred embodiment of the present invention, inputting the text data and the corresponding domain attribute into a character vector model to obtain a character vector includes: inputting the text data and its corresponding domain attributes into a trained BERT model to generate a character vector.
According to a preferred embodiment of the present invention, inputting the text data and the corresponding domain attribute into a vocabulary vector model to obtain a word vector includes: inputting the text data and its corresponding domain attribute into a trained Word2vec model to generate a word vector.
According to a preferred embodiment of the present invention, determining a first weight corresponding to the characters and/or a second weight corresponding to the words based on the text data comprises: determining the first weight and/or the second weight based on the inverse document frequency of the characters and/or the words in the text data.
According to a preferred embodiment of the present invention, determining the sentence semantic vector of the text data from the character vector, the first weight and/or the word vector, the second weight includes: weighting and concatenating the character vector and/or the word vector according to the first weight and/or the second weight to generate the sentence semantic vector.
A second aspect of the present invention provides a natural language processing apparatus, including: a word segmentation module for performing word segmentation on characters in the text data to obtain characters and/or words; a character module for inputting the text data and the corresponding domain attribute into a character vector model to obtain a character vector; a vocabulary module for inputting the text data and the corresponding domain attribute into a vocabulary vector model to obtain a word vector; a weighting module for determining a first weight corresponding to the characters and/or a second weight corresponding to the words based on the text data; a vector module configured to determine a sentence semantic vector of the text data from the character vector, the first weight and/or the word vector, the second weight; and a semantic module for performing natural language processing on the real-time text data based on the sentence semantic vector.
A third aspect of the invention proposes an electronic device comprising a processor and a memory for storing a computer-executable program, which, when executed by the processor, performs the method.
The fourth aspect of the present invention also provides a computer-readable medium storing a computer-executable program, which when executed, implements the method.
According to the natural language processing method and apparatus, the electronic device, and the computer-readable medium of the present invention, characters and/or words are obtained by performing word segmentation on the characters in the text data; the text data and the corresponding domain attribute are input into a character vector model to obtain a character vector; the text data and the corresponding domain attribute are input into a vocabulary vector model to obtain a word vector; a first weight corresponding to the characters and/or a second weight corresponding to the words is determined based on the text data; a sentence semantic vector of the text data is determined from the character vector, the first weight and/or the word vector, the second weight; and natural language processing is performed based on the sentence semantic vector. The method can effectively improve the semantic expressiveness of sentences, greatly enhances sentence-level semantic expression in natural language processing tasks while keeping those tasks simple and efficient, and thereby benefits downstream tasks.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Fig. 1 is a system block diagram of a natural language processing method and apparatus according to an embodiment of the present invention.
FIG. 2 is a flow diagram of a natural language processing method of one embodiment of the invention.
FIG. 3 is a flow diagram of a natural language processing method of one embodiment of the invention.
FIG. 4 is a flow diagram of a natural language processing method of one embodiment of the invention.
Fig. 5 is a block diagram of a natural language processing apparatus according to an embodiment of the present invention.
FIG. 6 is a schematic structural diagram of an electronic device of one embodiment of the invention.
FIG. 7 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention.
Detailed Description
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit devices and/or microcontroller devices.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms; the terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
In order to solve the technical problem, the invention provides a natural language processing method and apparatus, an electronic device, and a computer-readable medium, which perform word segmentation on the characters in text data to obtain characters and/or words; input the text data and the corresponding domain attribute into a character vector model to obtain a character vector; input the text data and the corresponding domain attribute into a vocabulary vector model to obtain a word vector; determine a first weight corresponding to the characters and/or a second weight corresponding to the words based on the text data; determine a sentence semantic vector of the text data from the character vector, the first weight and/or the word vector, the second weight; and perform natural language processing based on the sentence semantic vector. This can effectively improve the semantic expressiveness of sentences and greatly enhance sentence-level semantic expression while keeping natural language processing tasks simple and efficient, thereby benefiting downstream tasks.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Fig. 1 is a system block diagram illustrating a natural language processing method and apparatus according to an example embodiment.
As shown in fig. 1, the system architecture 10 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications, such as a learning-assistance application, a web browser application, an instant messaging tool, a mailbox client, and social platform software, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
In one embodiment, the terminal devices 101, 102, 103 may, for example, perform word segmentation on the characters in the text data to obtain characters and/or words; the terminal devices 101, 102, and 103 may, for example, input the text data and the corresponding domain attribute into a character vector model to obtain a character vector; the terminal devices 101, 102, and 103 may, for example, input the text data and the corresponding domain attribute into a vocabulary vector model to obtain a word vector; the terminal devices 101, 102, 103 may determine, for example based on the text data, a first weight corresponding to the characters and/or a second weight corresponding to the words; the terminal devices 101, 102, 103 may determine a sentence semantic vector of the text data, e.g. from the character vector, the first weight and/or the word vector, the second weight; the terminal devices 101, 102, 103 may perform natural language processing on the real-time text data, e.g. based on the sentence semantic vector. The character vector model and the vocabulary vector model may be located locally on the terminal devices 101, 102, 103 or on the server 105 side.
The server 105 may be a server that provides various services, such as a background management server supporting video-learning websites browsed by users of the terminal devices 101, 102, 103. The background management server may perform natural language processing on received text data and feed the processing result back to the terminal devices 101, 102, and 103.
In one embodiment, the server 105 may obtain text data, for example, from the terminal devices 101, 102, 103, and then perform word segmentation on the characters in the text data to obtain characters and/or words; the server 105 may, for example, input the text data and the corresponding domain attribute into a character vector model to obtain a character vector; the server 105 may, for example, input the text data and the corresponding domain attribute into a vocabulary vector model to obtain a word vector; the server 105 may determine a first weight corresponding to the characters and/or a second weight corresponding to the words, e.g., based on the text data; the server 105 may determine a sentence semantic vector of the text data, e.g., from the character vector, the first weight and/or the word vector, the second weight; the server 105 may perform natural language processing on the real-time text data, for example, based on the sentence semantic vector.
The server 105 may also extract, for example, sentence semantic vectors of a plurality of preset text data in the database; the server 105 may also compare the similarity of the text data with the plurality of preset text data, for example, based on the sentence semantic vectors; the server 105 may also determine target text data from the plurality of preset text data, for example, according to the similarity comparison result.
The server 105 may also train a deep neural network model, for example, based on a plurality of corpora with domain attributes, to generate the character vector model; the server 105 may also train a shallow neural network model, for example, based on a plurality of corpora with domain attributes, to generate the vocabulary vector model.
The server 105 may be a single physical server or may be composed of a plurality of servers; for example, part of the server 105 may perform training of machine learning models, e.g., generating the character vector model and the vocabulary vector model, while another part of the server 105 may perform natural language processing on the text data.
It should be noted that the natural language processing method provided by the embodiment of the present disclosure may be executed by the server 105 or the terminal devices 101, 102, and 103, and accordingly, the natural language processing apparatus may be disposed in the server 105 or in the terminal devices 101, 102, and 103.
FIG. 2 is a flow diagram illustrating a natural language processing method in accordance with an exemplary embodiment. The natural language processing method 20 includes at least steps S202 to S212.
As shown in fig. 2, in S202, word segmentation processing is performed on words in the text data to obtain words and/or vocabularies. The text data may be text data from a user in a human-computer interaction process or text data converted from voice data of the user, and the text data may include one sentence or a plurality of sentences composed of natural languages.
In the present disclosure, the word segmentation process may be Chinese word segmentation: the process of recombining a continuous character sequence into a word sequence according to a given specification. Existing word segmentation algorithms fall into three major categories: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics. According to whether segmentation is combined with part-of-speech tagging, methods can also be divided into pure segmentation methods and integrated methods that combine segmentation and tagging. In the present invention, the text data may be segmented by one or more of the above methods to generate a plurality of Chinese characters and words.
In one embodiment, further comprising: determining a domain attribute of the text data based on the content of the text data; and/or determining a domain attribute of the text data based on a tag of the text data. The domain attribute of the text data can be obtained from a dialog request of a man-machine dialog, and the domain attribute can also be determined from characters after word segmentation of the text data, which is not limited in the present disclosure.
In S204, the text data and the corresponding domain attribute are input into a character vector model, and a character vector is obtained. The text data and its corresponding domain attributes may be input into a trained BERT model, for example, to generate a character vector.
In one embodiment, the method further comprises: training a deep neural network model based on a plurality of corpora with domain attributes to generate the character vector model. The deep neural network model may be a BERT-series deep neural network model, specifically including the BERT model, the ALBERT model, and the like. BERT is a method of pre-training language representations: a general-purpose "language understanding" model is trained on a large text corpus (such as Wikipedia) and then used to perform the desired NLP task. BERT is the first unsupervised, deeply bidirectional system for pre-training NLP representations. Unsupervised means that BERT needs only plain text corpora for training, and huge amounts of plain text are publicly available on the web in many languages. A pre-trained representation may be context-free or contextual, and a contextual representation may be unidirectional or bidirectional.
In the embodiment of the invention, when training the BERT model, the input corpus data is divided according to domain attributes, such as a "mathematics" domain or a "chemistry" domain, and corpus data from different attribute domains train different BERT models, generating character vector models for different domain attributes.
In S206, the text data and the corresponding domain attribute are input into a vocabulary vector model, and a word vector is obtained. The text data and its corresponding domain attributes may be input into a trained Word2vec model, for example, to generate a Word vector.
In one embodiment, the method further comprises: training a shallow neural network model based on a plurality of corpora with domain attributes to generate the vocabulary vector model. The shallow neural network model may be a Word2vec model. Word2vec is a group of related models used to generate word vectors: shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network is trained to predict words in adjacent positions from an input word; under word2vec's bag-of-words assumption, the order of the words is unimportant. After training, the word2vec model can map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.
In the embodiment of the invention, when training the word2vec model, the input corpus data is likewise divided according to domain attributes, such as a "mathematics" domain or a "chemistry" domain, and corpus data from different attribute domains train different word2vec models, generating vocabulary vector models for different domain attributes.
In S208, a first weight corresponding to the characters and/or a second weight corresponding to the words is determined based on the text data. The first weight and/or the second weight may be determined based on the inverse document frequency of the characters and/or the words in the text data.
TF-IDF (term frequency-inverse document frequency) is a statistical method for evaluating the importance of a character or word to a document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in a document but decreases in inverse proportion to its frequency across the corpus. The present invention uses various forms of TF-IDF weighting as importance scores for characters or words.
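As a minimal illustration of the inverse-document-frequency weighting described above, the following Python sketch computes a smoothed IDF over a tiny hypothetical corpus (the corpus contents and the log((1+N)/(1+n)) smoothing are assumptions for illustration, not the patent's exact formula):

```python
import math

def idf(term, documents):
    """Smoothed inverse document frequency: log((1 + N) / (1 + n_t)).

    N is the number of documents and n_t the number containing the term;
    the +1 smoothing avoids division by zero for unseen terms.
    """
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc)
    return math.log((1 + n_docs) / (1 + n_containing))

# Tiny hypothetical corpus of already-segmented sentences.
corpus = [
    ["今天", "是", "阴天"],
    ["今天", "是", "晴天"],
    ["明天", "是", "阴天"],
]

# "是" appears in every document, so its weight is lowest;
# rarer tokens such as "阴天" receive a higher weight.
print(idf("是", corpus), idf("阴天", corpus))
```

These IDF values would then serve as the first and/or second weights when combining character and word vectors into the sentence vector.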
In S210, a sentence semantic vector of the text data is determined from the character vector, the first weight and/or the word vector, the second weight. For example, the character vector and/or the word vector may be weighted according to the first weight and/or the second weight and the results combined and concatenated to generate the sentence semantic vector.
In a specific embodiment, the text data is: "today is cloudy".
1) "today is cloudy" is segmented by words: today / is / cloudy;
2) each word vector is obtained: today: [0.1, 0.2, 0.3]; is: [0.4, 0.5, 0.6]; cloudy: [0.7, 0.8, 0.9];
3) the sentence vector for the word-granularity segmentation is: (idf(today)×[0.1, 0.2, 0.3] + idf(is)×[0.4, 0.5, 0.6] + idf(cloudy)×[0.7, 0.8, 0.9])/3; the result is a three-dimensional vector (vectors of higher dimension may be used, and the application is not limited in this respect);
4) "today is cloudy" (in the original Chinese, "今天是阴天") is segmented by characters: 今 / 天 / 是 / 阴 / 天;
5) each character vector is obtained: 今: [0.1, 0.2, 0.3]; 天: [0.12, 0.82, 0.92]; 是: [0.4, 0.5, 0.6]; 阴: [0.7, 0.8, 0.9]. (Note: the character vectors and the word vectors of step 2) are not required to have equal length, but preferably the character-vector and word-vector lengths do not differ greatly.)
6) the sentence vector for the character-granularity segmentation is: (idf(今)×[0.1, 0.2, 0.3] + idf(天)×[0.12, 0.82, 0.92] + idf(是)×[0.4, 0.5, 0.6] + idf(阴)×[0.7, 0.8, 0.9] + idf(天)×[0.12, 0.82, 0.92])/5; the result is again a three-dimensional vector (vectors of higher dimension may be used, and the application is not limited in this respect);
7) the sentence vectors of all granularities are concatenated: (sentence vector 1, sentence vector 2, …); the final dimension is the sum of the lengths of the concatenated sentence vectors.
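The steps above can be sketched as follows. The embedding values mirror the example; the IDF weights and the romanized character names (jin, tian, shi, yin standing in for 今, 天, 是, 阴) are illustrative assumptions, not values from the patent.

```python
import numpy as np

def granularity_sentence_vector(tokens, embeddings, idf):
    """IDF-weighted average of token vectors at one segmentation granularity."""
    weighted = [idf[t] * np.asarray(embeddings[t]) for t in tokens]
    return sum(weighted) / len(tokens)

# Illustrative 3-dimensional embeddings and made-up IDF weights.
word_emb = {"today": [0.1, 0.2, 0.3], "is": [0.4, 0.5, 0.6], "cloudy": [0.7, 0.8, 0.9]}
char_emb = {"jin": [0.1, 0.2, 0.3], "tian": [0.12, 0.82, 0.92],
            "shi": [0.4, 0.5, 0.6], "yin": [0.7, 0.8, 0.9]}
word_idf = {"today": 1.0, "is": 0.5, "cloudy": 2.0}
char_idf = {"jin": 1.0, "tian": 0.8, "shi": 0.5, "yin": 2.0}

# Steps 3) and 6): one sentence vector per granularity.
v_word = granularity_sentence_vector(["today", "is", "cloudy"], word_emb, word_idf)
v_char = granularity_sentence_vector(["jin", "tian", "shi", "yin", "tian"],
                                     char_emb, char_idf)

# Step 7): concatenate the per-granularity sentence vectors; the final
# dimension is the sum of the individual lengths (3 + 3 = 6 here).
sentence_vec = np.concatenate([v_word, v_char])
```

Note that the two granularities need not share an embedding dimension; concatenation only requires each per-granularity vector to have a fixed length.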
In S212, natural language processing is performed on the real-time text data based on the sentence semantic vector.
In one embodiment, the method further comprises: extracting sentence semantic vectors of a plurality of preset text data in a database; comparing the similarity of the text data with the preset text data based on the sentence semantic vectors; and determining target text data from the plurality of preset text data according to the similarity comparison result. For example, a user inputs a piece of text, such as a math word problem. A search is then performed in a question bank: the cosine distance between the sentence semantic vector of the input text and that of each question in the bank is computed as the similarity of the two sentences, the question most similar to the input and its corresponding solution are determined, and the result is returned to the user.
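The question-bank retrieval described above can be sketched as follows; the question identifiers and the pre-computed sentence vectors are hypothetical placeholders for vectors produced by the method of S202–S210.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two sentence semantic vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_question(query_vec, question_bank):
    """Return the id of the bank entry whose sentence vector is most
    similar to the query's sentence vector under cosine similarity."""
    return max(question_bank,
               key=lambda qid: cosine_similarity(query_vec, question_bank[qid]))

# Hypothetical pre-computed sentence semantic vectors for three bank questions.
bank = {"q1": [1.0, 0.0, 0.0],
        "q2": [0.0, 1.0, 0.0],
        "q3": [0.7, 0.7, 0.0]}
best = most_similar_question([0.9, 0.1, 0.0], bank)  # closest to "q1"
```

In practice the solution associated with the retrieved question id would then be returned to the user.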
According to the natural language processing method of the present disclosure, word segmentation processing is performed on the characters in text data to obtain characters and/or vocabularies; the text data and its corresponding domain attribute are input into a character vector model to obtain character vectors; the text data and its corresponding domain attribute are input into a vocabulary vector model to obtain word vectors; a first weight corresponding to the characters and/or a second weight corresponding to the vocabularies is determined based on the text data; a sentence semantic vector of the text data is determined from the character vectors and the first weights, and/or the word vectors and the second weights; and natural language processing is performed based on the sentence semantic vector. This method effectively improves the semantic expressiveness of sentences: on the premise of keeping sentence-level natural language processing tasks simple and efficient, it greatly enhances their semantic expressive power and thereby benefits downstream tasks.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Fig. 3 is a flow diagram illustrating a natural language processing method in accordance with another exemplary embodiment. The process 30 shown in fig. 3 is a detailed description of S202 "performing word segmentation processing on words in text data to obtain words and/or phrases" in the process shown in fig. 2.
As shown in fig. 3, in S302, a word segmentation dictionary is acquired.
In S304, the text data is segmented based on the word segmentation dictionary to generate a vocabulary network, where the vocabulary network is a directed acyclic graph (DAG). A directed acyclic graph is a directed graph that contains no cycle. If, in a directed graph, one could travel from A through B and C and back to A, the graph would contain a cycle and would not be acyclic; reversing the edge C→A into A→C removes the cycle and yields a directed acyclic graph. The number of spanning trees of a directed acyclic graph equals the product of the in-degrees of its nodes with non-zero in-degree.
In S306, a maximum-probability path in the vocabulary network is determined based on a dynamic programming algorithm. Dynamic programming is typically used to solve problems with an optimality property: such a problem may have many feasible solutions, each associated with a value, and the goal is to find the solution with the optimal value. Like divide-and-conquer, the basic idea is to decompose the problem to be solved into several subproblems, solve the subproblems first, and then obtain the solution of the original problem from the solutions of the subproblems.
In the vocabulary network, given the state of one stage, the choice (action) that moves from that state to a state of the next stage is called a decision. The sequence of decisions made at the successive stages is called a policy. For each actual multi-stage decision process, the policies that can be selected are limited to a certain range, called the set of allowed policies. In the present invention, the optimal policy lies on the maximum-probability path over all segmented words.
More specifically, in the present invention, all candidate segmentation paths are enumerated through the vocabulary network. The segmentation selected is the path with the highest probability, where the probability of a path is the product of the probabilities of all the words on that path.
In S308, the vocabulary is determined based on the most probable path.
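The lattice construction (S304) and the maximum-probability path search (S306–S308) can be sketched as follows, in the style of dictionary-based segmenters. The word-frequency table is illustrative, and out-of-vocabulary single characters fall back to a count of 1 so that a path always exists; neither assumption comes from the patent.

```python
import math

def max_prob_segment(sentence, word_freq):
    """Segment a sentence along the maximum-probability path of its word lattice."""
    total = sum(word_freq.values())
    n = len(sentence)
    # S304: build the vocabulary network as a DAG. dag[i] lists the end
    # positions j (exclusive) such that sentence[i:j] is a dictionary word;
    # single characters are always allowed so the lattice stays connected.
    dag = {i: [j for j in range(i + 1, n + 1)
               if sentence[i:j] in word_freq or j == i + 1]
           for i in range(n)}
    # S306: dynamic programming from right to left over the DAG.
    # best[i] = (log-probability of the best path from i to the end, next split).
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(word_freq.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in dag[i])
    # S308: read the vocabulary off the maximum-probability path.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words

freq = {"今天": 10, "是": 20, "阴天": 8, "今": 1, "天": 2, "阴": 1}
segmented = max_prob_segment("今天是阴天", freq)  # ["今天", "是", "阴天"]
```

Multiplying word probabilities along a path corresponds to summing log-probabilities, which is what the DP table accumulates.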
Fig. 4 is a flowchart illustrating a natural language processing method according to another exemplary embodiment. The flow 40 shown in fig. 4 is a detailed description of the overall process of the natural language processing method of the present invention.
As shown in fig. 4, in S402, corpus data is collected, and a plurality of corpus data sets are generated. Open domain corpora are collected.
In S404, it is determined whether the processing of the data in each corpus data set is complete.
In S406, sentences are obtained by cleaning and screening: the corpus data is screened and cleaned to obtain text sentences.
In S408, the character/word segmentation process counts their IDFs respectively: the text sentences are segmented at character granularity and at word granularity respectively, and the IDF of each segmented character and word is counted.
In S410, character/word vectors are trained on the character-segmented and word-segmented sentences respectively: a neural network is trained on each to obtain the corresponding character/word vectors, which are then stored.
In S412, the model is saved.
In S414, the character/word vectors are obtained, IDF-weighted, and concatenated to obtain the sentence vector. For a new sentence, after segmentation into characters and words, the corresponding character/word vectors are fetched, weighted by TF-IDF, and concatenated to generate a sentence semantic vector as the representation of the sentence.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as computer programs executed by a CPU. When executed by the CPU, the program performs the functions defined by the above-described methods provided by the present disclosure. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 5 is a block diagram illustrating a natural language processing device according to an example embodiment. As shown in fig. 5, the natural language processing device 50 includes: a word segmentation module 502, a character module 504, a vocabulary module 506, a weight module 508, a vector module 510, and a semantic module 512.
The word segmentation module 502 is configured to perform word segmentation on the characters in the text data to obtain characters and/or vocabularies;
the character module 504 is configured to input the text data and its corresponding domain attribute into a character vector model to obtain a character vector;
the vocabulary module 506 is configured to input the text data and its corresponding domain attribute into a vocabulary vector model to obtain a word vector;
the weight module 508 is configured to determine a first weight corresponding to the characters and/or a second weight corresponding to the vocabularies based on the text data;
the vector module 510 is configured to determine a sentence semantic vector of the text data from the character vector and the first weight, and/or the word vector and the second weight;
the semantic module 512 is configured to perform natural language processing on the real-time text data based on the sentence semantic vector.
According to the natural language processing device of the present disclosure, word segmentation processing is performed on the characters in text data to obtain characters and/or vocabularies; the text data and its corresponding domain attribute are input into a character vector model to obtain character vectors; the text data and its corresponding domain attribute are input into a vocabulary vector model to obtain word vectors; a first weight corresponding to the characters and/or a second weight corresponding to the vocabularies is determined based on the text data; a sentence semantic vector of the text data is determined from the character vectors and the first weights, and/or the word vectors and the second weights; and natural language processing is performed based on the sentence semantic vector. This device effectively improves the semantic expressiveness of sentences: on the premise of keeping sentence-level natural language processing tasks simple and efficient, it greatly enhances their semantic expressive power and thereby benefits downstream tasks.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device includes a processor and a memory, where the memory stores a computer-executable program; when the computer program is executed by the processor, the processor executes the natural language processing method of the present invention.
As shown in fig. 6, the electronic device takes the form of a general-purpose computing device. There may be one or more processors, which may work together. The invention also does not exclude distributed processing, i.e. the processors may be distributed over different physical devices. The electronic device of the present invention is not limited to a single entity and may be the sum of a plurality of physical devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable an electronic device to perform the method of the invention, or at least some of the steps of the method.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).
Optionally, in this embodiment, the electronic device further includes an I/O interface, which is used for data exchange between the electronic device and an external device. The I/O interface may be a local bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, and/or a memory storage device using any of a variety of bus architectures.
It should be understood that the electronic device shown in fig. 6 is only one example of the present invention, and elements or components not shown in the above example may be further included in the electronic device of the present invention. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a human-computer interaction element such as a button, a keyboard, and the like. Electronic devices are considered to be covered by the present invention as long as the electronic devices are capable of executing a computer-readable program in a memory to implement the method of the present invention or at least a part of the steps of the method.
Fig. 7 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention. As shown in fig. 7, the computer-readable recording medium stores a computer-executable program; when the computer-executable program is executed, the natural language processing method according to the present invention is implemented. The computer-readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
From the above description of the embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention and the electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system. The invention can also be implemented by computer software performing the method of the invention, for example control software executed by a microprocessor, an electronic control unit, a client, a server, etc. It should be noted that the computer software for executing the method of the present invention is not limited to execution by one specific hardware entity; it may also be implemented in a distributed manner by unspecified hardware entities, for example some steps of the computer program may be executed on a server while another part is executed on a terminal device. For computer software, the software product may be stored in a computer-readable storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or distributed over a network, as long as it enables the electronic device to perform the method according to the present invention.
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement the present invention. The invention is not limited to the specific embodiments disclosed; all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be embraced.

Claims (10)

1. A natural language processing method, comprising:
performing word segmentation processing on characters in the text data to obtain characters and/or vocabularies;
inputting the text data and the corresponding field attribute into a character vector model to obtain a character vector;
inputting the text data and the corresponding field attribute into a vocabulary vector model to obtain a word vector;
determining a first weight corresponding to the characters and/or a second weight corresponding to the vocabularies based on the text data;
determining a sentence semantic vector of the text data through the character vector and the first weight, and/or the word vector and the second weight;
and performing natural language processing on the real-time text data based on the sentence semantic vector.
2. The natural language processing method according to claim 1, further comprising:
extracting sentence semantic vectors of a plurality of preset text data in a database;
comparing the similarity of the text data and the preset text data based on the sentence semantic vector;
and determining target text data from the plurality of preset text data according to the similarity comparison result.
3. The natural language processing method according to claim 1 or 2, further comprising:
training a deep neural network model based on a plurality of corpora with domain attributes to generate the character vector model;
and training a shallow neural network model based on a plurality of corpora with domain attributes to generate the vocabulary vector model.
4. A natural language processing method according to any one of claims 1 to 3, wherein performing word segmentation processing on words in text data to obtain words and/or vocabularies comprises:
acquiring a word segmentation dictionary;
performing word segmentation on the text data based on the word segmentation dictionary to generate a vocabulary network, wherein the vocabulary network is a directed acyclic graph;
determining the vocabulary based on the vocabulary network.
5. The natural language processing method of any one of claims 1 to 4 wherein determining the vocabulary based on the vocabulary network comprises:
determining a maximum probability path in the vocabulary network based on a dynamic programming algorithm;
determining the vocabulary based on the maximum probability path;
optionally, after performing word segmentation processing on the words in the text data to obtain words and/or vocabularies, the method further includes:
determining a domain attribute of the text data based on the content of the text data; and/or
Determining a domain attribute of the text data based on the label of the text data.
6. The natural language processing method according to any one of claims 1 to 5, wherein inputting the text data and the corresponding domain attribute into a character vector model to obtain a character vector comprises:
inputting the text data and the corresponding domain attribute into a trained BERT model to generate a character vector;
optionally, inputting the text data and the corresponding domain attribute into a vocabulary vector model to obtain a word vector comprises:
inputting the text data and the corresponding domain attribute into a trained Word2vec model to generate a word vector.
7. The method of claim 1, wherein determining a first weight corresponding to the word and/or a second weight corresponding to the vocabulary based on the text data comprises:
determining the first weight and/or the second weight based on an inverse document frequency corresponding to the words and/or the vocabularies in the text data;
optionally, determining a sentence semantic vector of the text data through the character vector and the first weight, and/or the word vector and the second weight comprises:
weighting and concatenating the character vector and/or the word vector according to the first weight and/or the second weight to generate the sentence semantic vector.
8. A natural language processing apparatus, comprising:
the word segmentation module is used for carrying out word segmentation processing on characters in the text data to obtain characters and/or words;
the character module is used for inputting the text data and the corresponding field attribute into a character vector model to obtain a character vector;
the vocabulary module is used for inputting the text data and the corresponding field attribute into a vocabulary vector model to obtain a word vector;
the weighting module is used for determining a first weight corresponding to the characters and/or a second weight corresponding to the vocabularies based on the text data;
a vector module, configured to determine a sentence semantic vector of the text data according to the character vector and the first weight, and/or the word vector and the second weight;
and the semantic module is used for carrying out natural language processing on the real-time text data based on the sentence semantic vector.
9. An electronic device comprising a processor and a memory, the memory for storing a computer-executable program, characterized in that:
the computer-executable program, when executed by the processor, performs the method of any one of claims 1 to 7.
10. A computer-readable medium storing a computer-executable program, wherein the computer-executable program, when executed, implements the method of any one of claims 1 to 7.
Publications (2)

CN112528654A (application publication): 2021-03-19
CN112528654B (granted publication): 2025-07-01



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
