Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, based on the embodiments herein, which would be apparent to one of ordinary skill in the art without undue burden are within the scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
First, referring to fig. 1, fig. 1 is a schematic hardware structure of a reply sentence determining apparatus based on coarse semantics according to an embodiment of the present application. The coarse semantic based reply sentence determining apparatus 100 comprises at least one processor 101, a communication line 102, a memory 103 and at least one communication interface 104.
In this embodiment, the processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in the present application.
Communication line 102 may include a pathway to transfer information between the above-described components.
The communication interface 104, which may be any transceiver-like device (e.g., an antenna), is used to communicate with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 103 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
In this embodiment, the memory 103 may be independently provided and connected to the processor 101 via the communication line 102. The memory 103 may also be integrated with the processor 101. The memory 103 provided by embodiments of the present application may generally be non-volatile. The memory 103 is used for storing the computer-executable instructions for executing the embodiments of the present application, and execution is controlled by the processor 101. The processor 101 is configured to execute the computer-executable instructions stored in the memory 103, thereby implementing the methods provided in the embodiments of the present application described below.
In alternative embodiments, computer-executable instructions may also be referred to as application code, which is not specifically limited in this application.
In alternative embodiments,processor 101 may include one or more CPUs, such as CPU0 and CPU1 in fig. 1.
In an alternative embodiment, the coarse semantic based reply sentence determination device 100 may include multiple processors, such as the processor 101 and the processor 107 in fig. 1. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In an alternative embodiment, the reply sentence determining apparatus 100 based on coarse semantics may be a server; for example, it may be a stand-alone server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), big data, and artificial intelligence platforms. The coarse semantic based reply sentence determining apparatus 100 may further include an output device 105 and an input device 106. The output device 105 communicates with the processor 101 and may display information in a variety of ways. For example, the output device 105 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 106 is in communication with the processor 101 and may receive user input in a variety of ways. For example, the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device.
The above-described reply sentence determining apparatus 100 based on coarse semantics may be a general-purpose device or a special-purpose device. The present embodiment does not limit the type of the reply sentence determining apparatus 100.
Second, it should be noted that the embodiments disclosed herein may acquire and process related data based on artificial intelligence techniques. Artificial intelligence (AI) encompasses the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Finally, the reply sentence determining method based on coarse semantics can be applied to scenarios such as telephone consultation, e-commerce sales, offline in-person sales, service promotion, outbound agent calls, and social platform promotion. In this application, the method is mainly described by taking the telephone consultation scenario as an example; its implementation in other scenarios is similar to that in the telephone consultation scenario and is not described again here.
The rough semantic-based reply sentence determination method disclosed in the present application will be described below:
referring to fig. 2, fig. 2 is a flow chart of a method for determining a reply sentence based on coarse semantics according to an embodiment of the present application. The method for determining the reply sentence based on the rough semantics comprises the following steps:
201: and acquiring the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information at the current moment of the user.
In this embodiment, the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is the smallest. In short, the previous round of voice information is the last sentence the user spoke before speaking the voice information at the current time.
For example, the previous round of voice information may be determined, by the occurrence time of the voice information at the current moment of the user, by querying historical dialogue data that records the dialogue data generated before the current time by the dialogue event to which the voice information belongs. Specifically, two related sentence queues may be stored in the historical dialogue data: one queue is used to store user sentences issued by the user, and the other is used to store reply sentences made by the AI to those user sentences. Meanwhile, each user sentence in the user sentence queue and each reply sentence in the reply sentence queue contains a dialogue identifier and a dialogue occurrence time; user sentences and reply sentences with the same identifier are combined into question-answer pairs through the dialogue identifier, that is, a reply sentence with the same dialogue identifier is the answer to the corresponding user sentence. Therefore, the logical order of questions and answers in the historical dialogue data can be guaranteed, and storing the user's sentences and the AI's sentences separately facilitates searching.
Therefore, in the present embodiment, by querying the user sentence queue, the voice information whose dialogue occurrence time is earlier than that of the voice information at the current moment of the user, and whose occurrence time is closest to it, can be determined as the previous round of voice information.
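The two-queue history storage and the closest-earlier-sentence query described above can be sketched as follows. This is a minimal illustration, not the application's implementation; the field names (`dialogue_id`, `time`, `text`) and sample sentences are assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    dialogue_id: str   # the same identifier pairs a user sentence with its AI reply
    time: float        # dialogue occurrence time
    text: str

# Two related sentence queues, as described above:
# one for user sentences, one for AI reply sentences.
user_queue = [
    Utterance("d1", 10.0, "hello"),
    Utterance("d2", 20.0, "what are your opening hours"),
    Utterance("d3", 30.0, "thanks, goodbye"),
]
reply_queue = [
    Utterance("d1", 11.0, "hi, how can I help you"),
    Utterance("d2", 21.0, "we open at 9 am"),
]

def previous_round(queue, current_time):
    """Return the user sentence whose occurrence time is earlier than
    current_time and closest to it (smallest absolute difference)."""
    earlier = [u for u in queue if u.time < current_time]
    if not earlier:
        return None
    return min(earlier, key=lambda u: current_time - u.time)

def reply_for(dialogue_id):
    """Look up the AI reply paired with a user sentence via the dialogue id."""
    return next((r for r in reply_queue if r.dialogue_id == dialogue_id), None)
```

Keeping the queues separate, as the embodiment notes, means the previous-round query only ever scans user sentences, while the dialogue identifier preserves the question-answer pairing.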
202: and carrying out rough semantic extraction on the voice information according to the voice information of the previous round to obtain rough semantic features corresponding to the voice information.
In this embodiment, coarse semantic features may be understood as semantic features that contain the high-level abstract information in the previous round of voice information. Illustratively, multiple high-order parallel sequences may be obtained by actively constructing a high-level coarse sequence representation and then analyzing it. A low-order coarse sequence is then generated through a layered structure, so that the information in the multiple high-order parallel sequences flows into the low-order coarse sequence. This realizes the synchronous extraction of the key information and the coarse information in the voice information, and multiple layers of information can be embodied synchronously. Meanwhile, after the conversion into the low-order coarse sequence, the model that generates the reply sentence can better memorize and understand long-term content, so that meaningful replies closely related to the topic are generated and the user experience is improved.
Illustratively, this embodiment provides a method for performing coarse semantic extraction on the voice information according to the previous round of voice information to obtain the coarse semantic features corresponding to the voice information; as shown in fig. 3, the method includes:
301: and detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information.
In this embodiment, the detection process may be to perform word segmentation after converting the previous round of voice information into text, and then take all the words obtained through the word segmentation processing as the at least one first word. Meanwhile, each of the at least one first word may include a word label, which may be the part-of-speech information of the corresponding first word, for example: a noun, a verb, or a named entity.
Thus, in this embodiment, named entity information may be extracted from the converted text by a conditional random field (CRF) model, and the type of each named entity, such as a person name or an organization name, may be marked by the CRF. Then, word segmentation and part-of-speech (POS) tagging are performed on the text with a POS tagging tool, and the nouns and verbs in the text are extracted. The combined result of the CRF and the POS tagging is needed in this process because POS tags apply only to individual words, whereas a CRF entity may be a complete phrase. For example, in the sentence "I work at Fudan University in Shanghai", the CRF can completely identify the organization-name entity "Fudan University", while the POS tagger can only identify the individual nouns: "Shanghai", "Fudan", and "University". So, when processing entity words, if a POS result is contained in a CRF result, the CRF result is used preferentially; for verbs, only the POS result is used. Thus, the first words containing part-of-speech information labels can be obtained.
In an alternative embodiment, if the language used by the user is English, a set of verbs and named entities in the corresponding field may be pre-constructed, and the verbs and named entities in the original sentence extracted by matching against this set; the English nouns may also be recognized and extracted using POS tagging, so as to obtain the first words containing part-of-speech information labels.
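The CRF/POS merging rule above can be sketched in a few lines. This is a hedged illustration of the preference rule only (CRF result wins when it contains a POS noun; verbs always come from POS); the function name and example sentence are assumptions, and real CRF/POS tagging is replaced by hard-coded results.

```python
def merge_crf_pos(crf_entities, pos_nouns, pos_verbs):
    """Combine CRF and POS results: if a POS noun is contained in a CRF
    entity, the CRF result is used preferentially; verbs always come
    from the POS result."""
    kept_nouns = [n for n in pos_nouns
                  if not any(n in entity for entity in crf_entities)]
    return crf_entities + kept_nouns + pos_verbs

# Illustrative sentence: "I work at Fudan University in Shanghai"
crf = ["Fudan University"]                    # CRF finds the full organization name
nouns = ["Shanghai", "Fudan", "University"]   # POS sees only single words
verbs = ["work"]
first_words = merge_crf_pos(crf, nouns, verbs)
```

Here "Fudan" and "University" are dropped because they are contained in the CRF entity, while "Shanghai" survives as a standalone noun, matching the preference rule described in the text.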
302: the temporal information of the previous round of voice information is determined according to at least one first word.
In this embodiment, the at least one first word obtained by word segmentation may be input to a gated recurrent unit (GRU) encoder for encoding, so as to obtain a second hidden layer state feature vector. The second hidden layer state feature vector is then input to a multilayer perceptron (MLP) to yield a linear output result. Finally, the linear output result is input into a temporal classifier to obtain the temporal information of the previous round of voice information.
Specifically, the structure of the GRU is shown in fig. 4, which includes a reset gate r_t, an update gate z_t, a candidate memory cell h̃_t, and a current-time memory cell h_t.

Specifically, the reset gate r_t can be represented by formula ①:

r_t = σ(W_r X_t + U_r h_(t-1) + b_r).........①

wherein σ is the activation function, and W_r and U_r are the parameter matrices corresponding to the reset gate r_t; their initialized values are random, and new values can be obtained by training the model. b_r is the bias corresponding to the reset gate r_t and is also trainable.

Further, the update gate z_t can be represented by formula ②:

z_t = σ(W_z X_t + U_z h_(t-1) + b_z).........②

wherein W_z and U_z are the parameter matrices corresponding to the update gate z_t; their initialized values are random, and new values can be obtained by training the model. b_z is the bias corresponding to the update gate z_t and is also trainable.

Further, the candidate memory cell h̃_t can be represented by formula ③:

h̃_t = tanh(W X_t + U (r_t ⊙ h_(t-1)) + b).........③

wherein tanh is an activation function, and W and U are the parameter matrices corresponding to the candidate memory cell h̃_t; their initialized values are random, and new values can be obtained by training the model. b is the bias corresponding to the candidate memory cell h̃_t and is also trainable.

Further, the current-time memory cell h_t can be represented by formula ④:

h_t = (1 − z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t.........④

wherein z_t serves as the weight and is trainable, and ⊙ denotes element-wise multiplication.
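The gate computations above can be sketched as a small NumPy GRU cell. This is a hedged illustration, not the application's encoder: the reset/update gates follow the formulas in the text, the candidate-cell and state-update steps use the standard GRU formulation, and the dimension and random initialisation are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hidden/input size (illustrative; the application does not fix dimensions)

# Randomly initialised trainable parameters, as described for the formulas.
W_r, U_r, b_r = rng.standard_normal((d, d)), rng.standard_normal((d, d)), np.zeros(d)
W_z, U_z, b_z = rng.standard_normal((d, d)), rng.standard_normal((d, d)), np.zeros(d)
W,   U,   b   = rng.standard_normal((d, d)), rng.standard_normal((d, d)), np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev):
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)        # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)        # update gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev) + b)   # candidate memory cell
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand            # current-time memory cell
    return h_t

h = np.zeros(d)
for x in rng.standard_normal((3, d)):  # encode a 3-step input sequence
    h = gru_cell(x, h)
```

Because h_t is a convex combination of the previous state and a tanh output, the hidden state stays bounded in [−1, 1], which is part of why the GRU remembers long-term content stably.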
In this embodiment, the structure of the MLP is shown in fig. 5. The MLP is composed of two linear layers and a ReLU activation function. After the linear output result is produced by the last linear layer, it is input into a softmax function for multi-label classification, and finally the temporal classifier judges the tense of the current sentence. This avoids the misidentification and missed identification caused by relying solely on standalone tense markers (such as "already") in traditional tense identification. For example, the voice message "I am running" is in the present progressive tense, but a traditional recognition method fails to recognize this because the sentence contains no such standalone tense marker.
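The Linear → ReLU → Linear → softmax classifier head of fig. 5 can be sketched as follows. This is a minimal illustration with random untrained weights; the hidden size and the three tense classes are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, classes = 8, 3  # e.g. past / present progressive / future (illustrative)

W1, b1 = rng.standard_normal((hidden, hidden)), np.zeros(hidden)
W2, b2 = rng.standard_normal((classes, hidden)), np.zeros(classes)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def tense_probs(h):
    """Linear -> ReLU -> Linear -> softmax, mirroring the MLP of fig. 5."""
    z = np.maximum(0.0, W1 @ h + b1)  # first linear layer + ReLU
    return softmax(W2 @ z + b2)       # second linear layer + softmax

# h would be the second hidden layer state feature vector from the GRU encoder.
probs = tense_probs(rng.standard_normal(hidden))
```

The softmax output is a probability distribution over tense classes, so the classifier can express uncertainty between, say, simple present and present progressive rather than keying on a single marker word.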
303: and adding the temporal information into the word label of each first word to obtain at least one second word corresponding to at least one first word one by one.
In brief, in this embodiment, a second word is a first word whose word label has been supplemented with the temporal information of the corresponding voice information. Therefore, in addition to the part-of-speech information it already carries, the second word also carries the temporal information of the voice information, so that the reply sentence generated later is more accurate.
304: and inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information corresponding to the at least one second word one by one and at least one first hidden layer state characteristic vector corresponding to the at least one second word one by one.
In this embodiment, the coarse encoder may be a GRU encoder. Specifically, during encoding, the at least one second word is sequentially input into the GRU encoder, and the encoder outputs the corresponding coarse context information and first hidden layer state feature vector. In the encoding process, besides the second word currently being encoded, the first hidden layer state feature vector output in the previous encoding step is also used as an input. That is, when encoding the x-th second word, the x-th second word and the (x−1)-th first hidden layer state feature vector may be input to the GRU encoder to obtain the x-th piece of coarse context information and the x-th first hidden layer state feature vector. When x = 1, since there is no 0th second word, only the 1st second word is input into the GRU encoder for encoding.
305: and inputting at least one first hidden layer state feature vector of at least one rough context information into a rough decoder to perform multiple decoding processing to obtain rough semantic features of the voice information.
In this embodiment, when the coarse semantic features are extracted, the second words obtained by splitting differ in how important they are to the voice information. Thus, before the coarse context information is input to the coarse decoder, it may be subjected to attention processing to obtain the importance degree of each piece of coarse context information.

Illustratively, each second word corresponds to one hidden layer state feature vector in the encoder, i.e., a first hidden layer state feature vector; in brief, there are as many first hidden layer state feature vectors as there are pieces of coarse context information. The coarse context information is input into the decoder, and during decoding the decoder calculates the similarity between the feature vector of the current decoding step (the decoder's current output) and the hidden layer state feature vectors obtained from encoding. In this way a similarity value is calculated for each piece of coarse context information, and these similarity values are then normalized to obtain a weight for each piece of coarse context information. The weight corresponding to each piece of coarse context information is multiplied by the hidden layer state feature vector that the encoder produced for it, yielding an attention feature; this attention feature is added to the output feature vector obtained when the coarse context information is input into the decoder, yielding the final feature for that decoding step.
Based on this, the present embodiment provides a method for inputting the at least one piece of coarse context information and the at least one first hidden layer state feature vector into a coarse decoder for multiple decoding processes, to obtain the coarse semantic features of the voice information; as shown in fig. 6, the method includes:
601: in the ith decoding process, the input feature vector Ai Input to a coarse decoder to obtain an output feature vector Bi 。
In the present embodiment, i is an integer greater than or equal to 1 and less than or equal to j, j is the number of at least one piece of rough context information, j is an integer greater than or equal to 1, and when i=1, the feature vector a is inputi Is the 1 st rough context information of the at least one rough context information.
602: calculate the output eigenvector Bi And an ith first hidden layer state feature vector C of the at least one first hidden layer state feature vectori Similarity D betweeni 。
In the present embodiment, the output feature vector B may be calculatedi And the i first hidden layer state feature vector Ci Cosine similarity between them to obtain similarity Di 。
603: for similarity Di Normalization processing is carried out to obtain an input feature vector Ai Weight E of (2)i 。
In the present embodiment, the similarity degree D may bei Normalizing the input softmax function to obtain an input feature vector Ai Weight E of (2)i 。
604: weight Ei And the ith first hidden layer state characteristic vector Ci Multiplying to obtain a weight feature vector Fi 。
605: will weight the feature vector Fi And output feature vector Bi Adding to obtain a target output feature vector Gi 。
606: outputting the target feature vector Gi Input feature vector A as the (i+1) -th decoding processi+1 And (3) performing the (i+1) th decoding processing until the decoding processing is performed for a plurality of times, and obtaining the rough semantic features of the voice information.
Specifically, across the multiple decoding processes, the output of one step is used as the input of the next, and when all the decoding processes have been performed, the final output obtained is the coarse semantic features of the voice information.
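The attention-weighted decoding loop of steps 601–606 can be sketched as follows. This is a hedged illustration only: the coarse decoder is replaced by a tanh stand-in, the similarity weights are normalized across all encoder hidden states (as in the description preceding the steps, since softmax over a single scalar would be degenerate), and all dimensions and initial values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, j = 4, 3  # feature size and number of coarse-context pieces (illustrative)

C = rng.standard_normal((j, d))  # first hidden layer state feature vectors
A1 = rng.standard_normal(d)      # step 601: A_1 is the 1st piece of coarse context info

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decoder_step(a):
    """Stand-in for one pass of the coarse decoder."""
    return np.tanh(a)

a = A1
for i in range(j):
    b = decoder_step(a)                          # 601: output feature vector B_i
    sims = np.array([cosine(b, c) for c in C])   # 602: similarity against hidden states
    w = softmax(sims)                            # 603: normalized weights E
    f = w @ C                                    # 604: attention/weighted feature F_i
    g = f + b                                    # 605: target output feature vector G_i
    a = g                                        # 606: G_i becomes input A_(i+1)

coarse_semantic_feature = a  # final output after the multiple decoding processes
```

The weighted sum lets hidden states of the more important second words dominate each decoding step, which is exactly the "importance degree" the attention processing is meant to capture.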
203: and performing word segmentation processing on the voice information to obtain a keyword group.
In this embodiment, the voice information may be converted into text, and the text then segmented to obtain at least one first keyword. Then, any two different adjacent first keywords, a first adjacent word and a second adjacent word whose field interval is smaller than a first threshold, are combined to obtain at least one second keyword.
Specifically, the first adjacent word and the second adjacent word are any two different adjacent first keywords whose field interval is smaller than the first threshold, where the field interval can be understood as the number of characters between the corresponding positions of the first adjacent word and the second adjacent word in the text. For example, for the text "The Disneyland that opened in Shanghai city in 2016 is located in the Pudong New District", the first keywords obtained after word segmentation and screening may be: "Shanghai city", "2016", "Disney", "land", "Pudong", and "New District". Here, the number of characters between the corresponding positions of the first keywords "2016" and "Disney" in the text is 3, so the field interval between them is 3; the number of characters between the first keywords "Disney" and "land" is 0, so the field interval between them is 0.
In this embodiment, the first threshold may be set to 1. Thus, continuing the example above, the adjacent first keyword pairs satisfying the requirement are: "Disney" and "land", and "Pudong" and "New District". The second keywords "Disneyland" and "Pudong New District" can therefore be derived.
Then, each second keyword in the at least one second keyword is matched against a preset entity library, and the second keywords that fail to match are screened out, yielding at least one third keyword. The first keywords that constitute each third keyword are then deleted from the at least one first keyword, yielding at least one fourth keyword.
Specifically, the fourth keywords are the first keywords remaining after the first keywords constituting each of the at least one third keyword are removed. Illustratively, continuing the example above, assume that the third keyword is "Disneyland", composed of the first keywords "Disney" and "land". Removing "Disney" and "land" from the first keywords "Shanghai city", "2016", "Disney", "land", "Pudong", and "New District" leaves the first keywords "Shanghai city", "2016", "Pudong", and "New District" as the fourth keywords.
Finally, the at least one third keyword and the at least one fourth keyword are combined to obtain the keyword group.
Specifically, continuing the example above, combining the third keyword "Disneyland" with the fourth keywords "Shanghai city", "2016", "Pudong", and "New District" yields the keyword group: "Shanghai city", "2016", "Disneyland", "Pudong", and "New District".
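The keyword-group construction of step 203 can be sketched end to end. This is a hedged illustration: the entity library contents, the pre-computed field intervals, and the space-free joining of adjacent keywords (which mimics a language without word separators) are all assumptions introduced for the example.

```python
ENTITY_LIB = {"Disneyland"}  # illustrative preset entity library

def build_keyword_group(first_keywords, gaps, threshold=1):
    """first_keywords: the first keywords in text order; gaps[i] is the
    field interval (character count) between keywords i and i+1."""
    third, used = [], set()
    for i, gap in enumerate(gaps):
        if gap < threshold:                        # adjacent pair qualifies
            merged = first_keywords[i] + first_keywords[i + 1]
            if merged in ENTITY_LIB:               # keep only library matches
                third.append(merged)
                used.update({i, i + 1})
    # cull the constituents of each third keyword; the rest are fourth keywords
    fourth = [w for i, w in enumerate(first_keywords) if i not in used]
    return third + fourth

keywords = ["Shanghai city", "2016", "Disney", "land", "Pudong", "NewDistrict"]
gaps = [3, 3, 0, 3, 0]  # "Disney"/"land" and "Pudong"/"NewDistrict" are adjacent
group = build_keyword_group(keywords, gaps)
```

"PudongNewDistrict" is combined but fails the entity-library match, so it is screened out and its constituents survive as fourth keywords, mirroring the screening step described above.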
204: and carrying out multiple times of hidden characteristic extraction processing on the key word groups to obtain an initial hidden layer state characteristic vector.
In this embodiment, the keyword group may include at least one keyword, arranged according to the position of each keyword in the voice information. Based on this, the present embodiment provides a method for performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer state feature vector, which specifically includes the following steps:
in the n-th hidden feature extraction process, input the first input hidden feature H_n into the GRU encoder to obtain a first output hidden feature I_n, wherein n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, and m is an integer greater than or equal to 1; when n = 1, the first input hidden feature H_1 is the 1st keyword of the at least one keyword. Then use the first output hidden feature I_n as the first input hidden feature H_(n+1) of the (n+1)-th hidden feature extraction process, and perform the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are completed and the initial hidden layer state feature vector is obtained.
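The recurrence of step 204 can be sketched as a simple encoder loop. This is a hedged illustration: the tanh cell is a stand-in for the GRU encoder, and the keyword vectors, sizes, and initial state are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 4, 3  # feature size and number of keywords (both illustrative)

W_in, W_hid = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def encoder_step(x, h):
    """Stand-in for one pass of the GRU encoder of step 204."""
    return np.tanh(W_in @ x + W_hid @ h)

keyword_vectors = rng.standard_normal((m, d))  # the m keywords, already vectorised

h = np.zeros(d)            # no history before the 1st keyword
for x in keyword_vectors:  # each output I_n becomes the next input H_(n+1)
    h = encoder_step(x, h)

initial_hidden_state = h   # fed into the reply word generation of step 205
```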
205: and generating multiple reply words according to the rough semantic features and the initial hidden layer state feature vectors to obtain at least one reply word.
In the present embodiment, in the p-th reply word generation process, the input word vector K_p, the second input hidden feature L_p, and the coarse semantic features may be input into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, wherein p is an integer greater than or equal to 1 and less than or equal to q, and q, which is determined by the voice information, is an integer greater than or equal to 1; when p = 1, the input word vector K_1 is the initial hidden layer state feature vector. Then, word embedding processing is performed on the reply word O_p to obtain a reply word vector S_p. Finally, the reply word vector S_p is used as the input word vector K_(p+1) of the (p+1)-th reply word generation process, the second output hidden feature R_p is used as the second input hidden feature L_(p+1) of the (p+1)-th reply word generation process, and the (p+1)-th reply word generation process is performed, until at least one reply word is obtained after the multiple reply word generation processes.
Specifically, as shown in fig. 7, the generation process produces one reply word at a time: the reply word O_p is generated in the p-th step, and then the reply word O_(p+1) is generated in the (p+1)-th step. The reply word O_p generated in the previous (i.e., the p-th) step also serves as one of the inputs of the (p+1)-th step, while another input is the coarse semantic features. That is, the reply word O_(p+1) is generated from the word vector of the reply word O_p, the second output hidden feature R_p produced in the p-th step, and the coarse semantic features.
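The autoregressive generation loop of step 205 can be sketched as follows. This is a hedged illustration: the decoder cell is a tanh stand-in for the gated recurrent unit decoder, words are represented as vocabulary indices, and the embedding table, projection matrix, vocabulary size, and reply length q are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab, q = 4, 6, 3  # feature size, vocabulary size, reply length (all illustrative)

E = rng.standard_normal((vocab, d))      # word embedding table
W_out = rng.standard_normal((vocab, d))  # maps hidden features to word scores
W_x, W_h, W_c = (rng.standard_normal((d, d)) for _ in range(3))
coarse = rng.standard_normal(d)          # coarse semantic features from step 202

def decoder_cell(x, h, c):
    """Stand-in for the gated recurrent unit decoder of step 205."""
    return np.tanh(W_x @ x + W_h @ h + W_c @ c)

x = rng.standard_normal(d)  # K_1: the initial hidden layer state feature vector
h = np.zeros(d)             # L_1: initial second input hidden feature
reply_words = []
for p in range(q):
    h = decoder_cell(x, h, coarse)       # R_p: second output hidden feature
    word_id = int(np.argmax(W_out @ h))  # O_p: the generated reply word
    reply_words.append(word_id)
    x = E[word_id]                       # S_p = embedding(O_p) becomes K_(p+1)
```

Feeding the coarse semantic features into every step is what ties each generated word back to the previous round of dialogue, as described above; the word ids would finally be mapped to text and spliced in generation order (step 206).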
206: and splicing the at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In summary, in the coarse semantic based reply sentence determining method provided by the present application, the previous round of voice information of the user at the current moment is acquired, and coarse semantic extraction is performed on it to obtain semantic features that contain the high-level abstract information in the previous round of voice information; these serve as the coarse semantic features of the voice information of the user at the current moment, realizing the synchronous extraction of the key information and the coarse information in the previous round of voice information. Then, word segmentation processing is performed on the voice information at the current moment of the user, and multiple hidden feature extraction processes are performed on the obtained keywords to obtain the initial hidden layer state feature vector of the voice information. Finally, multiple reply word generation processes are performed according to the coarse semantic features and the initial hidden layer state feature vector, and the obtained at least one reply word is spliced according to the generation order of each reply word to obtain the reply sentence of the voice information. In this way, the coarse semantic features containing the key information and the coarse information of the previous round of dialogue serve as one of the bases for generating the reply sentence of the current round, so that the generation process incorporates more comprehensive information features of the previous round of dialogue. Therefore, the generated reply sentence has higher precision, matches the topic of the dialogue better, and improves the user experience.
Referring to fig. 8, fig. 8 is a functional block diagram of a reply sentence determining apparatus based on coarse semantics according to an embodiment of the present application. As shown in fig. 8, the coarse semantic based reply sentence determination apparatus 800 includes:
an obtaining module 801, configured to obtain the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information at the current moment of the user, where the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;
theprocessing module 802 is configured to perform rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information, perform word segmentation processing on the voice information to obtain a keyword group, and perform multiple times of hidden feature extraction processing on the keyword group to obtain an initial hidden layer feature vector;
thegenerating module 803 is configured to perform multiple reply word generating processing according to the rough semantic feature and the initial hidden layer feature vector to obtain at least one reply word, and splice the at least one reply word according to a generating sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In an embodiment of the present application, in performing rough semantic extraction on the voice information according to the previous round of voice information to obtain the rough semantic features corresponding to the voice information, the processing module 802 is specifically configured to:
detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word comprises a word label;
determining temporal information of the previous round of voice information according to at least one first word;
adding the temporal information into the word label of each first word to obtain at least one second word, where the at least one second word is in one-to-one correspondence with the at least one first word;
inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information and at least one first hidden layer state feature vector, where the at least one rough context information and the at least one first hidden layer state feature vector are each in one-to-one correspondence with the at least one second word;
and inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes to obtain the rough semantic features of the voice information.
In an embodiment of the present application, the processing module 802 is specifically configured to, in determining the temporal information of the previous round of voice information according to the at least one first word:
inputting the at least one first word into a gated recurrent unit (GRU) encoder for encoding to obtain a second hidden layer state feature vector;
inputting the second hidden layer state feature vector into a multi-layer perceptron to obtain a linear output result;
and inputting the linear output result into a temporal classifier to obtain the temporal information of the previous round of voice information.
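The three-stage chain above (GRU encoding, multi-layer perceptron, temporal classifier) can be sketched numerically. The following Python fragment is only an illustration with randomly initialized stand-in weights and assumed dimensions; the embodiment does not specify sizes, activations, or the set of tense classes:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (assumed)

def gru_step(h, x, W, U, b):
    """One simplified gated recurrent unit step (update gate + candidate)."""
    z = 1 / (1 + np.exp(-(W[0] @ x + U[0] @ h + b[0])))  # update gate
    c = np.tanh(W[1] @ x + U[1] @ (h * z) + b[1])        # candidate state
    return (1 - z) * h + z * c

W = rng.normal(size=(2, d, d))
U = rng.normal(size=(2, d, d))
b = rng.normal(size=(2, d))

# Encode embeddings of the tagged first words into the second
# hidden layer state feature vector.
first_words = rng.normal(size=(3, d))   # stand-in word embeddings
h = np.zeros(d)
for x in first_words:
    h = gru_step(h, x, W, U, b)

# Multi-layer perceptron producing the linear output result.
M1 = rng.normal(size=(d, d))
M2 = rng.normal(size=(3, d))            # 3 assumed tense classes
linear_out = M2 @ np.maximum(M1 @ h, 0)

# Temporal classifier: softmax over the assumed tense classes.
p = np.exp(linear_out - linear_out.max())
tense_probs = p / p.sum()
temporal_info = ["past", "present", "future"][int(tense_probs.argmax())]
```

With trained parameters, `temporal_info` would be the temporal information attached to the word labels in the next step.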
In an embodiment of the present application, in inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes, the processing module 802 is specifically configured to:
in the i-th decoding process, inputting the input feature vector A_i into the rough decoder to obtain an output feature vector B_i, where i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one rough context information and j is an integer greater than or equal to 1, and when i = 1, the input feature vector A_1 is the 1st rough context information of the at least one rough context information;
calculating the similarity D_i between the output feature vector B_i and the i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
normalizing the similarity D_i to obtain the weight E_i of the input feature vector A_i;
multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;
adding the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;
and taking the target output feature vector G_i as the input feature vector A_(i+1) of the (i+1)-th decoding process and performing the (i+1)-th decoding process, until the multiple decoding processes are completed and the rough semantic features of the voice information are obtained.
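The i-th decoding step above resembles a single-vector attention update. A minimal numpy sketch follows; cosine similarity and a sigmoid normalization are assumptions, since the embodiment does not fix the similarity measure or the normalization, and the decoder weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
d, j = 6, 4                         # feature size and number of rough context vectors (assumed)

Wd = rng.normal(size=(d, d)) * 0.1  # stand-in rough-decoder weights
context = rng.normal(size=(j, d))   # rough context information, one per second word
hidden = rng.normal(size=(j, d))    # first hidden layer state feature vectors C_1..C_j

A = context[0]                      # A_1 is the 1st rough context information
for i in range(j):
    B = np.tanh(Wd @ A)                                          # output feature vector B_i
    C = hidden[i]
    D = float(B @ C) / (np.linalg.norm(B) * np.linalg.norm(C))   # similarity D_i (cosine, assumed)
    E = 1.0 / (1.0 + np.exp(-D))                                 # weight E_i (sigmoid normalization, assumed)
    F = E * C                                                    # weighted feature vector F_i
    A = F + B                                                    # G_i, fed back as A_(i+1)
rough_semantic_feature = A          # result after the j-th decoding process
```

Each pass blends the decoder output with the matching encoder hidden state, so the final vector carries information from every second word of the previous round.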
In an embodiment of the present application, the keyword group includes at least one keyword, and the at least one keyword is arranged according to the position of each keyword in the voice information. Based on this, in performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer feature vector, the processing module 802 is specifically configured to:
in the n-th hidden feature extraction process, inputting the first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword and m is an integer greater than or equal to 1, and when n = 1, the first input hidden feature H_1 is the 1st keyword of the at least one keyword;
and taking the first output hidden feature I_n as the first input hidden feature H_(n+1) of the (n+1)-th hidden feature extraction process and performing the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are completed and the initial hidden layer state feature vector is obtained.
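This recurrence can be sketched as follows. Read literally, the embodiment feeds only H_n forward; the sketch below additionally feeds keyword n into each GRU step, which is an assumption about the intended reading (otherwise keywords after the first would never enter the encoder), and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 6, 5                          # hidden size and keyword count (assumed)
keywords = rng.normal(size=(m, d))   # stand-in keyword embeddings, in utterance order

Wz, Uz, Wc, Uc = (rng.normal(size=(d, d)) * 0.2 for _ in range(4))

def gru_encode(h_prev, x):
    """Simplified gated recurrent unit step."""
    z = 1 / (1 + np.exp(-(Wz @ x + Uz @ h_prev)))   # update gate
    c = np.tanh(Wc @ x + Uc @ h_prev)               # candidate state
    return (1 - z) * h_prev + z * c

H = keywords[0]                      # H_1 is the 1st keyword
for n in range(m):
    # first output hidden feature I_n becomes the next input H_(n+1)
    H = gru_encode(H, keywords[n])
initial_hidden_vector = H            # initial hidden layer state feature vector
```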
In an embodiment of the present application, in performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer feature vector to obtain at least one reply word, the generating module 803 is specifically configured to:
in the p-th reply word generation process, inputting the input word vector K_p, the second input hidden feature L_p and the rough semantic features into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, where p is an integer greater than or equal to 1 and less than or equal to q, q is determined by the voice information and is an integer greater than or equal to 1, and when p = 1, the input word vector K_1 is the initial hidden layer feature vector;
performing word embedding processing on the reply word O_p to obtain a reply word vector S_p;
and taking the reply word vector S_p as the input word vector K_(p+1) of the (p+1)-th reply word generation process, taking the second output hidden feature R_p as the second input hidden feature L_(p+1) of the (p+1)-th reply word generation process, and performing the (p+1)-th reply word generation process, until the multiple reply word generation processes are completed and the at least one reply word is obtained.
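A toy numpy rendering of this generation loop is shown below; the vocabulary, the stand-in weights, the fixed reply length q, and greedy word selection are all assumptions for illustration (the embodiment only states that q is determined by the voice information):

```python
import numpy as np

rng = np.random.default_rng(3)
d, q = 6, 4                                   # feature size and reply length (assumed)
vocab = ["the", "film", "starts", "tonight", "<eos>"]
embed = rng.normal(size=(len(vocab), d))      # stand-in word-embedding table

Wk, Wl, Wg = (rng.normal(size=(d, d)) * 0.2 for _ in range(3))
Wo = rng.normal(size=(len(vocab), d)) * 0.2   # output projection to the vocabulary

rough = rng.normal(size=d)   # rough semantic features of the previous round
K = rng.normal(size=d)       # K_1: the initial hidden layer feature vector
L = np.zeros(d)              # L_1: second input hidden feature (assumed zero)

reply_words = []
for p in range(q):
    R = np.tanh(Wk @ K + Wl @ L + Wg @ rough)  # second output hidden feature R_p
    O = vocab[int((Wo @ R).argmax())]          # reply word O_p (greedy choice, assumed)
    reply_words.append(O)
    K = embed[vocab.index(O)]                  # S_p = embedding of O_p -> K_(p+1)
    L = R                                      # R_p -> L_(p+1)

# Splice the reply words in generation order to form the reply sentence.
reply_sentence = " ".join(reply_words)
```

Because the rough semantic features enter every step through `Wg @ rough`, the previous round conditions each generated word, which is the mechanism the embodiment relies on.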
In an embodiment of the present application, in performing word segmentation processing on the voice information to obtain a keyword group, the processing module 802 is specifically configured to:
converting the voice information into text and segmenting the text to obtain at least one first keyword;
combining a first adjacent word and a second adjacent word to obtain at least one second keyword, where the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword and the field interval between the first adjacent word and the second adjacent word is smaller than a first threshold;
matching each second keyword in the at least one second keyword against a preset entity library and screening out the second keywords that fail to match, to obtain at least one third keyword;
deleting, from the at least one first keyword, the first keywords that compose each third keyword, to obtain at least one fourth keyword;
and combining the at least one third keyword with the at least one fourth keyword to obtain the keyword group.
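The segmentation steps above can be illustrated with plain Python; the example words, the entity library, and the restriction to directly adjacent pairs (i.e. pairs whose field interval falls below the first threshold) are all hypothetical:

```python
def build_keyword_group(first_keywords, entity_library):
    """Hypothetical sketch of the claimed keyword-group construction.
    Only directly adjacent first keywords are merged here, standing in for
    pairs whose field interval is below the first threshold."""
    # Merge adjacent first keywords into second-keyword candidates.
    second = [
        (i, first_keywords[i] + first_keywords[i + 1])
        for i in range(len(first_keywords) - 1)
    ]
    # Keep only candidates found in the entity library (third keywords).
    third = [(i, w) for i, w in second if w in entity_library]
    # Remove the first keywords consumed by a third keyword (fourth keywords).
    used = {i for i, _ in third} | {i + 1 for i, _ in third}
    fourth = [w for i, w in enumerate(first_keywords) if i not in used]
    # The keyword group combines the third and fourth keywords.
    return [w for _, w in third] + fourth

group = build_keyword_group(
    ["new", "york", "weather", "today"],
    entity_library={"newyork"},
)
# group -> ["newyork", "weather", "today"]
```

The entity-library match keeps merged words only when they name a known entity, so ordinary word pairs fall back to their individual first keywords.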
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device 900 includes a transceiver 901, a processor 902 and a memory 903, which are connected by a bus 904. The memory 903 is used to store computer programs and data, and the data stored in the memory 903 may be transferred to the processor 902.
The processor 902 is configured to read a computer program in the memory 903 to perform the following operations:
acquiring the previous round of voice information adjacent to the voice information according to the occurrence time of the user's voice information at the current moment, where the occurrence time of the previous round of voice information is earlier than that of the voice information and the absolute value of the difference between the two occurrence times is minimal;
performing rough semantic extraction on the voice information according to the voice information of the previous round to obtain rough semantic features corresponding to the voice information;
performing word segmentation processing on the voice information to obtain a keyword group;
performing multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer feature vector;
performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer feature vector to obtain at least one reply word;
and splicing the at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In an embodiment of the present application, the processor 902 is specifically configured to perform the following operations in performing rough semantic extraction on the voice information according to the previous round of voice information to obtain the rough semantic features corresponding to the voice information:
detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word comprises a word label;
determining temporal information of the previous round of voice information according to at least one first word;
adding the temporal information into the word label of each first word to obtain at least one second word, where the at least one second word is in one-to-one correspondence with the at least one first word;
inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information and at least one first hidden layer state feature vector, where the at least one rough context information and the at least one first hidden layer state feature vector are each in one-to-one correspondence with the at least one second word;
and inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes to obtain the rough semantic features of the voice information.
In an embodiment of the present application, the processor 902 is specifically configured to, in determining the temporal information of the previous round of voice information according to the at least one first word:
inputting the at least one first word into a gated recurrent unit (GRU) encoder for encoding to obtain a second hidden layer state feature vector;
inputting the second hidden layer state feature vector into a multi-layer perceptron to obtain a linear output result;
and inputting the linear output result into a temporal classifier to obtain the temporal information of the previous round of voice information.
In an embodiment of the present application, the processor 902 is specifically configured to perform the following operations in inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes to obtain the rough semantic features of the voice information:
in the i-th decoding process, inputting the input feature vector A_i into the rough decoder to obtain an output feature vector B_i, where i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one rough context information and j is an integer greater than or equal to 1, and when i = 1, the input feature vector A_1 is the 1st rough context information of the at least one rough context information;
calculating the similarity D_i between the output feature vector B_i and the i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
normalizing the similarity D_i to obtain the weight E_i of the input feature vector A_i;
multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;
adding the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;
and taking the target output feature vector G_i as the input feature vector A_(i+1) of the (i+1)-th decoding process and performing the (i+1)-th decoding process, until the multiple decoding processes are completed and the rough semantic features of the voice information are obtained.
In an embodiment of the present application, the keyword group includes at least one keyword, and the at least one keyword is arranged according to the position of each keyword in the voice information. Based on this, the processor 902 is specifically configured to perform the following operations in performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer feature vector:
in the n-th hidden feature extraction process, inputting the first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword and m is an integer greater than or equal to 1, and when n = 1, the first input hidden feature H_1 is the 1st keyword of the at least one keyword;
and taking the first output hidden feature I_n as the first input hidden feature H_(n+1) of the (n+1)-th hidden feature extraction process and performing the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are completed and the initial hidden layer state feature vector is obtained.
In an embodiment of the present application, the processor 902 is specifically configured to perform the following operations in performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer feature vector to obtain at least one reply word:
in the p-th reply word generation process, inputting the input word vector K_p, the second input hidden feature L_p and the rough semantic features into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, where p is an integer greater than or equal to 1 and less than or equal to q, q is determined by the voice information and is an integer greater than or equal to 1, and when p = 1, the input word vector K_1 is the initial hidden layer feature vector;
performing word embedding processing on the reply word O_p to obtain a reply word vector S_p;
and taking the reply word vector S_p as the input word vector K_(p+1) of the (p+1)-th reply word generation process, taking the second output hidden feature R_p as the second input hidden feature L_(p+1) of the (p+1)-th reply word generation process, and performing the (p+1)-th reply word generation process, until the multiple reply word generation processes are completed and the at least one reply word is obtained.
In an embodiment of the present application, in performing word segmentation processing on the voice information to obtain a keyword group, the processor 902 is specifically configured to perform the following operations:
converting the voice information into text and segmenting the text to obtain at least one first keyword;
combining a first adjacent word and a second adjacent word to obtain at least one second keyword, where the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword and the field interval between the first adjacent word and the second adjacent word is smaller than a first threshold;
matching each second keyword in the at least one second keyword against a preset entity library and screening out the second keywords that fail to match, to obtain at least one third keyword;
deleting, from the at least one first keyword, the first keywords that compose each third keyword, to obtain at least one fourth keyword;
and combining the at least one third keyword with the at least one fourth keyword to obtain the keyword group.
It should be understood that the rough-semantic-based reply sentence determining device in the present application may include a smart phone (such as an Android phone, an iOS phone or a Windows phone), a tablet computer, a palmtop computer, a notebook computer, a mobile internet device (MID), a robot, a wearable device and the like. The devices listed above are merely examples and are not exhaustive; the rough-semantic-based reply sentence determining device includes but is not limited to them. In practical applications, it may further include an intelligent vehicle-mounted terminal, a computer device and the like.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software in combination with a hardware platform. With such understanding, all of the technical solution of the present application, or the part that contributes to the background art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the various embodiments, or in parts of the embodiments, of the present application.
Accordingly, the present application also provides a computer-readable storage medium storing a computer program that is executed by a processor to implement some or all of the steps of any one of the rough semantic based reply sentence determination methods described in the above method embodiments. For example, the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, etc.
The present application also provides a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the rough semantic based reply sentence determination methods described in the method embodiments above.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated units described above may be implemented either in hardware or in software program modules.
The integrated units, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk or the like.
The embodiments of the present application have been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may make modifications to the specific implementations and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.