Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, based on the embodiments herein, which would be apparent to one of ordinary skill in the art without undue burden are within the scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly understand that the embodiments described herein may be combined with other embodiments.
First, referring to fig. 1, fig. 1 is a schematic hardware structure of a reply sentence determining apparatus based on coarse semantics according to an embodiment of the present application. The coarse semantic based reply sentence determining apparatus 100 comprises at least one processor 101, a communication line 102, a memory 103 and at least one communication interface 104.
In this embodiment, the processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in the present application.
Communication line 102 may include a pathway to transfer information between the above-described components.
The communication interface 104, which may be any transceiver-like device (e.g., an antenna), is used to communicate with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 103 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
In this embodiment, the memory 103 may be independently provided and connected to the processor 101 via the communication line 102. The memory 103 may also be integrated with the processor 101. The memory 103 provided by embodiments of the present application may generally be non-volatile. The memory 103 is used for storing the computer-executable instructions for executing the embodiments of the present application, and execution is controlled by the processor 101. The processor 101 is configured to execute the computer-executable instructions stored in the memory 103, thereby implementing the methods provided in the embodiments of the present application described below.
In alternative embodiments, computer-executable instructions may also be referred to as application code, which is not specifically limited in this application.
In alternative embodiments,processor 101 may include one or more CPUs, such as CPU0 and CPU1 in fig. 1.
In an alternative embodiment, the coarse semantic based reply sentence determination device 100 may include multiple processors, such as the processor 101 and the processor 107 in fig. 1. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In an alternative embodiment, the reply sentence determining apparatus 100 based on coarse semantics may be a server; for example, it may be a stand-alone server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), big data, and artificial intelligence platforms. The coarse semantic based reply sentence determining apparatus 100 may further include an output device 105 and an input device 106. The output device 105 communicates with the processor 101 and may display information in a variety of ways. For example, the output device 105 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 106 is in communication with the processor 101 and may receive user input in a variety of ways. For example, the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device.
The above-described reply sentence determining apparatus 100 based on coarse semantics may be a general-purpose device or a special-purpose device. The present embodiment does not limit the type of the reply sentence determining apparatus 100.
Second, it should be noted that the embodiments disclosed herein may acquire and process related data based on artificial intelligence techniques. Artificial intelligence (AI) encompasses the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Finally, the reply sentence determining method based on coarse semantics can be applied to scenarios such as telephone consultation, e-commerce sales, offline in-person sales, service promotion, outbound agent calls, and social platform promotion. In this application, the method is mainly described by taking the telephone consultation scenario as an example; its implementation in other scenarios is similar to that in the telephone consultation scenario and is not described again here.
The rough semantic-based reply sentence determination method disclosed in the present application will be described below:
referring to fig. 2, fig. 2 is a flow chart of a method for determining a reply sentence based on coarse semantics according to an embodiment of the present application. The method for determining the reply sentence based on the rough semantics comprises the following steps:
201: and acquiring the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information at the current moment of the user.
In this embodiment, the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is the smallest. In short, the previous round of voice information is the last sentence the user spoke before speaking the voice information at the current time.
For example, the previous round of voice information may be determined, by the occurrence time of the voice information at the current moment of the user, by querying historical dialogue data that records the dialogue data generated before the current time by the dialogue event to which the voice information belongs. Specifically, two related sentence queues may be stored in the historical dialogue data: one queue is used to store user sentences issued by the user, and the other is used to store reply sentences made by the AI to those user sentences. Meanwhile, each user sentence in the user sentence queue and each reply sentence in the reply sentence queue contains a dialogue identifier and a dialogue occurrence time; user sentences and reply sentences with the same identifier are combined into question-answer pairs through the dialogue identifier, that is, a reply sentence with the same dialogue identifier is the answer to the corresponding user sentence. Therefore, the logical order of questions and answers in the historical dialogue data can be guaranteed, and storing the user's sentences and the AI's sentences separately facilitates searching.
Therefore, in the present embodiment, by querying the user sentence queue, the voice information whose dialogue occurrence time is earlier than that of the voice information at the current moment of the user, and whose occurrence time is closest to it, can be determined as the previous round of voice information.
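The two-queue history storage and the closest-earlier-sentence query described above can be sketched as follows. This is a minimal illustration, not the application's implementation; the field names (`dialogue_id`, `time`, `text`) and sample sentences are assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    dialogue_id: str   # the same identifier pairs a user sentence with its AI reply
    time: float        # dialogue occurrence time
    text: str

# Two related sentence queues, as described above:
# one for user sentences, one for AI reply sentences.
user_queue = [
    Utterance("d1", 10.0, "hello"),
    Utterance("d2", 20.0, "what are your opening hours"),
    Utterance("d3", 30.0, "thanks, goodbye"),
]
reply_queue = [
    Utterance("d1", 11.0, "hi, how can I help you"),
    Utterance("d2", 21.0, "we open at 9 am"),
]

def previous_round(queue, current_time):
    """Return the user sentence whose occurrence time is earlier than
    current_time and closest to it (smallest absolute difference)."""
    earlier = [u for u in queue if u.time < current_time]
    if not earlier:
        return None
    return min(earlier, key=lambda u: current_time - u.time)

def reply_for(dialogue_id):
    """Look up the AI reply paired with a user sentence via the dialogue id."""
    return next((r for r in reply_queue if r.dialogue_id == dialogue_id), None)
```

Keeping the queues separate, as the embodiment notes, means the previous-round query only ever scans user sentences, while the dialogue identifier preserves the question-answer pairing.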
202: and carrying out rough semantic extraction on the voice information according to the voice information of the previous round to obtain rough semantic features corresponding to the voice information.
In this embodiment, coarse semantic features may be understood as semantic features that contain the high-level abstract information in the previous round of voice information. Illustratively, multiple high-order parallel sequences may be obtained by actively constructing a high-level coarse sequence representation and then analyzing it. A low-order coarse sequence is then generated through a layered structure, so that the information in the multiple high-order parallel sequences flows into the low-order coarse sequence. This realizes the synchronous extraction of the key information and the coarse information in the voice information, and multiple layers of information can be embodied synchronously. Meanwhile, after the conversion into the low-order coarse sequence, the model that generates the reply sentence can better memorize and understand long-term content, so that meaningful replies closely related to the topic are generated and the user experience is improved.
Illustratively, this embodiment provides a method for performing coarse semantic extraction on the voice information according to the previous round of voice information to obtain the coarse semantic features corresponding to the voice information; as shown in fig. 3, the method includes:
301: and detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information.
In this embodiment, the detection process may be to perform word segmentation after converting the previous round of voice information into text, and then take all the words obtained through the word segmentation processing as the at least one first word. Meanwhile, each of the at least one first word may include a word label, which may be the part-of-speech information of the corresponding first word, for example: a noun, a verb, or a named entity.
Thus, in this embodiment, named entity information may be extracted from the converted text by a conditional random field (CRF) model, and the type of each named entity, such as a person name or an organization name, may be marked by the CRF. Then, word segmentation and part-of-speech (POS) tagging are performed on the text with a POS tagging tool, and the nouns and verbs in the text are extracted. The combined result of the CRF and the POS tagging is needed in this process because POS tags apply only to individual words, whereas a CRF entity may be a complete phrase. For example, in the sentence "I work at Fudan University in Shanghai", the CRF can completely identify the organization-name entity "Fudan University", while the POS tagger can only identify the individual nouns: "Shanghai", "Fudan", and "University". So, when processing entity words, if a POS result is contained in a CRF result, the CRF result is used preferentially; for verbs, only the POS result is used. Thus, the first words containing part-of-speech information labels can be obtained.
In an alternative embodiment, if the language used by the user is English, a set of verbs and named entities in the corresponding field may be pre-constructed, and the verbs and named entities in the original sentence extracted by matching against this set; the English nouns may also be recognized and extracted using POS tagging, so as to obtain the first words containing part-of-speech information labels.
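The CRF/POS merging rule above can be sketched in a few lines. This is a hedged illustration of the preference rule only (CRF result wins when it contains a POS noun; verbs always come from POS); the function name and example sentence are assumptions, and real CRF/POS tagging is replaced by hard-coded results.

```python
def merge_crf_pos(crf_entities, pos_nouns, pos_verbs):
    """Combine CRF and POS results: if a POS noun is contained in a CRF
    entity, the CRF result is used preferentially; verbs always come
    from the POS result."""
    kept_nouns = [n for n in pos_nouns
                  if not any(n in entity for entity in crf_entities)]
    return crf_entities + kept_nouns + pos_verbs

# Illustrative sentence: "I work at Fudan University in Shanghai"
crf = ["Fudan University"]                    # CRF finds the full organization name
nouns = ["Shanghai", "Fudan", "University"]   # POS sees only single words
verbs = ["work"]
first_words = merge_crf_pos(crf, nouns, verbs)
```

Here "Fudan" and "University" are dropped because they are contained in the CRF entity, while "Shanghai" survives as a standalone noun, matching the preference rule described in the text.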
302: the temporal information of the previous round of voice information is determined according to at least one first word.
In this embodiment, the at least one first word obtained by word segmentation may be input to a gated recurrent unit (GRU) encoder for encoding, so as to obtain a second hidden layer state feature vector. The second hidden layer state feature vector is then input to a multilayer perceptron (MLP) to yield a linear output result. Finally, the linear output result is input into a temporal classifier to obtain the temporal information of the previous round of voice information.
Specifically, the structure of the GRU is shown in fig. 4, which includes a reset gate r_t, an update gate z_t, a candidate memory cell h̃_t, and a current-time memory cell h_t.

Specifically, the reset gate r_t can be represented by formula ①:

r_t = σ(W_r X_t + U_r h_(t-1) + b_r).........①

wherein σ is the activation function, and W_r and U_r are the parameter matrices corresponding to the reset gate r_t; their initialized values are random, and new values can be obtained by training the model. b_r is the bias corresponding to the reset gate r_t and is also trainable.

Further, the update gate z_t can be represented by formula ②:

z_t = σ(W_z X_t + U_z h_(t-1) + b_z).........②

wherein W_z and U_z are the parameter matrices corresponding to the update gate z_t; their initialized values are random, and new values can be obtained by training the model. b_z is the bias corresponding to the update gate z_t and is also trainable.

Further, the candidate memory cell h̃_t can be represented by formula ③:

h̃_t = tanh(W X_t + U (r_t ⊙ h_(t-1)) + b).........③

wherein tanh is an activation function, and W and U are the parameter matrices corresponding to the candidate memory cell h̃_t; their initialized values are random, and new values can be obtained by training the model. b is the bias corresponding to the candidate memory cell h̃_t and is also trainable.

Further, the current-time memory cell h_t can be represented by formula ④:

h_t = (1 − z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t.........④

wherein z_t serves as the weight and is trainable, and ⊙ denotes element-wise multiplication.
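The gate computations above can be sketched as a small NumPy GRU cell. This is a hedged illustration, not the application's encoder: the reset/update gates follow the formulas in the text, the candidate-cell and state-update steps use the standard GRU formulation, and the dimension and random initialisation are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hidden/input size (illustrative; the application does not fix dimensions)

# Randomly initialised trainable parameters, as described for the formulas.
W_r, U_r, b_r = rng.standard_normal((d, d)), rng.standard_normal((d, d)), np.zeros(d)
W_z, U_z, b_z = rng.standard_normal((d, d)), rng.standard_normal((d, d)), np.zeros(d)
W,   U,   b   = rng.standard_normal((d, d)), rng.standard_normal((d, d)), np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev):
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)        # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)        # update gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev) + b)   # candidate memory cell
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand            # current-time memory cell
    return h_t

h = np.zeros(d)
for x in rng.standard_normal((3, d)):  # encode a 3-step input sequence
    h = gru_cell(x, h)
```

Because h_t is a convex combination of the previous state and a tanh output, the hidden state stays bounded in [−1, 1], which is part of why the GRU remembers long-term content stably.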
In this embodiment, the structure of the MLP is shown in fig. 5. The MLP is composed of two linear layers and a ReLU activation function. After the linear output result is produced by the last linear layer, it is input into a softmax function for multi-label classification, and finally the temporal classifier judges the tense of the current sentence. This avoids the misidentification and missed identification caused by relying solely on standalone tense markers (such as "already") in traditional tense identification. For example, the voice message "I am running" is in the present progressive tense, but a traditional recognition method fails to recognize this because the sentence contains no such standalone tense marker.
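The Linear → ReLU → Linear → softmax classifier head of fig. 5 can be sketched as follows. This is a minimal illustration with random untrained weights; the hidden size and the three tense classes are assumptions introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, classes = 8, 3  # e.g. past / present progressive / future (illustrative)

W1, b1 = rng.standard_normal((hidden, hidden)), np.zeros(hidden)
W2, b2 = rng.standard_normal((classes, hidden)), np.zeros(classes)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def tense_probs(h):
    """Linear -> ReLU -> Linear -> softmax, mirroring the MLP of fig. 5."""
    z = np.maximum(0.0, W1 @ h + b1)  # first linear layer + ReLU
    return softmax(W2 @ z + b2)       # second linear layer + softmax

# h would be the second hidden layer state feature vector from the GRU encoder.
probs = tense_probs(rng.standard_normal(hidden))
```

The softmax output is a probability distribution over tense classes, so the classifier can express uncertainty between, say, simple present and present progressive rather than keying on a single marker word.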
303: and adding the temporal information into the word label of each first word to obtain at least one second word corresponding to at least one first word one by one.
In brief, in this embodiment, a second word is a first word whose word label has been supplemented with the temporal information of the corresponding voice information. Therefore, in addition to the part-of-speech information it already carries, the second word also carries the temporal information of the voice information, so that the reply sentence generated later is more accurate.
304: and inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information corresponding to the at least one second word one by one and at least one first hidden layer state characteristic vector corresponding to the at least one second word one by one.
In this embodiment, the coarse encoder may be a GRU encoder. Specifically, during encoding, the at least one second word is sequentially input into the GRU encoder, and the encoder outputs the corresponding coarse context information and first hidden layer state feature vector. In the encoding process, besides the second word currently being encoded, the first hidden layer state feature vector output in the previous encoding step is also used as an input. That is, when encoding the x-th second word, the x-th second word and the (x−1)-th first hidden layer state feature vector may be input to the GRU encoder to obtain the x-th piece of coarse context information and the x-th first hidden layer state feature vector. When x = 1, since there is no 0th second word, only the 1st second word is input into the GRU encoder for encoding.
305: and inputting at least one first hidden layer state feature vector of at least one rough context information into a rough decoder to perform multiple decoding processing to obtain rough semantic features of the voice information.
In this embodiment, when the coarse semantic features are extracted, the second words obtained by splitting differ in how important they are to the voice information. Thus, before the coarse context information is input to the coarse decoder, it may be subjected to attention processing to obtain the importance degree of each piece of coarse context information.

Illustratively, each second word corresponds to one hidden layer state feature vector in the encoder, i.e., a first hidden layer state feature vector; in brief, there are as many first hidden layer state feature vectors as there are pieces of coarse context information. The coarse context information is input into the decoder, and during decoding the decoder calculates the similarity between the feature vector of the current decoding step (the decoder's current output) and the hidden layer state feature vectors obtained from encoding. In this way a similarity value is calculated for each piece of coarse context information, and these similarity values are then normalized to obtain a weight for each piece of coarse context information. The weight corresponding to each piece of coarse context information is multiplied by the hidden layer state feature vector that the encoder produced for it, yielding an attention feature; this attention feature is added to the output feature vector obtained when the coarse context information is input into the decoder, yielding the final feature for that decoding step.
Based on this, the present embodiment provides a method for inputting the at least one piece of coarse context information and the at least one first hidden layer state feature vector into a coarse decoder for multiple decoding processes, to obtain the coarse semantic features of the voice information; as shown in fig. 6, the method includes:
601: in the ith decoding process, the input feature vector Ai Input to a coarse decoder to obtain an output feature vector Bi 。
In the present embodiment, i is an integer greater than or equal to 1 and less than or equal to j, j is the number of at least one piece of rough context information, j is an integer greater than or equal to 1, and when i=1, the feature vector a is inputi Is the 1 st rough context information of the at least one rough context information.
602: calculate the output eigenvector Bi And an ith first hidden layer state feature vector C of the at least one first hidden layer state feature vectori Similarity D betweeni 。
In the present embodiment, the output feature vector B may be calculatedi And the i first hidden layer state feature vector Ci Cosine similarity between them to obtain similarity Di 。
603: for similarity Di Normalization processing is carried out to obtain an input feature vector Ai Weight E of (2)i 。
In the present embodiment, the similarity degree D may bei Normalizing the input softmax function to obtain an input feature vector Ai Weight E of (2)i 。
604: weight Ei And the ith first hidden layer state characteristic vector Ci Multiplying to obtain a weight feature vector Fi 。
605: will weight the feature vector Fi And output feature vector Bi Adding to obtain a target output feature vector Gi 。
606: outputting the target feature vector Gi Input feature vector A as the (i+1) -th decoding processi+1 And (3) performing the (i+1) th decoding processing until the decoding processing is performed for a plurality of times, and obtaining the rough semantic features of the voice information.
Specifically, across the multiple decoding processes, the output of one step is used as the input of the next, and when all the decoding processes have been performed, the final output obtained is the coarse semantic features of the voice information.
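The attention-weighted decoding loop of steps 601–606 can be sketched as follows. This is a hedged illustration only: the coarse decoder is replaced by a tanh stand-in, the similarity weights are normalized across all encoder hidden states (as in the description preceding the steps, since softmax over a single scalar would be degenerate), and all dimensions and initial values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, j = 4, 3  # feature size and number of coarse-context pieces (illustrative)

C = rng.standard_normal((j, d))  # first hidden layer state feature vectors
A1 = rng.standard_normal(d)      # step 601: A_1 is the 1st piece of coarse context info

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decoder_step(a):
    """Stand-in for one pass of the coarse decoder."""
    return np.tanh(a)

a = A1
for i in range(j):
    b = decoder_step(a)                          # 601: output feature vector B_i
    sims = np.array([cosine(b, c) for c in C])   # 602: similarity against hidden states
    w = softmax(sims)                            # 603: normalized weights E
    f = w @ C                                    # 604: attention/weighted feature F_i
    g = f + b                                    # 605: target output feature vector G_i
    a = g                                        # 606: G_i becomes input A_(i+1)

coarse_semantic_feature = a  # final output after the multiple decoding processes
```

The weighted sum lets hidden states of the more important second words dominate each decoding step, which is exactly the "importance degree" the attention processing is meant to capture.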
203: and performing word segmentation processing on the voice information to obtain a keyword group.
In this embodiment, the voice information may be converted into text, and the text then segmented to obtain at least one first keyword. Then, any two different adjacent first keywords, a first adjacent word and a second adjacent word whose field interval is smaller than a first threshold, are combined to obtain at least one second keyword.
Specifically, the first adjacent word and the second adjacent word are any two different adjacent first keywords whose field interval is smaller than the first threshold, where the field interval can be understood as the number of characters between the corresponding positions of the first adjacent word and the second adjacent word in the text. For example, for the text "The Disneyland that opened in Shanghai city in 2016 is located in the Pudong New District", the first keywords obtained after word segmentation and screening may be: "Shanghai city", "2016", "Disney", "land", "Pudong", and "New District". Here, the number of characters between the corresponding positions of the first keywords "2016" and "Disney" in the text is 3, so the field interval between them is 3; the number of characters between the first keywords "Disney" and "land" is 0, so the field interval between them is 0.
In this embodiment, the first threshold may be set to 1. Thus, continuing the example above, the adjacent first keyword pairs satisfying the requirement are: "Disney" and "land", and "Pudong" and "New District". The second keywords "Disneyland" and "Pudong New District" can therefore be derived.
Then, each second keyword in the at least one second keyword is matched against a preset entity library, and the second keywords that fail to match are screened out, yielding at least one third keyword. The first keywords that constitute each third keyword are then deleted from the at least one first keyword, yielding at least one fourth keyword.
Specifically, the fourth keywords are the first keywords remaining after the first keywords constituting each of the at least one third keyword are removed. Illustratively, continuing the example above, assume that the third keyword is "Disneyland", composed of the first keywords "Disney" and "land". Removing "Disney" and "land" from the first keywords "Shanghai city", "2016", "Disney", "land", "Pudong", and "New District" leaves the first keywords "Shanghai city", "2016", "Pudong", and "New District" as the fourth keywords.
Finally, the at least one third keyword and the at least one fourth keyword are combined to obtain the keyword group.
Specifically, continuing the example above, combining the third keyword "Disneyland" with the fourth keywords "Shanghai city", "2016", "Pudong", and "New District" yields the keyword group: "Shanghai city", "2016", "Disneyland", "Pudong", and "New District".
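The keyword-group construction of step 203 can be sketched end to end. This is a hedged illustration: the entity library contents, the pre-computed field intervals, and the space-free joining of adjacent keywords (which mimics a language without word separators) are all assumptions introduced for the example.

```python
ENTITY_LIB = {"Disneyland"}  # illustrative preset entity library

def build_keyword_group(first_keywords, gaps, threshold=1):
    """first_keywords: the first keywords in text order; gaps[i] is the
    field interval (character count) between keywords i and i+1."""
    third, used = [], set()
    for i, gap in enumerate(gaps):
        if gap < threshold:                        # adjacent pair qualifies
            merged = first_keywords[i] + first_keywords[i + 1]
            if merged in ENTITY_LIB:               # keep only library matches
                third.append(merged)
                used.update({i, i + 1})
    # cull the constituents of each third keyword; the rest are fourth keywords
    fourth = [w for i, w in enumerate(first_keywords) if i not in used]
    return third + fourth

keywords = ["Shanghai city", "2016", "Disney", "land", "Pudong", "NewDistrict"]
gaps = [3, 3, 0, 3, 0]  # "Disney"/"land" and "Pudong"/"NewDistrict" are adjacent
group = build_keyword_group(keywords, gaps)
```

"PudongNewDistrict" is combined but fails the entity-library match, so it is screened out and its constituents survive as fourth keywords, mirroring the screening step described above.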
204: and carrying out multiple times of hidden characteristic extraction processing on the key word groups to obtain an initial hidden layer state characteristic vector.
In this embodiment, the keyword group may include at least one keyword, arranged according to the position of each keyword in the voice information. Based on this, the present embodiment provides a method for performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer state feature vector, which specifically includes the following steps:
in the n-th hidden feature extraction process, input the first input hidden feature H_n into the GRU encoder to obtain a first output hidden feature I_n, wherein n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword, and m is an integer greater than or equal to 1; when n = 1, the first input hidden feature H_1 is the 1st keyword of the at least one keyword. Then use the first output hidden feature I_n as the first input hidden feature H_(n+1) of the (n+1)-th hidden feature extraction process, and perform the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are completed and the initial hidden layer state feature vector is obtained.
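The recurrence of step 204 can be sketched as a simple encoder loop. This is a hedged illustration: the tanh cell is a stand-in for the GRU encoder, and the keyword vectors, sizes, and initial state are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 4, 3  # feature size and number of keywords (both illustrative)

W_in, W_hid = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def encoder_step(x, h):
    """Stand-in for one pass of the GRU encoder of step 204."""
    return np.tanh(W_in @ x + W_hid @ h)

keyword_vectors = rng.standard_normal((m, d))  # the m keywords, already vectorised

h = np.zeros(d)            # no history before the 1st keyword
for x in keyword_vectors:  # each output I_n becomes the next input H_(n+1)
    h = encoder_step(x, h)

initial_hidden_state = h   # fed into the reply word generation of step 205
```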
205: and generating multiple reply words according to the rough semantic features and the initial hidden layer state feature vectors to obtain at least one reply word.
In the present embodiment, in the p-th reply word generation process, the input word vector K_p, the second input hidden feature L_p, and the coarse semantic features may be input into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, wherein p is an integer greater than or equal to 1 and less than or equal to q, and q, which is determined by the voice information, is an integer greater than or equal to 1; when p = 1, the input word vector K_1 is the initial hidden layer state feature vector. Then, word embedding processing is performed on the reply word O_p to obtain a reply word vector S_p. Finally, the reply word vector S_p is used as the input word vector K_(p+1) of the (p+1)-th reply word generation process, the second output hidden feature R_p is used as the second input hidden feature L_(p+1) of the (p+1)-th reply word generation process, and the (p+1)-th reply word generation process is performed, until at least one reply word is obtained after the multiple reply word generation processes.
Specifically, as shown in fig. 7, the generation process produces one reply word at a time: the reply word O_p is generated in the p-th step, and then the reply word O_(p+1) is generated in the (p+1)-th step. The reply word O_p generated in the previous (i.e., the p-th) step also serves as one of the inputs of the (p+1)-th step, while another input is the coarse semantic features. That is, the reply word O_(p+1) is generated from the word vector of the reply word O_p, the second output hidden feature R_p produced in the p-th step, and the coarse semantic features.
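The autoregressive generation loop of step 205 can be sketched as follows. This is a hedged illustration: the decoder cell is a tanh stand-in for the gated recurrent unit decoder, words are represented as vocabulary indices, and the embedding table, projection matrix, vocabulary size, and reply length q are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab, q = 4, 6, 3  # feature size, vocabulary size, reply length (all illustrative)

E = rng.standard_normal((vocab, d))      # word embedding table
W_out = rng.standard_normal((vocab, d))  # maps hidden features to word scores
W_x, W_h, W_c = (rng.standard_normal((d, d)) for _ in range(3))
coarse = rng.standard_normal(d)          # coarse semantic features from step 202

def decoder_cell(x, h, c):
    """Stand-in for the gated recurrent unit decoder of step 205."""
    return np.tanh(W_x @ x + W_h @ h + W_c @ c)

x = rng.standard_normal(d)  # K_1: the initial hidden layer state feature vector
h = np.zeros(d)             # L_1: initial second input hidden feature
reply_words = []
for p in range(q):
    h = decoder_cell(x, h, coarse)       # R_p: second output hidden feature
    word_id = int(np.argmax(W_out @ h))  # O_p: the generated reply word
    reply_words.append(word_id)
    x = E[word_id]                       # S_p = embedding(O_p) becomes K_(p+1)
```

Feeding the coarse semantic features into every step is what ties each generated word back to the previous round of dialogue, as described above; the word ids would finally be mapped to text and spliced in generation order (step 206).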
206: and splicing the at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In summary, in the coarse semantic based reply sentence determining method provided by the present application, the previous round of voice information of the user at the current moment is acquired, and coarse semantic extraction is performed on it to obtain semantic features that contain the high-level abstract information in the previous round of voice information; these serve as the coarse semantic features of the voice information of the user at the current moment, realizing the synchronous extraction of the key information and the coarse information in the previous round of voice information. Then, word segmentation processing is performed on the voice information at the current moment of the user, and multiple hidden feature extraction processes are performed on the obtained keywords to obtain the initial hidden layer state feature vector of the voice information. Finally, multiple reply word generation processes are performed according to the coarse semantic features and the initial hidden layer state feature vector, and the obtained at least one reply word is spliced according to the generation order of each reply word to obtain the reply sentence of the voice information. In this way, the coarse semantic features containing the key information and the coarse information of the previous round of dialogue serve as one of the bases for generating the reply sentence of the current round, so that the generation process incorporates more comprehensive information features of the previous round of dialogue. Therefore, the generated reply sentence has higher precision, matches the topic of the dialogue better, and improves the user experience.
Referring to fig. 8, fig. 8 is a functional block diagram of a reply sentence determining apparatus based on coarse semantics according to an embodiment of the present application. As shown in fig. 8, the coarse semantic based reply sentence determination apparatus 800 includes:
an obtaining module 801, configured to obtain the previous round of voice information adjacent to the voice information according to the occurrence time of the voice information at the current moment of the user, where the occurrence time of the previous round of voice information is earlier than the occurrence time of the voice information, and the absolute value of the difference between the two occurrence times is the smallest;
theprocessing module 802 is configured to perform rough semantic extraction on the voice information according to the previous round of voice information to obtain rough semantic features corresponding to the voice information, perform word segmentation processing on the voice information to obtain a keyword group, and perform multiple times of hidden feature extraction processing on the keyword group to obtain an initial hidden layer feature vector;
thegenerating module 803 is configured to perform multiple reply word generating processing according to the rough semantic feature and the initial hidden layer feature vector to obtain at least one reply word, and splice the at least one reply word according to a generating sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In an embodiment of the present application, in performing rough semantic extraction on the voice information according to the previous round of voice information to obtain the rough semantic features corresponding to the voice information, the processing module 802 is specifically configured to:
detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word comprises a word label;
determining temporal information of the previous round of voice information according to at least one first word;
adding the temporal information into the word label of each first word to obtain at least one second word, where the at least one second word is in one-to-one correspondence with the at least one first word;
inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information and at least one first hidden layer state feature vector, where the at least one rough context information and the at least one first hidden layer state feature vector are each in one-to-one correspondence with the at least one second word;
and inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes to obtain the rough semantic features of the voice information.
In an embodiment of the present application, the processing module 802 is specifically configured to, in determining the temporal information of the previous round of voice information according to the at least one first word:
inputting the at least one first word into a gated recurrent unit (GRU) encoder for encoding to obtain a second hidden layer state feature vector;
inputting the second hidden layer state feature vector into a multi-layer perceptron to obtain a linear output result;
and inputting the linear output result into a temporal classifier to obtain the temporal information of the previous round of voice information.
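The three-stage chain above (GRU encoding, multi-layer perceptron, temporal classifier) can be sketched numerically. The following Python fragment is only an illustration with randomly initialized stand-in weights and assumed dimensions; the embodiment does not specify sizes, activations, or the set of tense classes:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (assumed)

def gru_step(h, x, W, U, b):
    """One simplified gated recurrent unit step (update gate + candidate)."""
    z = 1 / (1 + np.exp(-(W[0] @ x + U[0] @ h + b[0])))  # update gate
    c = np.tanh(W[1] @ x + U[1] @ (h * z) + b[1])        # candidate state
    return (1 - z) * h + z * c

W = rng.normal(size=(2, d, d))
U = rng.normal(size=(2, d, d))
b = rng.normal(size=(2, d))

# Encode embeddings of the tagged first words into the second
# hidden layer state feature vector.
first_words = rng.normal(size=(3, d))   # stand-in word embeddings
h = np.zeros(d)
for x in first_words:
    h = gru_step(h, x, W, U, b)

# Multi-layer perceptron producing the linear output result.
M1 = rng.normal(size=(d, d))
M2 = rng.normal(size=(3, d))            # 3 assumed tense classes
linear_out = M2 @ np.maximum(M1 @ h, 0)

# Temporal classifier: softmax over the assumed tense classes.
p = np.exp(linear_out - linear_out.max())
tense_probs = p / p.sum()
temporal_info = ["past", "present", "future"][int(tense_probs.argmax())]
```

With trained parameters, `temporal_info` would be the temporal information attached to the word labels in the next step.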
In an embodiment of the present application, in inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes, the processing module 802 is specifically configured to:
in the i-th decoding process, inputting the input feature vector A_i into the rough decoder to obtain an output feature vector B_i, where i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one rough context information and j is an integer greater than or equal to 1, and when i = 1, the input feature vector A_1 is the 1st rough context information of the at least one rough context information;
calculating the similarity D_i between the output feature vector B_i and the i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
normalizing the similarity D_i to obtain the weight E_i of the input feature vector A_i;
multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;
adding the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;
and taking the target output feature vector G_i as the input feature vector A_(i+1) of the (i+1)-th decoding process and performing the (i+1)-th decoding process, until the multiple decoding processes are completed and the rough semantic features of the voice information are obtained.
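The i-th decoding step above resembles a single-vector attention update. A minimal numpy sketch follows; cosine similarity and a sigmoid normalization are assumptions, since the embodiment does not fix the similarity measure or the normalization, and the decoder weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
d, j = 6, 4                         # feature size and number of rough context vectors (assumed)

Wd = rng.normal(size=(d, d)) * 0.1  # stand-in rough-decoder weights
context = rng.normal(size=(j, d))   # rough context information, one per second word
hidden = rng.normal(size=(j, d))    # first hidden layer state feature vectors C_1..C_j

A = context[0]                      # A_1 is the 1st rough context information
for i in range(j):
    B = np.tanh(Wd @ A)                                          # output feature vector B_i
    C = hidden[i]
    D = float(B @ C) / (np.linalg.norm(B) * np.linalg.norm(C))   # similarity D_i (cosine, assumed)
    E = 1.0 / (1.0 + np.exp(-D))                                 # weight E_i (sigmoid normalization, assumed)
    F = E * C                                                    # weighted feature vector F_i
    A = F + B                                                    # G_i, fed back as A_(i+1)
rough_semantic_feature = A          # result after the j-th decoding process
```

Each pass blends the decoder output with the matching encoder hidden state, so the final vector carries information from every second word of the previous round.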
In an embodiment of the present application, the keyword group includes at least one keyword, and the at least one keyword is arranged according to the position of each keyword in the voice information. Based on this, in performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer feature vector, the processing module 802 is specifically configured to:
in the n-th hidden feature extraction process, inputting the first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword and m is an integer greater than or equal to 1, and when n = 1, the first input hidden feature H_1 is the 1st keyword of the at least one keyword;
and taking the first output hidden feature I_n as the first input hidden feature H_(n+1) of the (n+1)-th hidden feature extraction process and performing the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are completed and the initial hidden layer state feature vector is obtained.
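This recurrence can be sketched as follows. Read literally, the embodiment feeds only H_n forward; the sketch below additionally feeds keyword n into each GRU step, which is an assumption about the intended reading (otherwise keywords after the first would never enter the encoder), and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 6, 5                          # hidden size and keyword count (assumed)
keywords = rng.normal(size=(m, d))   # stand-in keyword embeddings, in utterance order

Wz, Uz, Wc, Uc = (rng.normal(size=(d, d)) * 0.2 for _ in range(4))

def gru_encode(h_prev, x):
    """Simplified gated recurrent unit step."""
    z = 1 / (1 + np.exp(-(Wz @ x + Uz @ h_prev)))   # update gate
    c = np.tanh(Wc @ x + Uc @ h_prev)               # candidate state
    return (1 - z) * h_prev + z * c

H = keywords[0]                      # H_1 is the 1st keyword
for n in range(m):
    # first output hidden feature I_n becomes the next input H_(n+1)
    H = gru_encode(H, keywords[n])
initial_hidden_vector = H            # initial hidden layer state feature vector
```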
In an embodiment of the present application, in performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer feature vector to obtain at least one reply word, the generating module 803 is specifically configured to:
in the p-th reply word generation process, inputting the input word vector K_p, the second input hidden feature L_p and the rough semantic features into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, where p is an integer greater than or equal to 1 and less than or equal to q, q is determined by the voice information and is an integer greater than or equal to 1, and when p = 1, the input word vector K_1 is the initial hidden layer feature vector;
performing word embedding processing on the reply word O_p to obtain a reply word vector S_p;
and taking the reply word vector S_p as the input word vector K_(p+1) of the (p+1)-th reply word generation process, taking the second output hidden feature R_p as the second input hidden feature L_(p+1) of the (p+1)-th reply word generation process, and performing the (p+1)-th reply word generation process, until the multiple reply word generation processes are completed and the at least one reply word is obtained.
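A toy numpy rendering of this generation loop is shown below; the vocabulary, the stand-in weights, the fixed reply length q, and greedy word selection are all assumptions for illustration (the embodiment only states that q is determined by the voice information):

```python
import numpy as np

rng = np.random.default_rng(3)
d, q = 6, 4                                   # feature size and reply length (assumed)
vocab = ["the", "film", "starts", "tonight", "<eos>"]
embed = rng.normal(size=(len(vocab), d))      # stand-in word-embedding table

Wk, Wl, Wg = (rng.normal(size=(d, d)) * 0.2 for _ in range(3))
Wo = rng.normal(size=(len(vocab), d)) * 0.2   # output projection to the vocabulary

rough = rng.normal(size=d)   # rough semantic features of the previous round
K = rng.normal(size=d)       # K_1: the initial hidden layer feature vector
L = np.zeros(d)              # L_1: second input hidden feature (assumed zero)

reply_words = []
for p in range(q):
    R = np.tanh(Wk @ K + Wl @ L + Wg @ rough)  # second output hidden feature R_p
    O = vocab[int((Wo @ R).argmax())]          # reply word O_p (greedy choice, assumed)
    reply_words.append(O)
    K = embed[vocab.index(O)]                  # S_p = embedding of O_p -> K_(p+1)
    L = R                                      # R_p -> L_(p+1)

# Splice the reply words in generation order to form the reply sentence.
reply_sentence = " ".join(reply_words)
```

Because the rough semantic features enter every step through `Wg @ rough`, the previous round conditions each generated word, which is the mechanism the embodiment relies on.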
In an embodiment of the present application, in performing word segmentation processing on the voice information to obtain a keyword group, the processing module 802 is specifically configured to:
converting the voice information into text and segmenting the text to obtain at least one first keyword;
combining a first adjacent word and a second adjacent word to obtain at least one second keyword, where the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword and the field interval between the first adjacent word and the second adjacent word is smaller than a first threshold;
matching each second keyword in the at least one second keyword against a preset entity library and screening out the second keywords that fail to match, to obtain at least one third keyword;
deleting, from the at least one first keyword, the first keywords that compose each third keyword, to obtain at least one fourth keyword;
and combining the at least one third keyword with the at least one fourth keyword to obtain the keyword group.
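The segmentation steps above can be illustrated with plain Python; the example words, the entity library, and the restriction to directly adjacent pairs (i.e. pairs whose field interval falls below the first threshold) are all hypothetical:

```python
def build_keyword_group(first_keywords, entity_library):
    """Hypothetical sketch of the claimed keyword-group construction.
    Only directly adjacent first keywords are merged here, standing in for
    pairs whose field interval is below the first threshold."""
    # Merge adjacent first keywords into second-keyword candidates.
    second = [
        (i, first_keywords[i] + first_keywords[i + 1])
        for i in range(len(first_keywords) - 1)
    ]
    # Keep only candidates found in the entity library (third keywords).
    third = [(i, w) for i, w in second if w in entity_library]
    # Remove the first keywords consumed by a third keyword (fourth keywords).
    used = {i for i, _ in third} | {i + 1 for i, _ in third}
    fourth = [w for i, w in enumerate(first_keywords) if i not in used]
    # The keyword group combines the third and fourth keywords.
    return [w for _, w in third] + fourth

group = build_keyword_group(
    ["new", "york", "weather", "today"],
    entity_library={"newyork"},
)
# group -> ["newyork", "weather", "today"]
```

The entity-library match keeps merged words only when they name a known entity, so ordinary word pairs fall back to their individual first keywords.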
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device 900 includes a transceiver 901, a processor 902 and a memory 903, which are connected by a bus 904. The memory 903 is used to store computer programs and data, and the data stored in the memory 903 may be transferred to the processor 902.
The processor 902 is configured to read a computer program in the memory 903 to perform the following operations:
acquiring the previous round of voice information adjacent to the voice information according to the occurrence time of the user's voice information at the current moment, where the occurrence time of the previous round of voice information is earlier than that of the voice information and the absolute value of the difference between the two occurrence times is minimal;
performing rough semantic extraction on the voice information according to the voice information of the previous round to obtain rough semantic features corresponding to the voice information;
performing word segmentation processing on the voice information to obtain a keyword group;
performing multiple hidden feature extraction processes on the keyword group to obtain an initial hidden layer feature vector;
performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer feature vector to obtain at least one reply word;
and splicing the at least one reply word according to the generation sequence of each reply word in the at least one reply word to obtain a reply sentence of the voice information.
In an embodiment of the present application, the processor 902 is specifically configured to perform the following operations in performing rough semantic extraction on the voice information according to the previous round of voice information to obtain the rough semantic features corresponding to the voice information:
detecting the previous round of voice information to obtain at least one first word contained in the previous round of voice information, wherein each first word in the at least one first word comprises a word label;
determining temporal information of the previous round of voice information according to at least one first word;
adding the temporal information into the word label of each first word to obtain at least one second word, where the at least one second word is in one-to-one correspondence with the at least one first word;
inputting the at least one second word into a rough encoder for encoding to obtain at least one rough context information and at least one first hidden layer state feature vector, where the at least one rough context information and the at least one first hidden layer state feature vector are each in one-to-one correspondence with the at least one second word;
and inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes to obtain the rough semantic features of the voice information.
In an embodiment of the present application, the processor 902 is specifically configured to, in determining the temporal information of the previous round of voice information according to the at least one first word:
inputting the at least one first word into a gated recurrent unit (GRU) encoder for encoding to obtain a second hidden layer state feature vector;
inputting the second hidden layer state feature vector into a multi-layer perceptron to obtain a linear output result;
and inputting the linear output result into a temporal classifier to obtain the temporal information of the previous round of voice information.
In an embodiment of the present application, the processor 902 is specifically configured to perform the following operations in inputting the at least one rough context information and the at least one first hidden layer state feature vector into a rough decoder to perform multiple decoding processes to obtain the rough semantic features of the voice information:
in the i-th decoding process, inputting the input feature vector A_i into the rough decoder to obtain an output feature vector B_i, where i is an integer greater than or equal to 1 and less than or equal to j, j is the number of the at least one rough context information and j is an integer greater than or equal to 1, and when i = 1, the input feature vector A_1 is the 1st rough context information of the at least one rough context information;
calculating the similarity D_i between the output feature vector B_i and the i-th first hidden layer state feature vector C_i of the at least one first hidden layer state feature vector;
normalizing the similarity D_i to obtain the weight E_i of the input feature vector A_i;
multiplying the weight E_i by the i-th first hidden layer state feature vector C_i to obtain a weighted feature vector F_i;
adding the weighted feature vector F_i to the output feature vector B_i to obtain a target output feature vector G_i;
and taking the target output feature vector G_i as the input feature vector A_(i+1) of the (i+1)-th decoding process and performing the (i+1)-th decoding process, until the multiple decoding processes are completed and the rough semantic features of the voice information are obtained.
In an embodiment of the present application, the keyword group includes at least one keyword, and the at least one keyword is arranged according to the position of each keyword in the voice information. Based on this, the processor 902 is specifically configured to perform the following operations in performing multiple hidden feature extraction processes on the keyword group to obtain the initial hidden layer feature vector:
in the n-th hidden feature extraction process, inputting the first input hidden feature H_n into a gated recurrent unit encoder to obtain a first output hidden feature I_n, where n is an integer greater than or equal to 1 and less than or equal to m, m is the number of the at least one keyword and m is an integer greater than or equal to 1, and when n = 1, the first input hidden feature H_1 is the 1st keyword of the at least one keyword;
and taking the first output hidden feature I_n as the first input hidden feature H_(n+1) of the (n+1)-th hidden feature extraction process and performing the (n+1)-th hidden feature extraction process, until the multiple hidden feature extraction processes are completed and the initial hidden layer state feature vector is obtained.
In an embodiment of the present application, the processor 902 is specifically configured to perform the following operations in performing multiple reply word generation processes according to the rough semantic features and the initial hidden layer feature vector to obtain at least one reply word:
in the p-th reply word generation process, inputting the input word vector K_p, the second input hidden feature L_p and the rough semantic features into a gated recurrent unit decoder to obtain a reply word O_p and a second output hidden feature R_p, where p is an integer greater than or equal to 1 and less than or equal to q, q is determined by the voice information and is an integer greater than or equal to 1, and when p = 1, the input word vector K_1 is the initial hidden layer feature vector;
performing word embedding processing on the reply word O_p to obtain a reply word vector S_p;
and taking the reply word vector S_p as the input word vector K_(p+1) of the (p+1)-th reply word generation process, taking the second output hidden feature R_p as the second input hidden feature L_(p+1) of the (p+1)-th reply word generation process, and performing the (p+1)-th reply word generation process, until the multiple reply word generation processes are completed and the at least one reply word is obtained.
In an embodiment of the present application, in performing word segmentation processing on the voice information to obtain a keyword group, the processor 902 is specifically configured to perform the following operations:
converting the voice information into text and segmenting the text to obtain at least one first keyword;
combining a first adjacent word and a second adjacent word to obtain at least one second keyword, where the first adjacent word and the second adjacent word are any two different first keywords in the at least one first keyword and the field interval between the first adjacent word and the second adjacent word is smaller than a first threshold;
matching each second keyword in the at least one second keyword against a preset entity library and screening out the second keywords that fail to match, to obtain at least one third keyword;
deleting, from the at least one first keyword, the first keywords that compose each third keyword, to obtain at least one fourth keyword;
and combining the at least one third keyword with the at least one fourth keyword to obtain the keyword group.
It should be understood that the rough-semantic-based reply sentence determining device in the present application may include a smart phone (such as an Android phone, an iOS phone or a Windows phone), a tablet computer, a palmtop computer, a notebook computer, a mobile internet device (MID), a robot, a wearable device and the like. The devices listed above are merely examples and are not exhaustive; the rough-semantic-based reply sentence determining device includes but is not limited to them. In practical applications, it may further include an intelligent vehicle-mounted terminal, a computer device and the like.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software in combination with a hardware platform. With such understanding, all of the technical solution of the present application, or the part that contributes to the background art, may be embodied in the form of a software product. The software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the various embodiments, or in parts of the embodiments, of the present application.
Accordingly, the present application also provides a computer-readable storage medium storing a computer program that is executed by a processor to implement some or all of the steps of any one of the rough semantic based reply sentence determination methods described in the above method embodiments. For example, the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, etc.
The present application also provides a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the rough semantic based reply sentence determination methods described in the method embodiments above.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated units described above may be implemented either in hardware or in software program modules.
The integrated units, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk or the like.
The embodiments of the present application have been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may make modifications to the specific implementations and the application scope according to the ideas of the present application. In summary, the content of this specification should not be construed as limiting the present application.