CN109308896B - Voice processing method and device, storage medium and processor - Google Patents

Voice processing method and device, storage medium and processor

Info

Publication number
CN109308896B
CN109308896B
Authority
CN
China
Prior art keywords
model
speech
vectors
processing
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710633042.2A
Other languages
Chinese (zh)
Other versions
CN109308896A (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Huitong Jinke Data Co ltd
Original Assignee
Jiangsu Huitong Jinke Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Huitong Jinke Data Co., Ltd.
Priority to CN201710633042.2A (CN109308896B)
Priority to PCT/CN2018/079848 (WO2019019667A1)
Publication of CN109308896A
Application granted
Publication of CN109308896B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention discloses a voice processing method and device, a storage medium and a processor. The method comprises the following steps: acquiring speech vectors at a plurality of moments within a preset time period; processing the speech vectors at the multiple moments by using a preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, wherein the preset speech model processes the speech vectors at the multiple moments on the basis of pre-stored parameter vectors at the multiple moments; and outputting the multiple pieces of text information. The invention solves the technical problem of low processing efficiency of speech processing methods in the prior art.

Description

Voice processing method and device, storage medium and processor
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for processing speech, a storage medium, and a processor.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between users and computers in natural language, and it is a science integrating linguistics, computer science and mathematics. Research in this field therefore concerns natural language, i.e. the language people use every day, so it is closely related to linguistics, although with important differences.
Currently, common natural language processing methods include conditional random fields (CRF), hidden Markov models (HMM), recurrent neural network models (RNN), long short-term memory models (LSTM), and the like. However, improving processing accuracy requires increasing the model depth, which results in high processing complexity and low processing efficiency.
No effective solution has yet been proposed for the problem of low processing efficiency of speech processing methods in the prior art.
Disclosure of Invention
Embodiments of the present invention provide a speech processing method and apparatus, a storage medium, and a processor, so as to at least solve the technical problem of low processing efficiency of speech processing methods in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a speech processing method, including: acquiring speech vectors at a plurality of moments within a preset time period; processing the speech vectors at the multiple moments by using a preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, wherein the preset speech model processes the speech vectors at the multiple moments on the basis of pre-stored parameter vectors at the multiple moments; and outputting the multiple pieces of text information.
Further, the preset speech model includes a speech processing model, which is configured to process the speech vectors at the multiple moments based on the parameter vectors at the multiple moments to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments.
Further, processing the speech vectors at the multiple moments by using the preset speech model to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments includes: acquiring first parameter vectors at the multiple moments from a parameter matrix through a read operation; modifying the speech processing model by using the first parameter vectors at the multiple moments to obtain a modified speech processing model; and processing the speech vectors at the multiple moments by using the modified speech processing model to obtain the multiple pieces of text information.
Further, while processing the speech vectors at the multiple moments by using the modified speech processing model to obtain the multiple pieces of text information, the method further includes: obtaining second parameter vectors at the multiple moments by using the modified speech processing model; and writing the second parameter vectors at the multiple moments into the parameter matrix through a write operation.
Further, obtaining the second parameter vectors at the multiple moments by using the modified speech processing model includes: updating the first parameter vectors at the multiple moments by using the modified speech processing model to obtain the second parameter vectors at the multiple moments.
Further, before processing the speech vectors at the multiple moments by using the preset speech model to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments, the method further includes: establishing an initial preset model, wherein the initial preset model includes a speech processing model and an initial parameter matrix; acquiring training data, wherein the training data includes a plurality of training speech vectors and the text information corresponding to each training speech vector; and training the initial preset model according to the training data to obtain the preset speech model.
Further, training the initial preset model according to the training data to obtain the preset speech model includes: inputting the training data into the speech processing model to obtain preset parameter vectors; and writing the preset parameter vectors into the initial parameter matrix through a write operation to obtain the parameter matrix.
Further, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
Further, the preset time period is determined according to the processing capability of the preset speech model.
According to another aspect of the embodiments of the present invention, there is also provided a speech processing apparatus, including: a first acquisition module, configured to acquire speech vectors at a plurality of moments within a preset time period; a processing module, configured to process the speech vectors at the multiple moments by using a preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, wherein the preset speech model processes the speech vectors at the multiple moments based on pre-stored parameter vectors at the multiple moments; and an output module, configured to output the multiple pieces of text information.
Further, the preset speech model includes a speech processing model, which is configured to process the speech vectors at the multiple moments based on the parameter vectors at the multiple moments to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments.
Further, the processing module includes: an acquisition submodule, configured to acquire first parameter vectors at the multiple moments from the parameter matrix through a read operation; a modification submodule, configured to modify the speech processing model by using the first parameter vectors at the multiple moments to obtain a modified speech processing model; and a first processing submodule, configured to process the speech vectors at the multiple moments by using the modified speech processing model to obtain the multiple pieces of text information.
Further, the processing module further includes: a second processing submodule, configured to obtain second parameter vectors at the multiple moments by using the modified speech processing model; and a first storage submodule, configured to write the second parameter vectors at the multiple moments into the parameter matrix through a write operation.
Further, the second processing submodule is further configured to update the first parameter vectors at the multiple moments by using the modified speech processing model to obtain the second parameter vectors at the multiple moments.
Further, the above apparatus further includes: an establishing module, configured to establish an initial preset model, wherein the initial preset model includes a speech processing model and an initial parameter matrix; a second acquisition module, configured to acquire training data, wherein the training data includes a plurality of training speech vectors and the text information corresponding to each training speech vector; and a training module, configured to train the initial preset model according to the training data to obtain the preset speech model.
Further, the training module includes: a third processing submodule, configured to input the training data into the speech processing model to obtain preset parameter vectors; and a second storage submodule, configured to write the preset parameter vectors into the initial parameter matrix through a write operation to obtain the preset speech model.
Further, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
Further, the above apparatus further includes: a determining module, configured to determine the preset time period according to the processing capability of the preset speech model.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the speech processing method in the above embodiments.
According to another aspect of the embodiments of the present invention, there is also provided a processor configured to run a program, wherein the program, when run, executes the speech processing method in the above embodiments.
In the embodiments of the present invention, speech vectors at multiple moments within a preset time period are acquired, the speech vectors at the multiple moments are processed by using a preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, and the multiple pieces of text information are output, thereby realizing natural language processing. It is easy to note that, because the speech vectors at multiple moments within the preset time period are acquired and the preset speech model processes them based on pre-stored parameter vectors at the multiple moments, the natural speech is processed by exploiting its temporal characteristics and combining the memory matrix of a neural Turing machine with an LSTM model, which solves the technical problem of low processing efficiency of speech processing methods in the prior art. Therefore, the scheme provided by the embodiments of the present invention can improve processing efficiency and processing accuracy while reducing processing complexity and processing time.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flowchart of a speech processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an optional preset speech model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an optional repeating module of a speech processing model according to an embodiment of the invention; and
FIG. 4 is a schematic diagram of a speech processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, an embodiment of a speech processing method is provided. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system capable of executing a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one described here.
Fig. 1 is a flowchart of a speech processing method according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps:
Step S102: acquiring speech vectors at a plurality of moments within a preset time period.
Optionally, in the foregoing embodiment of the present invention, the preset time period is determined according to the processing capability of the preset speech model.
Specifically, the preset time period may be set according to the processing capability of the actual speech processing model, and the multiple moments may be multiple sampling moments at equal intervals. For example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 moments can be acquired within the 100 s.
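The relationship between the preset time period, the sampling interval and the number of speech vectors can be illustrated with a minimal sketch. The patent does not specify how the speech vectors are computed, so the feature extraction used here (a pooled magnitude spectrum) and all function and parameter names are assumptions made only to keep the example self-contained.

```python
import numpy as np

def sample_speech_vectors(signal, sample_rate_hz, period_s=100, interval_s=10, dim=40):
    """Split a 1-D speech signal into equally spaced sampling moments and turn
    each moment into a fixed-length feature ("speech") vector.

    Illustrative sketch only: the pooled magnitude spectrum is a stand-in for
    whatever front end the preset speech model actually uses.
    """
    vectors = []
    for t in range(0, period_s, interval_s):              # 0 s, 10 s, ..., 90 s -> 10 moments
        start = t * sample_rate_hz
        frame = signal[start:start + interval_s * sample_rate_hz]
        spectrum = np.abs(np.fft.rfft(frame))              # crude spectral feature for this moment
        bins = np.array_split(spectrum, dim)               # pool into `dim` bins of equal count
        vectors.append(np.array([b.mean() if b.size else 0.0 for b in bins]))
    return np.stack(vectors)                                # shape: (10, dim) for a 100 s period

# Example: 100 s of audio at 16 kHz yields 10 speech vectors, one per sampling moment.
audio = np.random.randn(100 * 16000)
speech_vectors = sample_speech_vectors(audio, sample_rate_hz=16000)
print(speech_vectors.shape)  # (10, 40)
```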
Step S104: processing the speech vectors at the multiple moments by using a preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, wherein the preset speech model processes the speech vectors at the multiple moments based on pre-stored parameter vectors at the multiple moments.
Optionally, in the above embodiment of the present invention, the preset speech model includes a speech processing model, which is configured to process the speech vectors at the multiple moments based on the parameter vectors at the multiple moments to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments.
Optionally, in the above embodiment of the present invention, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
Specifically, the preset speech model may be a neural Turing machine. As shown in fig. 2, the neural Turing machine includes two components: a controller (namely the speech processing model) and a memory matrix (namely the parameter matrix). The memory matrix is an external memory matrix that stores the parameter vectors required by the speech processing model for speech processing, and the controller can read and write the parameter vectors in the memory matrix. The speech processing model may be an LSTM model, which is a special type of RNN that can learn long-term dependency information; LSTM avoids the long-term dependency problem through a deliberate design. Specifically, like other RNNs, LSTM has a chain of repeating neural network modules, but unlike a single neural network layer, each repeating module has a different structure: as shown in fig. 3, it may be composed of an input gate, a forget gate and an output gate that interact in a particular way, thereby alleviating the vanishing-gradient and exploding-gradient problems of RNNs.
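A minimal numpy sketch of the two-component structure described above (a controller plus an external memory matrix accessed through read and write heads) is given below. The dense-layer controller, the dimensions and the uniform addressing weights are assumptions chosen for brevity, not the implementation described here, which uses an LSTM controller.

```python
import numpy as np

class MemoryMatrix:
    """External memory matrix (the parameter matrix): N slots of width M."""
    def __init__(self, n_slots=128, width=64):
        self.data = np.zeros((n_slots, width))

    def read(self, weights):
        # Read head: attention weights over slots -> one read vector of length `width`.
        return weights @ self.data

    def write(self, weights, erase, add):
        # Write head: soft erase followed by soft add, as in a neural Turing machine.
        self.data = self.data * (1 - np.outer(weights, erase)) + np.outer(weights, add)

class Controller:
    """Stand-in for the LSTM controller; a single dense layer keeps the sketch short."""
    def __init__(self, in_dim, width, rng=np.random.default_rng(0)):
        self.w = rng.standard_normal((in_dim + width, width)) * 0.1

    def step(self, x, read_vector):
        # Combine the current speech vector with the vector read from memory.
        return np.tanh(np.concatenate([x, read_vector]) @ self.w)

# One step: the controller reads a parameter vector from memory, combines it with the
# current speech vector, and writes an updated vector back through the write head.
memory = MemoryMatrix()
controller = Controller(in_dim=40, width=64)
read_weights = np.full(128, 1 / 128)                      # uniform addressing, illustration only
hidden = controller.step(np.zeros(40), memory.read(read_weights))
memory.write(read_weights, erase=np.full(64, 0.1), add=hidden)
```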
Step S106: outputting the multiple pieces of text information.
In an optional scheme, according to the temporal characteristics of natural speech, natural speech data at multiple sampling moments within the preset time period can be acquired to obtain speech vectors at the multiple moments; a pre-trained neural Turing machine is then obtained, the speech vectors at the multiple moments are recognized by the neural Turing machine to obtain the corresponding text information, and the recognized text information is output.
According to the embodiment of the invention, the speech vectors at multiple moments within the preset time period are acquired, the speech vectors at the multiple moments are processed by using the preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, and the multiple pieces of text information are output, thereby realizing natural language processing. It is easy to note that, because the speech vectors at multiple moments within the preset time period are acquired and the preset speech model processes them based on pre-stored parameter vectors at the multiple moments, the natural speech is processed by exploiting its temporal characteristics and combining the memory matrix of a neural Turing machine with an LSTM model, which solves the technical problem of low processing efficiency of speech processing methods in the prior art. Therefore, the scheme provided by the embodiment of the invention can improve processing efficiency and processing accuracy while reducing processing complexity and processing time.
Optionally, in the foregoing embodiment of the present invention, step S104, processing the speech vectors at the multiple moments by using the preset speech model to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments, includes:
Step S1040: acquiring first parameter vectors at the multiple moments from the parameter matrix through a read operation.
Specifically, as shown in fig. 2, the neural Turing machine may include a read head and a write head, wherein the read head can read the W parameters of the LSTM model from the memory matrix, and the write head can write new W parameters into the memory matrix.
Step S1042: modifying the speech processing model by using the first parameter vectors at the multiple moments to obtain a modified speech processing model.
Step S1044: processing the speech vectors at the multiple moments by using the modified speech processing model to obtain the multiple pieces of text information.
In an optional scheme, after the speech vectors at the multiple moments are acquired, for the natural speech processing at each moment the read head may read a W parameter vector from the memory matrix and feed it into the LSTM model, thereby modifying the LSTM model to obtain a modified LSTM model; the speech vector at that moment may then be input into the modified LSTM model as the input vector to obtain the output vector of the LSTM model, namely the text information of that speech vector. After all the speech vectors at the multiple moments have been processed, the multiple pieces of text information corresponding to the speech vectors at the multiple moments are obtained.
Optionally, in the foregoing embodiment of the present invention, in step S1044, while processing the speech vectors at the multiple moments by using the modified speech processing model to obtain the multiple pieces of text information, the method further includes:
Step S1046: obtaining second parameter vectors at the multiple moments by using the modified speech processing model.
Optionally, in the foregoing embodiment of the present invention, step S1046, obtaining the second parameter vectors at the multiple moments by using the modified speech processing model, includes:
Step S10462: updating the first parameter vectors at the multiple moments by using the modified speech processing model to obtain the second parameter vectors at the multiple moments.
Step S1048: writing the second parameter vectors at the multiple moments into the parameter matrix through a write operation.
In an optional scheme, for the natural speech processing at each moment, processing the speech vector with the LSTM model yields not only the text information of the speech vector but also a new W parameter vector, and the new W parameter vector is written into the memory matrix by the write head to serve as the W parameter vector for the next moment.
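Steps S1040 to S1048 amount to one read-process-write cycle per moment. The sketch below illustrates that cycle under assumed interfaces: read_head, write_head, lstm_step and decode are hypothetical callables standing in for the neural Turing machine components, not functions defined by the patent.

```python
import numpy as np

def process_sequence(speech_vectors, memory, read_head, write_head, lstm_step, decode):
    """Per-moment loop: read the first parameter vector W, use it to modify the
    LSTM step, process the speech vector to get its text information, and write
    the second (updated) parameter vector back for the next moment."""
    texts = []
    state = None
    for x_t in speech_vectors:
        w_first = read_head(memory)                               # read operation (S1040)
        output, state, w_second = lstm_step(x_t, state, w_first)  # modified model (S1042/S1044/S1046)
        write_head(memory, w_second)                              # write operation (S1048)
        texts.append(decode(output))                              # text information for this moment
    return texts

# Toy demo with stand-in components (assumptions, not the patent's model).
mem = {"W": np.zeros(8)}
texts = process_sequence(
    speech_vectors=np.random.randn(10, 4),
    memory=mem,
    read_head=lambda m: m["W"],
    write_head=lambda m, w: m.update(W=w),
    lstm_step=lambda x, s, w: (x.sum() + w.sum(), s, w + 0.1),
    decode=lambda o: f"token_{o:.2f}",
)
print(texts[:3])
```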
Optionally, in the foregoing embodiment of the present invention, before step S104, processing the speech vectors at the multiple moments by using the preset speech model to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments, the method further includes:
Step S108: establishing an initial preset model, wherein the initial preset model includes a speech processing model and an initial parameter matrix.
Step S110: acquiring training data, wherein the training data includes a plurality of training speech vectors and the text information corresponding to each training speech vector.
Step S112: training the initial preset model according to the training data to obtain the preset speech model.
In an optional scheme, the LSTM model in the neural Turing machine may be established in advance according to actual processing requirements, the W parameter vectors in the memory matrix may be set to initial values, and the pre-established neural Turing machine may then be trained with the training data to obtain a neural Turing machine with higher accuracy.
Optionally, in the foregoing embodiment of the present invention, step S112, training the initial preset model according to the training data to obtain the preset speech model, includes:
Step S1122: inputting the training data into the speech processing model to obtain preset parameter vectors.
Step S1124: writing the preset parameter vectors into the initial parameter matrix through a write operation to obtain the parameter matrix.
In an optional scheme, in order to obtain a neural Turing machine with higher accuracy, the plurality of training speech vectors in the training data may be used as input vectors and the text information corresponding to each training speech vector as the target output vector; these input-output pairs are fed into the LSTM model to obtain the preset W parameter vectors of the LSTM model, and the preset W parameter vectors are written into the memory matrix through the write head, so that a neural Turing machine with higher accuracy is obtained.
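The training flow of steps S108-S112 and S1122-S1124 can be sketched as follows. The gradient-descent update, the linear stand-in for the LSTM and the function names are assumptions made for illustration; the text above only states that the training data is fed into the speech processing model and the resulting preset parameter vectors are written into the initial parameter matrix.

```python
import numpy as np

def train_initial_model(train_vectors, train_texts, lstm_forward, lstm_backward,
                        init_memory, epochs=5, lr=0.01):
    """Illustrative training loop: training speech vectors are the inputs, their text
    information the targets, and the resulting preset W parameter vectors are written
    back into the initial parameter matrix to obtain the trained parameter matrix."""
    w = init_memory.copy()                       # initial parameter matrix (memory matrix)
    for _ in range(epochs):
        for x, y in zip(train_vectors, train_texts):
            pred = lstm_forward(x, w)            # forward pass with current W parameters
            grad = lstm_backward(x, y, pred, w)  # gradient of the loss w.r.t. W
            w = w - lr * grad                    # updated preset parameter vector
    return w                                     # written back as the trained parameter matrix

# Toy demo with a linear stand-in for the LSTM (assumption, for illustration only).
rng = np.random.default_rng(0)
xs, ys = rng.standard_normal((50, 3)), rng.standard_normal(50)
w_trained = train_initial_model(
    xs, ys,
    lstm_forward=lambda x, w: x @ w,
    lstm_backward=lambda x, y, p, w: 2 * (p - y) * x,
    init_memory=np.zeros(3),
)
print(w_trained)
```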
Example 2
According to an embodiment of the present invention, there is provided an embodiment of a speech processing apparatus.
Fig. 4 is a schematic diagram of a speech processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes:
The first acquisition module 41 is configured to acquire speech vectors at a plurality of moments within a preset time period.
Optionally, in the above embodiment of the present invention, the apparatus further includes: a determining module, configured to determine the preset time period according to the processing capability of the preset speech model.
Specifically, the preset time period may be set according to the processing capability of the model, and the multiple moments may be multiple sampling moments at equal intervals. For example, if the preset time period is 100 s and the sampling interval is 10 s, speech vectors at 10 moments can be acquired within the 100 s.
The processing module 43 is configured to process the speech vectors at the multiple moments by using a preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, wherein the preset speech model processes the speech vectors at the multiple moments based on pre-stored parameter vectors at the multiple moments.
Optionally, in the above embodiment of the present invention, the preset speech model includes a speech processing model, which is configured to process the speech vectors at the multiple moments based on the parameter vectors at the multiple moments to obtain the multiple pieces of text information corresponding to the speech vectors at the multiple moments.
Optionally, in the above embodiment of the present invention, the speech processing model is an LSTM model, and the parameter matrix is a memory matrix.
Specifically, the preset speech model may be a neural Turing machine. As shown in fig. 2, the neural Turing machine includes two components: a controller (namely the speech processing model) and a memory matrix (namely the parameter matrix). The memory matrix is an external memory matrix that stores the parameter vectors required by the speech processing model for speech processing, and the controller can read and write the parameter vectors in the memory matrix. The speech processing model may be an LSTM model, which is a special type of RNN that can learn long-term dependency information; LSTM avoids the long-term dependency problem through a deliberate design. Specifically, like other RNNs, LSTM has a chain of repeating neural network modules, but unlike a single neural network layer, each repeating module has a different structure: as shown in fig. 3, it may be composed of an input gate, a forget gate and an output gate that interact in a particular way, thereby alleviating the vanishing-gradient and exploding-gradient problems of RNNs.
The output module 45 is configured to output the multiple pieces of text information.
In an optional scheme, according to the temporal characteristics of natural speech, natural speech data at multiple sampling moments within the preset time period can be acquired to obtain speech vectors at the multiple moments; a pre-trained neural Turing machine is then obtained, the speech vectors at the multiple moments are recognized by the neural Turing machine to obtain the corresponding text information, and the recognized text information is output.
According to the embodiment of the invention, the speech vectors at multiple moments within the preset time period are acquired, the speech vectors at the multiple moments are processed by using the preset speech model to obtain multiple pieces of text information corresponding to the speech vectors at the multiple moments, and the multiple pieces of text information are output, thereby realizing natural language processing. It is easy to note that, because the speech vectors at multiple moments within the preset time period are acquired and the preset speech model processes them based on pre-stored parameter vectors at the multiple moments, the natural speech is processed by exploiting its temporal characteristics and combining the memory matrix of a neural Turing machine with an LSTM model, which solves the technical problem of low processing efficiency of speech processing methods in the prior art. Therefore, the scheme provided by the embodiment of the invention can improve processing efficiency and processing accuracy while reducing processing complexity and processing time.
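A brief sketch of how the three modules of the apparatus compose is given below. The class and field names mirror the description above, but the wiring and the stand-in callables are assumptions for illustration only, not the apparatus implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class SpeechProcessingApparatus:
    """Sketch of the apparatus structure: acquisition, processing and output modules."""
    first_acquisition_module: Callable[[], Sequence]      # speech vectors in the preset period
    processing_module: Callable[[Sequence], List[str]]    # preset speech model (NTM with LSTM controller)
    output_module: Callable[[List[str]], None]            # emits the pieces of text information

    def run(self) -> List[str]:
        vectors = self.first_acquisition_module()
        texts = self.processing_module(vectors)
        self.output_module(texts)
        return texts

# Wiring the modules together with trivial stand-ins.
apparatus = SpeechProcessingApparatus(
    first_acquisition_module=lambda: [[0.0] * 4 for _ in range(10)],
    processing_module=lambda vectors: [f"text_{i}" for i, _ in enumerate(vectors)],
    output_module=print,
)
apparatus.run()
```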
Optionally, in the above embodiment of the present invention, the processing module 43 includes:
An acquisition submodule, configured to acquire first parameter vectors at the multiple moments from the parameter matrix through a read operation.
Specifically, as shown in fig. 2, the neural Turing machine may include a read head and a write head, wherein the read head can read the W parameters of the LSTM model from the memory matrix, and the write head can write new W parameters into the memory matrix.
A modification submodule, configured to modify the speech processing model by using the first parameter vectors at the multiple moments to obtain a modified speech processing model.
A first processing submodule, configured to process the speech vectors at the multiple moments by using the modified speech processing model to obtain the multiple pieces of text information.
In an optional scheme, after the speech vectors at the multiple moments are acquired, for the natural speech processing at each moment the read head may read a W parameter vector from the memory matrix and feed it into the LSTM model, thereby modifying the LSTM model to obtain a modified LSTM model; the speech vector at that moment may then be input into the modified LSTM model as the input vector to obtain the output vector of the LSTM model, namely the text information of that speech vector. After all the speech vectors at the multiple moments have been processed, the multiple pieces of text information corresponding to the speech vectors at the multiple moments are obtained.
Optionally, in the above embodiment of the present invention, the processing module 43 further includes:
A second processing submodule, configured to obtain second parameter vectors at the multiple moments by using the modified speech processing model.
Optionally, in the foregoing embodiment of the present invention, the second processing submodule is further configured to update the first parameter vectors at the multiple moments by using the modified speech processing model to obtain the second parameter vectors at the multiple moments.
A first storage submodule, configured to write the second parameter vectors at the multiple moments into the parameter matrix through a write operation.
In an optional scheme, for the natural speech processing at each moment, processing the speech vector with the LSTM model yields not only the text information of the speech vector but also a new W parameter vector, and the new W parameter vector is written into the memory matrix by the write head to serve as the W parameter vector for the next moment.
Optionally, in the above embodiment of the present invention, the apparatus further includes:
An establishing module, configured to establish an initial preset model, wherein the initial preset model includes a speech processing model and an initial parameter matrix.
A second acquisition module, configured to acquire training data, wherein the training data includes a plurality of training speech vectors and the text information corresponding to each training speech vector.
A training module, configured to train the initial preset model according to the training data to obtain the preset speech model.
In an optional scheme, the LSTM model in the neural Turing machine may be established in advance according to actual processing requirements, the W parameter vectors in the memory matrix may be set to initial values, and the pre-established neural Turing machine may then be trained with the training data to obtain a neural Turing machine with higher accuracy.
Optionally, in the above embodiment of the present invention, the training module includes:
A third processing submodule, configured to input the training data into the speech processing model to obtain preset parameter vectors.
A second storage submodule, configured to write the preset parameter vectors into the initial parameter matrix through a write operation to obtain the parameter matrix.
In an optional scheme, in order to obtain a neural Turing machine with higher accuracy, the plurality of training speech vectors in the training data may be used as input vectors and the text information corresponding to each training speech vector as the target output vector; these input-output pairs are fed into the LSTM model to obtain the preset W parameter vectors of the LSTM model, and the preset W parameter vectors are written into the memory matrix through the write head, so that a neural Turing machine with higher accuracy is obtained.
Example 3
According to an embodiment of the present invention, there is provided a storage medium including a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the speech processing method in embodiment 1 above.
Example 4
According to an embodiment of the present invention, there is provided a processor configured to run a program, wherein the program, when run, executes the speech processing method in embodiment 1 above.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection according to some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, or a magnetic or optical disk.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (11)

CN201710633042.2A | Priority date 2017-07-28 | Filing date 2017-07-28 | Voice processing method and device, storage medium and processor | Active | Granted as CN109308896B (en)

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
CN201710633042.2A (CN109308896B) | 2017-07-28 | 2017-07-28 | Voice processing method and device, storage medium and processor
PCT/CN2018/079848 (WO2019019667A1) | 2017-07-28 | 2018-03-21 | Speech processing method and apparatus, storage medium and processor

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201710633042.2A (CN109308896B) | 2017-07-28 | 2017-07-28 | Voice processing method and device, storage medium and processor

Publications (2)

Publication Number | Publication Date
CN109308896A (en) | 2019-02-05
CN109308896B (en) | 2022-04-15

Family

ID=65040955

Family Applications (1)

Application Number | Priority Date | Filing Date
CN201710633042.2A (Active; CN109308896B (en)) | 2017-07-28 | 2017-07-28

Country Status (2)

Country | Link
CN (1) | CN109308896B (en)
WO (1) | WO2019019667A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112489630A (en) * | 2019-09-12 | 2021-03-12 | 武汉Tcl集团工业研究院有限公司 | A kind of speech recognition method and device
CN113095559B (en) * | 2021-04-02 | 2024-04-09 | 京东科技信息技术有限公司 | Method, device, equipment and storage medium for predicting hatching time
CN113836270A (en) * | 2021-09-28 | 2021-12-24 | 深圳格隆汇信息科技有限公司 | Big data processing method and related product


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US6021403A (en)*1996-07-192000-02-01Microsoft CorporationIntelligent user assistance facility
CN1295672C (en)*2002-03-272007-01-17诺基亚有限公司Pattern recognition
JP2010204391A (en)*2009-03-032010-09-16Nippon Telegr & Teleph Corp <Ntt>Voice signal modeling method, signal recognition device and method, parameter learning device and method, and feature value generating device, method, and program
US9378729B1 (en)*2013-03-122016-06-28Amazon Technologies, Inc.Maximum likelihood channel normalization
CN105989839B (en)*2015-06-032019-12-13乐融致新电子科技(天津)有限公司Speech recognition method and device
DE102015211101B4 (en)*2015-06-172025-02-06Volkswagen Aktiengesellschaft Speech recognition system and method for operating a speech recognition system with a mobile unit and an external server

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101123090A (en) * | 2006-08-11 | 2008-02-13 | 哈曼贝克自动系统股份有限公司 | Speech recognition by statistical language using square-root discounting
WO2013054347A2 (en) * | 2011-07-20 | 2013-04-18 | Tata Consultancy Services Limited | A method and system for detecting boundary of coarticulated units from isolated speech
CN105070300A (en) * | 2015-08-12 | 2015-11-18 | 东南大学 | Voice emotion characteristic selection method based on speaker standardization change
CN106157950A (en) * | 2016-09-29 | 2016-11-23 | 合肥华凌股份有限公司 | Speech control system and awakening method, Rouser and household electrical appliances, coprocessor

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
《Deep Convolutional Neural Networks with Layer-wise》; D. Yu et al.; Proc. Interspeech; 2016-12-31; entire document *
《基于正交的神经网络在语音识别上的应用研究》 (Research on the application of orthogonality-based neural networks in speech recognition); 倪蔚民; 《电脑开发与应用》 (Computer Development & Applications); 1998-12-31; entire document *
《神经网络—CNN结构和语音识别应用》 (Neural networks: CNN structure and speech recognition applications); xumcas; CSDN; 2017-01-31; pages 3-4, Section 4, speech applications *
《面向自然语音处理的深度学习研究》 (Research on deep learning for natural speech processing); 奚雪峰 et al.; 《自动化学报》 (Acta Automatica Sinica); 2016-10-31; entire document *

Also Published As

Publication number | Publication date
CN109308896A (en) | 2019-02-05
WO2019019667A1 (en) | 2019-01-31

Similar Documents

Publication | Title
CN110457994B (en) | Face image generation method and device, storage medium and computer equipment
CN109002433B (en) | Text generation method and device
Engesser et al. | Meaningful call combinations and compositional processing in the southern pied babbler
CN111104799B (en) | Text information characterization method, system, computer equipment and storage medium
US9824681B2 (en) | Text-to-speech with emotional content
CN112837676B (en) | Statement generation method, statement generation device and intelligent device
US11010554B2 (en) | Method and device for identifying specific text information
CN109308896B (en) | Voice processing method and device, storage medium and processor
CN109559221A (en) | Collection method, apparatus and storage medium based on user data
CN108682420A (en) | A kind of voice and video telephone accent recognition method and terminal device
CN108346107B (en) | Social content risk identification method, device and equipment
CN107729314A (en) | A kind of Chinese time recognition methods, device and storage medium, program product
CN107229733B (en) | Extended question evaluation method and device
EP3101598A3 (en) | Augmented neural networks
CN104077930A (en) | Powerful memorizing and learning method based on mobile terminal and mobile terminal
CN107657313B (en) | System and method for transfer learning of natural language processing task based on field adaptation
CN110990556B (en) | Idiom recommendation method and device, training method and device of idiom recommendation model
CN106897265A (en) | Term vector training method and device
CN111949786B (en) | Intelligent question-answering model optimization method and device
KR102745272B1 (en) | Device and method for emotion transplantation
CN107577654A (en) | E-book color matching method, electronic equipment and storage medium based on front cover analysis
CN109255106A (en) | A kind of text handling method and terminal
CN113255357A (en) | Data processing method, target recognition model training method, target recognition method and device
CN110232116B (en) | Method and device for adding expressions in reply sentence
CN115795028B (en) | Intelligent document generation method and system

Legal Events

Date | Code | Title | Description
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
2022-03-10 | TA01 | Transfer of patent application right | Effective date of registration: 2022-03-10. Address after: 215000 Floor 9, Building 5, Asia Pacific Plaza, No. 18 Zhaofeng Road, Huaqiao Town, Kunshan City, Suzhou City, Jiangsu Province. Applicant after: Jiangsu Huitong Jinke Data Co., Ltd. Address before: 518000 Guangdong, Shenzhen, Nanshan District, Nanhai Road, West Guangxi Temple Road North, Sunshine Huayi Building 1, 15D-02F. Applicant before: SHEN ZHEN KUANG-CHI HEZHONG TECHNOLOGY Ltd.; Shenzhen Guangqi Innovation Technology Co., Ltd.
| GR01 | Patent grant |
