CN111161724A - Method, system, equipment and medium for Chinese audio-visual combined speech recognition - Google Patents

Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Info

Publication number
CN111161724A
Authority
CN
China
Prior art keywords
character sequence
sequence
recognition model
audio
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911297060.3A
Other languages
Chinese (zh)
Other versions
CN111161724B (en)
Inventor
郭永亮
张坤雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aiways Automobile Shanghai Co Ltd
Original Assignee
Aiways Automobile Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aiways Automobile Shanghai Co Ltd
Priority to CN201911297060.3A
Publication of CN111161724A
Application granted
Publication of CN111161724B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention provides a method, a system, a device and a medium for Chinese audio-visual combined speech recognition, wherein the method comprises the following steps: respectively receiving a video signal and an audio signal to be recognized; inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the model; and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the model. The invention provides an end-to-end, sentence-level audio-visual combined speech recognition scheme for Chinese, combining deep neural networks with an attention mechanism to fully mine and fuse the features of the audio and video signals, which helps improve the recognition capability of the speech recognition system.

Description

Method, system, equipment and medium for Chinese audio-visual combined speech recognition
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method, a system, equipment and a medium for recognizing Chinese audio-visual combined voice.
Background
Speech recognition technology enables a machine to convert speech signals into corresponding text or commands through a process of recognition and understanding. Over the last two decades, speech recognition technology has made remarkable progress and has begun to move from the laboratory to the market; in recent years in particular, advances in artificial intelligence have driven major breakthroughs in speech recognition research. At present, speech recognition technology is widely applied in vehicle-mounted systems, social chat, smart homes and other fields, bringing great convenience to people's lives and demonstrating strong practicality.
In general, the input of speech recognition is only an audio signal. If the recognition process can use audio and video signals simultaneously, the two signals complement each other, making the input information richer and improving recognition accuracy.
For the audio-visual combined speech recognition problem that uses audio and video signals simultaneously, there is at present almost no solution designed specifically for Chinese: most solutions target English speech recognition or do not specify a language. Chinese, however, has its particularities. For example, Chinese words have no strict morphological inflection, and the number of commonly used Chinese characters is large, about 3500. These complex characteristics make the audio-visual combined Chinese speech recognition task challenging, so audio-visual solutions designed for English, or general-purpose ones, cannot be directly used to solve the Chinese speech recognition problem. In addition, some existing audio-visual combined speech recognition schemes only solve word-level recognition and cannot handle sentence-level recognition of continuous speech; other schemes use traditional machine learning methods, which require manual feature extraction, involve a complex process, and achieve limited recognition results.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a deep-learning-based method, system, device and medium for Chinese audio-visual combined speech recognition, offering an end-to-end, sentence-level audio-visual speech recognition scheme for Chinese.
The embodiment of the invention provides a method for recognizing Chinese audio-visual combined voice, which comprises the following steps:
respectively receiving a video signal and an audio signal to be recognized;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
Optionally, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder;
wherein inputting the video signal and the audio signal into the trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the model comprises the following steps:
the video encoder extracts a feature sequence of the video signal;
the audio encoder extracts a feature sequence of the audio signal;
the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism to obtain a fused feature sequence;
and the first decoder outputs a pinyin character sequence according to the fused feature sequence.
Optionally, the video encoder extracts a feature sequence of the video signal, including the steps of:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image characteristics of each frame of image output by the convolutional neural network;
inputting the image features into a first recurrent neural network, extracting time sequence features among the image frame sequences, and taking an output sequence of the first recurrent neural network as a feature sequence of the video signal.
Optionally, the audio encoder extracting a feature sequence of the audio signal includes:
calculating mel-frequency cepstrum coefficient values of the audio signal;
and inputting the Mel-frequency cepstral coefficient values into a second recurrent neural network, and extracting the feature sequence of the audio signal.
Optionally, the second recurrent neural network comprises three long-short term memory layers;
the fusing, by the audio encoder, of the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism comprises: the audio encoder fusing, in the top long-short term memory layer of the second recurrent neural network and based on the attention mechanism, the state data of that layer with the feature sequence of the video signal.
Optionally, the first decoder outputs a pinyin character sequence according to the fused feature sequence, and includes the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
Optionally, the Chinese character sequence recognition model includes an encoder and a second decoder;
wherein inputting the pinyin character sequence into the trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the model comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence;
inputting, by the second decoder, the features of the pinyin character sequence into a fifth recurrent neural network, wherein the fifth recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the Chinese character sequence from the output of the classification layer of the fifth recurrent neural network.
Optionally, the fourth recurrent neural network and the fifth recurrent neural network each include two gated recurrent unit layers.
Optionally, the method further includes training the pinyin character sequence recognition model and the Chinese character sequence recognition model by:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model.
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition system, which applies the above Chinese audio-visual combined speech recognition method and comprises:
a signal receiving module, used for respectively receiving a video signal and an audio signal to be recognized;
a pinyin recognition module, used for inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the model;
and a Chinese character recognition module, used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the model.
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition device, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method via execution of the executable instructions.
The embodiment of the invention also provides a computer-readable storage medium for storing a program which, when executed, implements the steps of the Chinese audio-visual combined speech recognition method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The Chinese audio-visual combined speech recognition method, system, device and medium provided by the invention have the following advantages:
The invention solves the problems in the prior art and provides an end-to-end, sentence-level audio-visual combined speech recognition scheme for Chinese, filling a gap in this technical field. Furthermore, the machine learning model for audio-visual combined speech recognition is constructed with deep neural networks, so the whole process requires no manual feature extraction; compared with traditional machine learning approaches, it extracts the features of the audio and video signals well and helps improve the recognition capability of the speech recognition system. Furthermore, the invention combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio signal and the video signal, so the recognition effect is greatly improved compared with simple multi-modal feature concatenation.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method for Chinese audio-visual combined speech recognition according to an embodiment of the present invention;
FIG. 2 is a flow diagram of recognizing sentences from audio and video signals according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of a pinyin character sequence recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the structure of a Chinese character sequence recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Chinese audio-visual combined speech recognition system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a Chinese audio-visual combined speech recognition device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As shown in fig. 1, in order to solve the problems of the prior art, the present invention provides a method for Chinese audio-visual combined speech recognition, which comprises the following steps:
S100: respectively receiving a video signal and an audio signal to be recognized;
S200: inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the model;
S300: inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the model.
The invention provides an end-to-end audio-visual combined speech recognition method: given an input video signal and audio signal, it directly obtains the corresponding Chinese character sequence, i.e., a sentence containing the spoken content. Moreover, the invention is tailored to Chinese: a pinyin character sequence is first obtained from the video and audio signals in step S200, and Chinese character recognition is then performed in step S300, so that the audio and video signals are fully mined and fused and the recognition effect is greatly improved.
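For concreteness, here is a minimal Python sketch of this two-stage inference pipeline. It is an illustration, not the patent's implementation: `av2p` and `p2s` stand for the two trained models described below, and the greedy, non-autoregressive decoding is a simplifying assumption (the decoders described later are autoregressive).

```python
import torch

@torch.no_grad()
def recognize(av2p, p2s, video, audio):
    """video: (1, T, 3, H, W) lip-motion frames; audio: (1, num_samples) waveform."""
    pinyin_logits = av2p(audio, video)      # S200: fuse audio and video, predict pinyin
    pinyin = pinyin_logits.argmax(dim=-1)   # greedy choice of pinyin character sequence p
    hanzi_logits = p2s(pinyin)              # S300: convert pinyin to Chinese characters
    return hanzi_logits.argmax(dim=-1)      # recognized sentence s as character ids
```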
As shown in fig. 2, the inputs of the pinyin character sequence recognition model are the audio signal of a person speaking and a lip-motion video of the speaker, i.e., a sequence of picture frames. The pinyin character sequence recognition model fuses the input audio and video signals and finally outputs a pinyin character sequence. This process can be expressed by the following formula:
p = av2p(a, v)    (1)
where a = (a_1, a_2, …, a_n) represents the audio signal sequence, v = (v_1, v_2, …, v_m) represents the picture frame sequence of the lip motion, p = (p_1, p_2, …, p_k) represents the recognized pinyin character sequence, and av2p(·) represents the pinyin character sequence recognition model that converts the audio and video signals into a pinyin character sequence.
The input of the Chinese character sequence recognition model is the pinyin character sequence output by the pinyin character sequence recognition model; by learning the features of the pinyin character sequence, it outputs a Chinese character sequence representing the spoken content. This process can be expressed by the following formula:
s = p2s(p)    (2)
where p = (p_1, p_2, …, p_k) represents the input pinyin character sequence, s = (c_1, c_2, …, c_l) represents the recognized Chinese character sequence, i.e., the sentence, c_i represents the i-th Chinese character in the sentence s, and p2s(·) represents the Chinese character sequence recognition model that converts the pinyin character sequence into a Chinese character sequence.
The following describes the working process of the pinyin character sequence recognition model and the Chinese character sequence recognition model in a specific embodiment with reference to fig. 3 and 4.
As shown in fig. 3, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder. Specifically, in this embodiment, the pinyin character sequence recognition model may be a sequence-to-sequence (Seq2Seq) model based on the attention mechanism.
The step S200, inputting the video signal and the audio signal into the trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the model, comprises the following steps:
S210: the video encoder extracts a feature sequence of the video signal;
S220: the audio encoder extracts a feature sequence of the audio signal;
S230: the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism to obtain a fused feature sequence;
S240: the first decoder outputs a pinyin character sequence according to the fused feature sequence.
The step S210, in which the video encoder extracts the feature sequence of the video signal, comprises the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image features of each frame output by the network. The convolutional neural network may be a residual network (ResNet), whose input may be a three-channel RGB image. Convolutional neural networks (CNN) are a class of deep feed-forward neural networks built on convolution operations; residual networks are easy to optimize and can gain accuracy from considerably increased depth, because their residual blocks use skip connections that alleviate the vanishing-gradient problem caused by increasing depth;
inputting the image features into a first recurrent neural network, extracting the temporal features across the image frame sequence, and taking the output sequence of the first recurrent neural network as the feature sequence of the video signal. In this embodiment, the first recurrent neural network may be a three-layer long short-term memory (LSTM) network used to extract the temporal features across the image frames, with the output sequence of the top LSTM layer taken as the feature sequence of the video signal output by the video encoder. A recurrent neural network (RNN) takes sequence data as input and recurses along the sequence direction, with all recurrent units connected in a chain; the long short-term memory network is a special RNN capable of learning long-term dependencies.
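A minimal PyTorch sketch of such a video encoder follows. The choice of ResNet-18 and the layer sizes are illustrative assumptions; the patent only specifies a ResNet front end followed by a three-layer LSTM.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoEncoder(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        resnet = models.resnet18(weights=None)                    # per-frame CNN (ResNet)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # drop the final fc layer
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=3, batch_first=True)

    def forward(self, frames):                # frames: (B, T, 3, H, W) RGB images
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))    # (B*T, 512, 1, 1) image features per frame
        x = x.flatten(1).view(b, t, -1)       # (B, T, 512)
        out, _ = self.lstm(x)                 # temporal features across the frame sequence
        return out                            # (B, T, hidden_dim): video feature sequence
```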
The step S220, in which the audio encoder extracts the feature sequence of the audio signal, comprises the following steps:
calculating the Mel-frequency cepstral coefficient (MFCC) values of the audio signal; MFCCs are features widely used in automatic speech and speaker recognition;
inputting the Mel-frequency cepstral coefficient values into a second recurrent neural network and extracting the feature sequence of the audio signal; in this embodiment, the second recurrent neural network may also be a three-layer long short-term memory network, with the state data output by its top layer used as the feature sequence of the audio signal.
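A matching sketch of the audio encoder, assuming torchaudio for the MFCC computation; the sample rate and the number of coefficients are illustrative assumptions not specified by the patent.

```python
import torch
import torch.nn as nn
import torchaudio

class AudioEncoder(nn.Module):
    def __init__(self, n_mfcc=13, hidden_dim=256):
        super().__init__()
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=n_mfcc)
        self.lstm = nn.LSTM(n_mfcc, hidden_dim, num_layers=3, batch_first=True)

    def forward(self, waveform):          # waveform: (B, num_samples)
        feats = self.mfcc(waveform)       # (B, n_mfcc, T) Mel-frequency cepstral coefficients
        feats = feats.transpose(1, 2)     # (B, T, n_mfcc)
        out, _ = self.lstm(feats)         # top-layer state sequence
        return out                        # (B, T, hidden_dim): audio feature sequence
```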
The step S230, in which the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism, comprises the following steps:
in the top long short-term memory layer of the second recurrent neural network, the audio encoder fuses the state data of that layer with the feature sequence of the video signal based on the attention mechanism. Specifically, the fusion employs the following equations (3) and (4):
a_ij = score(value_j, query_i)    (3)
C_i = Σ_j a_ij · value_j    (4)
where query_i represents the state data of the top LSTM layer of the audio encoder and value_j represents the output of the top LSTM layer of the video encoder; after fusion through the attention mechanism, the output of the top LSTM layer of the audio encoder is the fused feature of the audio and video signals.
The step S240, in which the first decoder outputs a pinyin character sequence according to the fused feature sequence, comprises the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer; in this embodiment, the feature extraction layer may be a single long short-term memory layer with a four-head attention mechanism used to improve performance, and the classification layer may be a softmax layer, which outputs the predicted pinyin character sequence p = (p_1, p_2, …, p_k);
and the pinyin character sequence is obtained from the output of the classification layer of the third recurrent neural network.
When the pinyin character sequence recognition model is trained, the input of the model is g = (g_1, g_2, …, g_k), where g_i represents the ground-truth sample label; it is processed by an LSTM layer and a softmax layer, which finally output the predicted pinyin character sequence p = (p_1, p_2, …, p_k). In the test phase, g_i represents the output p_i of the network at time i.
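A minimal sketch of this first decoder, with teacher forcing at training time as just described. The vocabulary size and dimensions are illustrative assumptions, and `fused` is assumed to share the decoder's feature dimension.

```python
import torch
import torch.nn as nn

class PinyinDecoder(nn.Module):
    def __init__(self, vocab_size=400, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, num_layers=1, batch_first=True)      # feature extraction layer
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)  # four-head attention
        self.classify = nn.Linear(d_model, vocab_size)                             # softmax classification layer

    def forward(self, tokens, fused):
        # tokens: ground-truth labels g at training time (teacher forcing);
        # at test time, the characters predicted at the previous steps.
        q, _ = self.lstm(self.embed(tokens))     # (B, k, d_model)
        ctx, _ = self.attn(q, fused, fused)      # attend over the fused feature sequence
        return self.classify(ctx)                # logits; softmax yields p = (p_1, ..., p_k)
```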
As shown in fig. 4, in this embodiment, the Chinese character sequence recognition model includes an encoder and a second decoder. Specifically, in this embodiment, the Chinese character sequence recognition model is also implemented on the Seq2Seq framework. The step S300, inputting the pinyin character sequence into the trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the model, comprises the following steps:
the encoder combines the pinyinInputting the character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence; the input of the encoder is a pinyin character sequence p ═ p (p)1,p2,…,pk),piIn this embodiment, the fourth Recurrent neural network includes two layers of Gated Recurrent Unit (GRU) networks, and the output sequence is O ═ O (O ═ g ═ O1,O2,…,Ok) Output at time iiFor parameterizing the input p at the next momenti+1The predicted distribution of (2); the gated cyclic unit is a commonly used gated cyclic neural network, and the gated cyclic neural network is proposed to better capture the dependence relationship with larger time step distance in a time sequence.
The second decoder inputs the features of the pinyin character sequence into a fifth recurrent neural network, wherein the fifth recurrent neural network comprises a feature extraction layer and a classification layer; in this embodiment, the fifth recurrent neural network may include two gated recurrent unit layers, and the classification layer may be a softmax layer;
and the Chinese character sequence s = (c_1, c_2, …, c_l) is obtained from the output of the classification layer of the fifth recurrent neural network.
In the training stage of the Chinese character sequence recognition model, the input of the decoder is the Chinese character sequence y = (y_1, y_2, …, y_l), where y_i represents the ground-truth sample label, which is input into the fifth recurrent neural network for calculation.
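A minimal sketch of this pinyin-to-Chinese-character Seq2Seq model with two GRU layers on each side; the vocabulary sizes and dimensions are illustrative assumptions (the patent mentions roughly 3500 commonly used Chinese characters).

```python
import torch
import torch.nn as nn

class PinyinToHanzi(nn.Module):
    def __init__(self, pinyin_vocab=400, hanzi_vocab=3500, d_model=256):
        super().__init__()
        self.src_embed = nn.Embedding(pinyin_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, num_layers=2, batch_first=True)  # fourth RNN
        self.tgt_embed = nn.Embedding(hanzi_vocab, d_model)
        self.decoder = nn.GRU(d_model, d_model, num_layers=2, batch_first=True)  # fifth RNN
        self.classify = nn.Linear(d_model, hanzi_vocab)                          # softmax layer

    def forward(self, pinyin, hanzi_in):
        # pinyin: (B, k) pinyin character ids; hanzi_in: (B, l) ground-truth
        # labels y at training time (teacher forcing).
        _, h = self.encoder(self.src_embed(pinyin))         # extract pinyin sequence features
        out, _ = self.decoder(self.tgt_embed(hanzi_in), h)  # decode conditioned on them
        return self.classify(out)                           # logits over the Chinese character vocabulary
```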
The Chinese audio-visual combined speech recognition system constructed by the invention comprises two models, the pinyin character sequence recognition model and the Chinese character sequence recognition model, and thus has a multi-model structure, so the problem of multi-model training must be solved to obtain the final recognition system. In this embodiment, the Chinese audio-visual combined speech recognition method further comprises training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps (sketched in code after the list):
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model until the Chinese character sequence recognition model converges;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model until the pinyin character sequence recognition model converges;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model until it converges, thereby obtaining the complete Chinese audio-visual combined speech recognition system.
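A minimal sketch of this three-stage schedule, assuming PyTorch; `forward_fn` (which composes the two models on a batch), the optimizer, the loss, and the convergence test are illustrative assumptions.

```python
import torch.nn as nn
import torch.optim as optim

def set_trainable(model: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a model."""
    for p in model.parameters():
        p.requires_grad = flag

def train_stage(models, forward_fn, batches, lr=1e-4):
    """One training stage: only parameters left unfrozen are updated."""
    params = [p for m in models for p in m.parameters() if p.requires_grad]
    opt = optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for batch in batches:                   # in practice, iterate until convergence
        logits, labels = forward_fn(batch)  # run the composed models on the batch
        loss = loss_fn(logits.flatten(0, 1), labels.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: set_trainable(av2p, False); set_trainable(p2s, True);  train_stage([av2p, p2s], fwd, data)
# Stage 2: set_trainable(p2s, False);  set_trainable(av2p, True); train_stage([av2p, p2s], fwd, data)
# Stage 3: set_trainable(av2p, True);  set_trainable(p2s, True);  train_stage([av2p, p2s], fwd, data)
```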
As shown in fig. 5, an embodiment of the present invention further provides a Chinese audio-visual combined speech recognition system, which applies the above Chinese audio-visual combined speech recognition method and includes:
a signal receiving module M100, configured to respectively receive a video signal and an audio signal to be recognized;
a pinyin recognition module M200, configured to input the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the model;
and a Chinese character recognition module M300, configured to input the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the model.
The invention provides an end-to-end audio-visual combined speech recognition system: given input video and audio signals, the method of the invention directly obtains the corresponding Chinese character sequence. Moreover, the invention is tailored to Chinese character recognition: the pinyin recognition module M200 obtains a pinyin character sequence from the video and audio signals, and the Chinese character recognition module M300 then performs Chinese character recognition, so that the audio and video signals are fully mined and fused and the recognition effect is greatly improved.
The pinyin character sequence recognition model of the invention may have the structure shown in fig. 3, and the Chinese character sequence recognition model may have the structure shown in fig. 4. The pinyin recognition module M200 may obtain the pinyin character sequence using the specific implementation of steps S210 to S240 described above, but the invention is not limited thereto. The Chinese character recognition module M300 may obtain the Chinese character sequence using the specific embodiment of step S300 described above, but the invention is not limited thereto.
The embodiment of the invention also provides a Chinese audio-visual combined voice recognition device, which comprises a processor; a memory having stored therein executable instructions of the processor; wherein the processor is configured to perform the steps of the Chinese audiovisual combined with speech recognition method via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit", "module" or "platform".
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different platform components (including the storage unit 620 and the processing unit 610), a display unit 640, etc.
The storage unit stores program code executable by the processing unit 610 to cause the processing unit 610 to perform the steps according to the various exemplary embodiments of the present invention described in the method sections above of this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiment of the invention also provides a computer-readable storage medium storing a program which, when executed, implements the steps of the Chinese audio-visual combined speech recognition method. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the invention described above in this specification.
Referring to fig. 7, a program product 800 for implementing the above method according to an embodiment of the present invention is described; it may employ a portable compact disc read-only memory (CD-ROM) including program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto: in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
In summary, compared with the prior art, the method, system, device and medium for Chinese audio-visual combined speech recognition provided by the invention have the following advantages:
the invention solves the problems in the prior art, provides a scheme of audio-visual combined speech recognition aiming at Chinese sentence level based on an end-to-end mode, and fills the blank of the technical field; furthermore, the machine learning model combining audio and video with voice recognition is constructed by utilizing the deep neural network, the whole process does not need to manually extract the characteristics, and compared with the traditional machine learning mode, the method can well extract the characteristics of audio and video signals and is beneficial to improving the recognition capability of a voice recognition system; furthermore, the invention combines the deep neural network and the attention mechanism, and fully excavates and fuses the characteristics of the audio signal and the video signal, so that the recognition effect is greatly improved compared with a multi-mode characteristic splicing mode.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention shall not be considered limited to these descriptions. Those skilled in the art can make several simple deductions or substitutions without departing from the spirit of the invention, and all of these shall be considered as falling within the protection scope of the invention.

Claims (12)

CN201911297060.3A | 2019-12-16 | 2019-12-16 | Method, system, equipment and medium for Chinese audio-visual combined speech recognition | Active | Granted as CN111161724B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911297060.3A (granted as CN111161724B) | 2019-12-16 | 2019-12-16 | Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911297060.3A (granted as CN111161724B) | 2019-12-16 | 2019-12-16 | Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Publications (2)

Publication Number | Publication Date
CN111161724A (en) | 2020-05-15
CN111161724B (en) | 2022-12-13

Family

ID=70557201

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911297060.3A | Method, system, equipment and medium for Chinese audio-visual combined speech recognition | 2019-12-16 | 2019-12-16 (Active; granted as CN111161724B)

Country Status (1)

Country | Link
CN | CN111161724B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20040117191A1 (en)* | 2002-09-12 | 2004-06-17 | Nambi Seshadri | Correlating video images of lip movements with audio signals to improve speech recognition
CN101825953A (en)* | 2010-04-06 | 2010-09-08 | 朱建政 | Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN102347026A (en)* | 2011-07-04 | 2012-02-08 | 深圳市子栋科技有限公司 | Audio/video on demand method and system based on natural voice recognition
CN108073875A (en)* | 2016-11-14 | 2018-05-25 | 广东技术师范学院 | A noisy speech recognition system and method based on a monocular camera
CN109410918A (en)* | 2018-10-15 | 2019-03-01 | 百度在线网络技术(北京)有限公司 | Method and device for obtaining information


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢磊 (Xie Lei) et al.: "An audio-video continuous speech recognition system based on data sieving", 《计算机应用》 (Computer Applications)*

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112102830A (en)* | 2020-09-14 | 2020-12-18 | 广东工业大学 | Coarse granularity instruction identification method and device
CN112786052A (en)* | 2020-12-30 | 2021-05-11 | 科大讯飞股份有限公司 | Speech recognition method, electronic device and storage device
CN112786052B (en)* | 2020-12-30 | 2024-05-31 | 科大讯飞股份有限公司 | Speech recognition method, electronic equipment and storage device
CN113033538A (en)* | 2021-03-25 | 2021-06-25 | 北京搜狗科技发展有限公司 | Formula identification method and device
CN113033538B (en)* | 2021-03-25 | 2024-05-10 | 北京搜狗科技发展有限公司 | Formula identification method and device
CN114519999A (en)* | 2022-02-15 | 2022-05-20 | 平安科技(深圳)有限公司 | Speech recognition method, device, equipment and storage medium based on bimodal model

Also Published As

Publication number | Publication date
CN111161724B (en) | 2022-12-13


Legal Events

Date | Code | Title | Description
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |
2023-01-31 | PP01 | Preservation of patent right | Granted publication date: 2022-12-13
2024-01-08 | PD01 | Discharge of preservation of patent | Granted publication date: 2022-12-13
2024-02-27 | PP01 | Preservation of patent right | Granted publication date: 2022-12-13
