Disclosure of Invention
The embodiments of the present application aim to provide a speech recognition method, a speech recognition device, a computer device, and a storage medium, so as to solve the problems of low recognition accuracy and low recognition efficiency in the prior art.
To solve the above technical problems, an embodiment of the present application provides a speech recognition method that adopts the following technical solution:
matching a service decoding graph and a static decoding graph corresponding to speech recognition according to a service scenario, wherein the static decoding graph is constructed from the service decoding graph and a base decoding graph;
acquiring a speech to be recognized and a client hotword list corresponding to the speech to be recognized;
decoding the speech to be recognized through the service decoding graph to obtain a preliminary decoding result;
if the client hotword list contains client hotwords, constructing a client decoding graph from the client hotwords in the list, constructing a fusion decoding graph from the client decoding graph and the static decoding graph, and taking the fusion decoding graph as a target decoding graph;
if the client hotword list does not contain client hotwords, taking the static decoding graph as the target decoding graph; and
decoding the preliminary decoding result through the target decoding graph to obtain a target decoding result.
Further, before the step of matching the service decoding graph and the static decoding graph corresponding to speech recognition according to the service scenario, the method further includes:
acquiring the fusion type of the service decoding graph and the base decoding graph; and
determining the specific expression of the static decoding graph according to the fusion type.
Further, the step of determining the specific expression of the static decoding graph according to the fusion type includes:
if the fusion type is a linear fusion type, the specific expression of the static decoding graph is C(s_G(w|H), s_B(w|H)) = α1*s_G(w|H) + β1*s_B(w|H);
if the fusion type is an exponential linear fusion type, the specific expression of the static decoding graph is C(s_G(w|H), s_B(w|H)) = -log(α1*exp(-s_G(w|H)) + β1*exp(-s_B(w|H)));
wherein α1 and β1 are variables in the specific expression of the static decoding graph, s_G(w|H) is the score output by the service decoding graph based on the historical decoding state, and s_B(w|H) is the score output by the base decoding graph based on the historical decoding state.
Further, the step of decoding the preliminary decoding result through the target decoding graph includes:
if the target decoding graph is the fusion decoding graph, decoding the preliminary decoding result through a first formula of the target decoding graph, s(w|H) = -log(α2*exp(-C(s_G(w|H), s_B(w|H))) + β2*exp(-s_S(w|H))), wherein α2 and β2 are variables and s_S(w|H) is the score output by the client decoding graph based on the historical decoding state;
if the target decoding graph is the static decoding graph, decoding the preliminary decoding result through a second formula of the target decoding graph, s(w|H) = s_G(w|H) if (w|H) ∉ B and s(w|H) = C(s_G(w|H), s_B(w|H)) if (w|H) ∈ B, wherein s_G(w|H) is the score output by the service decoding graph based on the historical decoding state and B is the dictionary of the language model;
in the first formula and the second formula of the target decoding graph, s(w|H) is the target decoding result, and C(s_G(w|H), s_B(w|H)) is the score output by the static decoding graph based on the historical decoding state.
Further, the step of decoding the speech to be recognized through the service decoding graph includes:
extracting audio features from the speech to be recognized;
converting the audio features into a sequence of phonemes by an acoustic model;
and decoding the phoneme sequence through the service decoding graph.
Further, after the step of obtaining the target decoding result, the method further includes:
extracting a new client hotword from the target decoding result, and updating the new client hotword into the client hotword list.
Further, the step of updating the new client hotword into the client hotword list includes:
if no client hotword matching the new client hotword is found in the client hotword list, adding the new client hotword to the client hotword list; and
if a client hotword matching the new client hotword is found in the client hotword list, leaving that client hotword in the list unmodified.
To solve the above technical problems, an embodiment of the present application further provides a speech recognition device that adopts the following technical solution:
a decoding graph matching module, configured to match a service decoding graph and a static decoding graph corresponding to speech recognition according to a service scenario, wherein the static decoding graph is constructed from the service decoding graph and a base decoding graph;
an acquisition module, configured to acquire a speech to be recognized and a client hotword list corresponding to the speech to be recognized;
a preliminary decoding module, configured to decode the speech to be recognized through the service decoding graph to obtain a preliminary decoding result;
a first determining module, configured to, if the client hotword list contains client hotwords, construct a client decoding graph from the client hotwords in the list, construct a fusion decoding graph from the client decoding graph and the static decoding graph, and take the fusion decoding graph as a target decoding graph;
a second determining module, configured to take the static decoding graph as the target decoding graph if the client hotword list does not contain client hotwords; and
a target decoding module, configured to decode the preliminary decoding result through the target decoding graph to obtain a target decoding result.
To solve the above technical problems, an embodiment of the present application further provides a computer device that adopts the following technical solution:
the computer device comprises a memory having stored therein computer-readable instructions which, when executed by a processor, implement the steps of the speech recognition method described above.
To solve the above technical problems, an embodiment of the present application further provides a computer-readable storage medium that adopts the following technical solution:
the computer-readable storage medium has stored thereon computer-readable instructions which, when executed by a processor, implement the steps of the speech recognition method described above.
Compared with the prior art, the embodiments of the present application match a service decoding graph and a static decoding graph corresponding to speech recognition according to the service scenario, the static decoding graph being constructed from the service decoding graph and a base decoding graph; acquire a speech to be recognized and the client hotword list corresponding to it; decode the speech through the service decoding graph to obtain a preliminary decoding result; if the client hotword list contains client hotwords, construct a client decoding graph from those hotwords, construct a fusion decoding graph from the client decoding graph and the static decoding graph, and take the fusion decoding graph as the target decoding graph; if the list contains no client hotwords, take the static decoding graph as the target decoding graph; and decode the preliminary decoding result through the target decoding graph to obtain the target decoding result. In the present application, the speech to be recognized is first decoded through the service decoding graph, so that the preliminary decoding result matches the client corpus of the current service scenario, improving recognition accuracy and efficiency. A target decoding graph is then determined according to whether the client hotword list contains client hotwords, so that the final target decoding result matches the client's speaking habits, further improving accuracy. Meanwhile, thanks to the base decoding graph, the client hotword list remains a lightweight word list, and whether a fusion decoding graph is built is decided by whether the list contains client hotwords, which flexibly adapts to the usage scenario and reduces the impact on recognition efficiency.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in the description are for the purpose of describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second", and the like in the description, the claims, and the figures are used to distinguish between different objects, not to describe a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will appreciate, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages and the like. Various communication client applications, such as web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, and social platform software, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the speech recognition method provided by the embodiments of the present application is generally executed by the server/terminal device, and accordingly, the speech recognition device is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a speech recognition method according to the present application is shown. The speech recognition method comprises the following steps:
Step S201, matching a service decoding graph and a static decoding graph corresponding to speech recognition according to a service scenario, wherein the static decoding graph is constructed from the service decoding graph and a base decoding graph.
In practical applications, a plurality of service decoding graphs may be pre-trained, with each service decoding graph corresponding to one service scenario.
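As an illustration of this matching step, the following minimal sketch assumes that the pre-trained decoding graphs are kept in a registry keyed by service scenario; the registry layout, scenario names, and file names are hypothetical and not specified by the application.

```python
# Hypothetical registry: one (service graph, static graph) pair per scenario.
DECODING_GRAPHS = {
    "banking":  {"service": "hclg_banking.fst",  "static": "static_banking.fst"},
    "shopping": {"service": "hclg_shopping.fst", "static": "static_shopping.fst"},
}

def match_graphs(scene):
    """Step S201: look up the decoding graphs matching the service scenario."""
    entry = DECODING_GRAPHS[scene]
    return entry["service"], entry["static"]
```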
The base decoding graph is built by taking the characters and words of a language model as the dictionary, segmenting the basic hotwords of each scene in a scene basic hotword list, constructing an Aho-Corasick (AC) automaton from the segments, and then converting the automaton into a decoding graph according to a preset weight proportion; compared with the service decoding graph, the base decoding graph covers the largest set of characters and words.
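The construction just described can be sketched as follows. It is a simplified illustration in which the AC automaton is reduced to a weighted trie; the `segment` function and the uniform weight are placeholder assumptions standing in for segmentation against the language-model dictionary and the preset weight proportion.

```python
def segment(hotword):
    # Placeholder segmentation: split into single characters; a real system
    # would segment against the characters and words of the language model.
    return list(hotword)

def build_base_graph(scene_hotwords, weight=1.0):
    """Insert segmented hotwords into a weighted trie (the core of an AC
    automaton); each root-to-leaf path models one hotword, and the preset
    weight is stored at its final node."""
    root = {"children": {}, "weight": 0.0}
    for word in scene_hotwords:
        node = root
        for token in segment(word):
            node = node["children"].setdefault(
                token, {"children": {}, "weight": 0.0})
        node["weight"] = weight
    return root
```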
Matching the service decoding graph and the static decoding graph of the current speech recognition according to the service scenario improves both the accuracy and the efficiency of speech recognition.
Step S202, acquiring a speech to be recognized and a client hotword list corresponding to the speech to be recognized.
Specifically, the electronic device (such as the server/terminal device shown in FIG. 1) on which the speech recognition method runs may receive, through a wired or wireless connection, the speech to be recognized and the corresponding client hotword list sent by the client/service terminal. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G, Wi-Fi, Bluetooth, WiMAX, ZigBee, UWB (ultra-wideband), and other wireless connections now known or later developed.
The client hotword list corresponds to the client of the speech to be recognized and contains N client hotwords, where N ≥ 0 and N is an integer; a client hotword characterizes the personalized corpus of the client.
Step S203, decoding the speech to be recognized through the service decoding graph to obtain a preliminary decoding result.
Specifically, the service decoding graph is an operator-defined FST based on the HCLG structure. The speech to be recognized is decoded through the service decoding graph, and the target text corresponding to the speech is recalled to form a preliminary decoded text (the preliminary decoding result).
Step S204, if the client hotword list contains client hotwords, constructing a client decoding graph from the client hotwords in the client hotword list, constructing a fusion decoding graph from the client decoding graph and the static decoding graph, and taking the fusion decoding graph as the target decoding graph.
Specifically, each client hotword in the client hotword list is decomposed into characters and words through a language model (such as an N-gram language model), an AC automaton is constructed from the decomposed characters and words of each client hotword, and the AC automaton is then converted into the client decoding graph according to a preset weight relationship, where the preset weight relationship characterizes the composition weights of the characters and words within the client hotwords.
When the client hotword list is not empty, it contains at least one client hotword. Constructing the fusion decoding graph from the client decoding graph and the static decoding graph allows the actual decoding process to be personalized to the client, improving the accuracy of speech recognition.
Step S205, if the client hotword list does not contain client hotwords, taking the static decoding graph as the target decoding graph.
When the client hotword list is empty, it contains no client hotwords, so there is no need to construct a client decoding graph or fuse it with the static decoding graph; the preliminary decoding result is decoded through the static decoding graph alone. This flexibly adapts to the corresponding usage scenario and reduces the impact on speech recognition efficiency, as the selection sketch below illustrates.
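A compact sketch of the selection logic of steps S204 and S205, under the assumption that graph construction and fusion are provided as callables; `build_graph_from_hotwords` and `fuse` are illustrative stand-ins, not APIs defined by the application.

```python
def choose_target_graph(client_hotwords, static_graph,
                        build_graph_from_hotwords, fuse):
    """Return the target decoding graph for the current request."""
    if client_hotwords:  # step S204: list contains at least one client hotword
        client_graph = build_graph_from_hotwords(client_hotwords)
        return fuse(client_graph, static_graph)  # fusion decoding graph
    return static_graph  # step S205: fall back to the static decoding graph
```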
Step S206, decoding the preliminary decoding result through the target decoding graph to obtain a target decoding result.
Specifically, in practical applications, the preliminary decoding result is decoded through the target decoding graph based on a weighted finite-state transducer (WFST); the weight ratios of the decoding graphs are combined to obtain the highest-scoring text, which forms the target decoding result, and the text information is then determined from the target decoding result.
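A toy sketch of this rescoring step under stated assumptions: the preliminary decoding result is treated as a list of candidate word sequences, each candidate is scored word by word through a `score_fn` that returns s(w|H), and the best candidate is returned. Since the formulas below use negative-log scores, "highest-scoring" corresponds to the minimum combined score here.

```python
def decode_with_target_graph(candidates, score_fn):
    """candidates: list of word sequences; score_fn(w, history) -> s(w|H)."""
    def total(words):
        history, score = [], 0.0
        for w in words:
            score += score_fn(w, tuple(history))  # per-word target-graph score
            history.append(w)
        return score
    return min(candidates, key=total)  # lowest negative-log score wins
```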
In the present application, the speech to be recognized is first decoded through the service decoding graph, so that the preliminary decoding result matches the client corpus of the current service scenario, improving recognition accuracy and efficiency. A target decoding graph is then determined according to whether the client hotword list contains client hotwords, so that the final target decoding result matches the client's speaking habits, further improving accuracy. Meanwhile, thanks to the base decoding graph, the client hotword list remains a lightweight word list, and whether a fusion decoding graph is built is decided by whether the list contains client hotwords, which flexibly adapts to the usage scenario and reduces the impact on recognition efficiency.
In some optional implementations of this embodiment, before step S201 of matching the service decoding graph and the static decoding graph corresponding to speech recognition according to the service scenario, the method further includes:
acquiring the fusion type of the service decoding graph and the base decoding graph; and
determining the specific expression of the static decoding graph according to the fusion type.
Specifically, the fusion types include a linear fusion type (LL) and an exponential linear (log-linear) fusion type (LIN); computing C(s_G(w|H), s_B(w|H)) with the LIN type generally yields a more accurate result than computing it with the LL type.
In some optional implementations of this embodiment, the step of determining the specific expression of the static decoding graph according to the fusion type includes:
if the fusion type is the linear fusion type, the specific expression of the static decoding graph is C(s_G(w|H), s_B(w|H)) = α1*s_G(w|H) + β1*s_B(w|H);
if the fusion type is the exponential linear fusion type, the specific expression of the static decoding graph is C(s_G(w|H), s_B(w|H)) = -log(α1*exp(-s_G(w|H)) + β1*exp(-s_B(w|H)));
wherein α1 and β1 are variables in the specific expression of the static decoding graph, s_G(w|H) is the score output by the service decoding graph based on the historical decoding state, and s_B(w|H) is the score output by the base decoding graph based on the historical decoding state.
Specifically, α1 and β1 are both variables and their sum is 1, so the weight ratio of s_G(w|H) to s_B(w|H) in the expression for C(s_G(w|H), s_B(w|H)) can be adjusted to the actual situation through the magnitudes of α1 and β1; if α1 is greater than β1, the final result of the expression leans toward the speaking habits of the current service scenario.
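As a minimal sketch, the two expressions can be computed as below, assuming the scores follow the negative-log convention used throughout; the function name and the default fusion type are illustrative assumptions.

```python
import math

def fuse_static(s_g, s_b, alpha1, beta1, fusion_type="LIN"):
    """C(s_G(w|H), s_B(w|H)) for the two fusion types; alpha1 + beta1 = 1."""
    if fusion_type == "LL":  # linear fusion
        return alpha1 * s_g + beta1 * s_b
    # exponential linear (log-linear) fusion of negative-log scores
    return -math.log(alpha1 * math.exp(-s_g) + beta1 * math.exp(-s_b))
```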
In some optional implementations of this embodiment, step S206, the step of decoding the preliminary decoding result through the target decoding graph, includes:
if the target decoding graph is the fusion decoding graph, decoding the preliminary decoding result through the first formula of the target decoding graph, s(w|H) = -log(α2*exp(-C(s_G(w|H), s_B(w|H))) + β2*exp(-s_S(w|H))), wherein α2 and β2 are variables and s_S(w|H) is the score output by the client decoding graph based on the historical decoding state;
if the target decoding graph is the static decoding graph, decoding the preliminary decoding result through the second formula of the target decoding graph, s(w|H) = s_G(w|H) if (w|H) ∉ B and s(w|H) = C(s_G(w|H), s_B(w|H)) if (w|H) ∈ B, wherein s_G(w|H) is the score output by the service decoding graph based on the historical decoding state;
in the first formula and the second formula of the target decoding graph, s(w|H) is the target decoding result, and C(s_G(w|H), s_B(w|H)) is the score output by the static decoding graph based on the historical decoding state.
Specifically, in the first formula of the target decoding graph, α2 and β2 are both variables and their sum is 1, so the weight ratio of C(s_G(w|H), s_B(w|H)) to s_S(w|H) in the first formula can be adjusted to the actual situation through the magnitudes of α2 and β2; if α2 is smaller than β2, the s(w|H) finally calculated through the first formula conforms more closely to the speaking habits of the client.
In the second formula of the target decoding graph, B denotes the dictionary of the language model, from which the base decoding graph is constructed. If (w|H) ∉ B, the dictionary of the language model contains no phrase (w|H), and s(w|H) = s_G(w|H); conversely, if (w|H) ∈ B, the phrase (w|H) exists, and s(w|H) = C(s_G(w|H), s_B(w|H)).
For example, for a given word w and decoding history H, it is judged whether the phrase formed by (w|H) is included in the dictionary of the language model; if not, (w|H) ∉ B and s(w|H) = s_G(w|H); if so, (w|H) ∈ B and s(w|H) = C(s_G(w|H), s_B(w|H)).
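The two formulas and the dictionary test can be combined into one scoring routine. This is a hedged sketch in which the dictionary B is modelled as a set of (history, word) phrases and all parameter defaults are illustrative assumptions.

```python
import math

def target_score(w, H, s_g, s_b, s_s, dictionary, fused,
                 alpha1=0.5, beta1=0.5, alpha2=0.5, beta2=0.5):
    """s(w|H) from the first/second formulas of the target decoding graph."""
    # Second formula: use only the service score when (w|H) is not in B,
    # otherwise the fused static score C(s_G(w|H), s_B(w|H)).
    if (tuple(H), w) in dictionary:
        c = -math.log(alpha1 * math.exp(-s_g) + beta1 * math.exp(-s_b))
    else:
        c = s_g
    if not fused:  # the static decoding graph is the target decoding graph
        return c
    # First formula: log-linear fusion with the client decoding graph score.
    return -math.log(alpha2 * math.exp(-c) + beta2 * math.exp(-s_s))
```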
In some optional implementations of this embodiment, step S203, the step of decoding the speech to be recognized through the service decoding graph includes:
extracting audio features from the speech to be recognized;
converting the audio features into a sequence of phonemes by an acoustic model;
and decoding the phoneme sequence through the service decoding graph.
Specifically, after the speech to be recognized is obtained, at least one audio feature (e.g., Mel-frequency cepstral coefficients, MFCCs) is extracted from it, each audio feature is converted into a state sequence/phoneme sequence through a pre-trained acoustic model, and the phoneme sequence is then decoded through the service decoding graph.
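A front-end sketch of this step, assuming librosa for MFCC extraction and a placeholder `acoustic_model` callable that maps feature frames to a phoneme sequence; the application does not specify the actual feature extractor or model interface.

```python
import librosa

def speech_to_phonemes(wav_path, acoustic_model, sr=16000, n_mfcc=13):
    """Extract MFCC features and convert them into a phoneme sequence."""
    audio, sr = librosa.load(wav_path, sr=sr)  # the speech to be recognized
    feats = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # audio features
    return acoustic_model(feats.T)  # frames -> phoneme sequence for decoding
```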
In some optional implementations of this embodiment, after step S206 of obtaining the target decoding result, the method further includes:
extracting a new client hotword from the target decoding result, and updating the new client hotword into the client hotword list.
Specifically, after each target decoding result is obtained, word segmentation is performed on the text information in the target decoding result and new client hotwords are extracted, so that the client hotword list is updated and refined; this effectively improves the accuracy of subsequent speech recognition and the client experience.
It should be noted that, after the text information is segmented, keyword determination can be performed on the new client hotwords obtained from the segmentation: the score of each new client hotword is determined according to a preset mapping relationship, the new client hotwords that qualify as keywords are selected according to their scores, and only those are added to the client hotword list. This prevents the client hotword list from becoming overly redundant in subsequent speech recognition, improving recognition efficiency while preserving accuracy.
In some optional implementations of this embodiment, the step of updating the new client hotword into the client hotword list includes:
if no client hotword matching the new client hotword is found in the client hotword list, adding the new client hotword to the client hotword list; and
if a client hotword matching the new client hotword is found in the client hotword list, leaving that client hotword in the list unmodified.
Specifically, when no client hotword matching the new client hotword is found in the client hotword list, the list does not yet contain the new client hotword; adding it further refines the client hotword list, improves its adaptability, and effectively safeguards the accuracy of speech recognition.
When a matching client hotword is found in the client hotword list, the list already contains the new client hotword, and the list is left unmodified, as in the sketch below.
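The update policy of this subsection can be sketched as follows; `extract_keywords` is a hypothetical stand-in for the segmentation and keyword-scoring step described earlier.

```python
def update_hotword_list(hotword_list, target_text, extract_keywords):
    """Add new client hotwords that are absent; leave matched ones untouched."""
    for word in extract_keywords(target_text):
        if word not in hotword_list:   # no matching client hotword found
            hotword_list.append(word)  # add the new client hotword
        # a matched hotword is deliberately left unmodified
    return hotword_list
```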
It should be emphasized that, to further ensure the privacy and security of the static and client decoding graphs, the static and client decoding graph information may also be stored in nodes of a blockchain.
Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, each block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiments of the present application may acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by computer-readable instructions stored in a computer-readable storage medium which, when executed, may include the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to FIG. 3, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a speech recognition device, which corresponds to the method embodiment shown in FIG. 2 and is applicable to various electronic devices.
As shown in FIG. 3, the speech recognition device 300 according to this embodiment includes a decoding graph matching module 301, an acquisition module 302, a preliminary decoding module 303, a first determining module 304, a second determining module 305, and a target decoding module 306. Wherein:
The decoding graph matching module 301 is configured to match a service decoding graph and a static decoding graph corresponding to speech recognition according to a service scenario, where the static decoding graph is constructed from the service decoding graph and a base decoding graph;
the acquisition module 302 is configured to acquire a speech to be recognized and a client hotword list corresponding to the speech to be recognized;
the preliminary decoding module 303 is configured to decode the speech to be recognized through the service decoding graph to obtain a preliminary decoding result;
the first determining module 304 is configured to, when the client hotword list contains client hotwords, construct a client decoding graph from the client hotwords in the list, construct a fusion decoding graph from the client decoding graph and the static decoding graph, and take the fusion decoding graph as the target decoding graph;
the second determining module 305 is configured to take the static decoding graph as the target decoding graph if the client hotword list does not contain client hotwords; and
the target decoding module 306 is configured to decode the preliminary decoding result through the target decoding graph to obtain a target decoding result.
In the present application, the speech to be recognized is first decoded through the service decoding graph, so that the preliminary decoding result matches the client corpus of the current service scenario, improving recognition accuracy and efficiency. A target decoding graph is then determined according to whether the client hotword list contains client hotwords, so that the final target decoding result matches the client's speaking habits, further improving accuracy. Meanwhile, thanks to the base decoding graph, the client hotword list remains a lightweight word list, and whether a fusion decoding graph is built is decided by whether the list contains client hotwords, which flexibly adapts to the usage scenario and reduces the impact on recognition efficiency.
In some optional implementations of this embodiment, the device further includes a type acquisition module and a third determining module. Wherein:
the type acquisition module is configured to acquire the fusion type of the service decoding graph and the base decoding graph; and
the third determining module is configured to determine the specific expression of the static decoding graph according to the fusion type.
In some optional implementations of this embodiment, the third determining module includes a first determining sub-module and a second determining sub-module, where:
the first determining sub-module is configured to, if the fusion type is the linear fusion type, set the specific expression of the static decoding graph to C(s_G(w|H), s_B(w|H)) = α1*s_G(w|H) + β1*s_B(w|H); and
the second determining sub-module is configured to, if the fusion type is the exponential linear (log-linear) fusion type, set the specific expression of the static decoding graph to C(s_G(w|H), s_B(w|H)) = -log(α1*exp(-s_G(w|H)) + β1*exp(-s_B(w|H))).
In the specific expression of the static decoding graph, α1 and β1 are variables, s_G(w|H) is the score output by the service decoding graph based on the historical decoding state, and s_B(w|H) is the score output by the base decoding graph based on the historical decoding state.
In some alternative implementations of this embodiment, the target decoding module 306 includes a first decoding sub-module and a second decoding sub-module. Wherein:
the first decoding sub-module is configured to, if the target decoding graph is the fusion decoding graph, decode the preliminary decoding result through the first formula of the target decoding graph, s(w|H) = -log(α2*exp(-C(s_G(w|H), s_B(w|H))) + β2*exp(-s_S(w|H))), where α2 and β2 are variables and s_S(w|H) is the score output by the client decoding graph based on the historical decoding state; and
the second decoding sub-module is configured to, if the target decoding graph is the static decoding graph, decode the preliminary decoding result through the second formula of the target decoding graph, s(w|H) = s_G(w|H) if (w|H) ∉ B and s(w|H) = C(s_G(w|H), s_B(w|H)) if (w|H) ∈ B, where s_G(w|H) is the score output by the service decoding graph based on the historical decoding state.
In the first formula and the second formula of the target decoding graph, s(w|H) is the target decoding result, and C(s_G(w|H), s_B(w|H)) is the score output by the static decoding graph based on the historical decoding state.
In some optional implementations of this embodiment, the preliminary decoding module 303 includes a feature extraction sub-module, a sequence conversion sub-module, and a sequence decoding sub-module. Wherein:
the feature extraction submodule is used for extracting audio features from the voice to be recognized;
a sequence conversion sub-module for converting the audio features into a sequence of phonemes by an acoustic model;
And the sequence decoding submodule is used for decoding the phoneme sequence through the service decoding graph.
In some optional implementations of this embodiment, the device further includes a hotword updating module. Wherein:
the hotword updating module is configured to extract new client hotwords from the target decoding result and update the new client hotwords into the client hotword list.
In some optional implementations of this embodiment, the hotword updating module includes a first updating sub-module and a second updating sub-module. Wherein:
the first updating sub-module is configured to add the new client hotword to the client hotword list if no client hotword matching the new client hotword is found in the list; and
the second updating sub-module is configured to leave the matching client hotword in the client hotword list unmodified if a client hotword matching the new client hotword is found in the list.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Referring specifically to FIG. 4, FIG. 4 is a basic structural block diagram of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other via a system bus. It should be noted that only the computer device 4 with components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is typically used to store the operating system and the various kinds of application software installed on the computer device 4, such as the computer-readable instructions of the speech recognition method. Furthermore, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions of the speech recognition method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
In the present application, the speech to be recognized is first decoded through the service decoding graph, so that the preliminary decoding result matches the client corpus of the current service scenario, improving recognition accuracy and efficiency. A target decoding graph is then determined according to whether the client hotword list contains client hotwords, so that the final target decoding result matches the client's speaking habits, further improving accuracy. Meanwhile, thanks to the base decoding graph, the client hotword list remains a lightweight word list, and whether a fusion decoding graph is built is decided by whether the list contains client hotwords, which flexibly adapts to the usage scenario and reduces the impact on recognition efficiency.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the speech recognition method as described above.
In the present application, the speech to be recognized is first decoded through the service decoding graph, so that the preliminary decoding result matches the client corpus of the current service scenario, improving recognition accuracy and efficiency. A target decoding graph is then determined according to whether the client hotword list contains client hotwords, so that the final target decoding result matches the client's speaking habits, further improving accuracy. Meanwhile, thanks to the base decoding graph, the client hotword list remains a lightweight word list, and whether a fusion decoding graph is built is decided by whether the list contains client hotwords, which flexibly adapts to the usage scenario and reduces the impact on recognition efficiency.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is preferred. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
It is apparent that the embodiments described above are only some of the embodiments of the present application, not all of them, and the preferred embodiments shown in the drawings do not limit the scope of the claims. The present application may be embodied in many different forms; these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present application.