CN115101051B


Info

Publication number: CN115101051B
Application number: CN202210695916.8A
Authority: CN
Inventors: 胡劲松; 冯思铭; 杨皓晖; 连泽涛; 贺映玲
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2024-11-29
Anticipated expiration: 2042-06-20
Also published as: CN115101051A

Abstract

The invention discloses a multi-channel voice recognition device and a voice-to-text method thereof, used for recognizing voice instructions and telephone dialogue in the power dispatching process. The invention designs a single-sound-card, three-sound-source device that uses the Mic interface and the Line in interfaces of an ordinary built-in sound card to recognize three sound sources. The invention also proposes a difference frequency principle by which a difference frequency special word library for the local power department is built automatically, so that the special vocabulary of the local power department can be recognized.

Description

Multi-channel voice recognition device and voice-to-text method thereof

Technical Field

The invention relates to the technical field of voice recognition, in particular to a multichannel voice recognition device and a voice-to-text method thereof.

Background

In the power dispatching process, voice recognition technology can improve efficiency and further the automation and intelligence of dispatching. Besides recognizing the voice commands issued by the dispatcher, the system must recognize the telephone calls between the dispatcher and the field personnel and convert them into stored text, so that the dispatch log can be queried quickly and more advanced functions, such as intelligent operation tickets, can be realized. The power dispatching system therefore needs three voice recognition functions, namely dispatcher instruction recognition, dispatcher call recognition and field personnel call recognition. Of course, a dispatcher cannot issue voice commands and make a call at the same time. The method commonly used at present recognizes a single voice source through a microphone on a computer and uses a switch to change between the voice instruction and call recognition functions. This scheme has obvious defects:

1. All three sound sources are input through the microphone, and changing between them with a switch is inconvenient;

2. Because there is only one voice recognition device and one microphone, a telephone conversation can only be recognized by playing it hands-free and picking it up through the microphone; the environmental noise is then too large and the recognition rate is seriously affected;

3. Because there is only one voice recognition device, the dispatcher and the field person must make sure that when one speaks the other is silent, otherwise the speech cannot be recognized.

This is also the voice recognition scheme currently used by smart speakers, and it is obviously not suitable for a power dispatching system.

In addition, an important technical problem remains to be solved. Power dispatching generally involves a large number of power industry terms as well as the special place names, lines, power stations, specifically numbered equipment names and even personal names of each power department. Because languages contain a large number of homophones, current voice recognition technology often recognizes these rarely used special vocabularies as other common vocabularies, so the error rate is high and the requirements of the power industry are difficult to meet. The main reason is that current voice recognition technology is based on frequency priority matching: once the speech has been converted into pinyin, the pinyin is preferentially matched with the common and popular vocabularies that occur more frequently in everyday use.

Some documents propose adding special vocabulary, but there are still 3 problems:

1. The special vocabulary of a local power department does not appear in a general vocabulary library, so the power department of each area has to build its own library manually. A dispatcher has to screen the special vocabulary item by item from a large amount of local text data, and the vocabulary must also be continuously updated and replaced and its frequency recounted. This is time-consuming and laborious, and dispatching departments, with their heavy everyday workload, can hardly spare the staff.

2. The special vocabulary is added to the same word library as the general popular vocabulary. Because the special vocabulary occurs less frequently, the popular vocabulary is still given priority when the pinyin is the same.

3. Existing matching starts from the beginning of the pinyin string and converts it into characters one by one. Because of noise interference and homophones, part of the pinyin of an important special vocabulary that comes later may be combined prematurely with the preceding pinyin and converted into some other phrase, which causes errors. For example, an instruction to close the knife switch of line XXX may result in the switch of another line being closed, with serious consequences. By contrast, errors in common vocabulary are relatively acceptable.

Disclosure of Invention

The first object of the present invention is to overcome the drawbacks and deficiencies of the prior art and to provide a multi-channel voice recognition device that can accurately recognize power dispatching voice instructions and the call speech of the dispatcher and the field personnel. The key points are how to recognize three voice sources at low cost and how to recognize the special vocabulary in power dispatching, especially the special vocabulary of local power departments, such as locally unique place names, lines, power stations, specific equipment names and even personal names.

A second object of the present invention is to provide a voice-to-text method for a multi-channel voice recognition device.

For consistency, the terms used in the invention are defined as follows. Vocabulary refers to Chinese phrases, and one Chinese phrase comprises at least 2 Chinese characters. Local special vocabulary refers to vocabulary used only on the local machine, in the local area network, or in a specific region, group or department. Local special vocabulary and professional terms are collectively called special vocabulary, and all other vocabulary is called universal vocabulary. Word frequency refers to the frequency of occurrence of a vocabulary, and difference frequency refers to the difference between the frequencies of a vocabulary. Matching means calculating the similarity between part of the pinyin of a pinyin string A and the correct pinyin of a certain Chinese phrase or character; in the invention it is also referred to, for short, as matching the pinyin with the phrase or character.

The first object of the invention is realized by the following technical scheme that the multichannel voice recognition device comprises a telephone monitor, a sound card, a voice function switching unit, a first voice-to-word unit and a second voice-to-word unit;

The telephone monitor and the telephone of the dispatcher are connected in parallel with the same telephone line to acquire 2 paths of analog voice signals of the communication between the dispatcher and the field personnel;

The sound card comprises a first Line in interface, a second Line in interface and a Mic interface, wherein the three input interfaces respectively and correspondingly receive three analog voice signals of dispatcher call voice, field personnel call voice and dispatcher voice instruction, the three analog voice signals are converted into digital signals through an analog/digital circuit of the sound card, the digital signals of the dispatcher call voice and the digital signals of the dispatcher voice instruction are output to a voice function switching unit, and the digital signals of the field personnel call voice are output to a second voice to word conversion unit;

The voice function switching unit is responsible for switching between the two digital signals, the dispatcher call voice and the dispatcher voice instruction, so that the first voice-to-text unit recognizes only one digital signal at a time;

The first voice-to-text unit receives a digital signal of a dispatcher call or a digital signal of a dispatcher voice instruction and recognizes the digital signal as corresponding text;

The second voice-to-text unit receives digital signals of the voice of the field personnel and recognizes the digital signals as corresponding text.

Preferably, the sound card, the voice function switching unit, the first voice word conversion unit and the second voice word conversion unit are all built in the same computer, and the first voice word conversion unit and the second voice word conversion unit are respectively realized by two cores of a CPU of the computer in parallel.

Preferably, the multi-channel voice recognition device further comprises:

The difference frequency special word library unit is used for storing graded special vocabulary and its pinyin for query by the two voice-to-text units, so as to improve the matching accuracy of the special vocabulary. The level of a vocabulary is determined by the difference between its two frequencies: the higher its frequency in the special data and the lower its frequency in the general data, the higher its level. Vocabulary refers to Chinese phrases, and one Chinese phrase comprises at least 2 Chinese characters. Special vocabulary comprises local special vocabulary and professional terms, where local special vocabulary refers to vocabulary used only on the local machine, in the local area network, or in a specific region, group or department. Special vocabulary of the same level is stored in the same sub-library; the highest sub-library is the first-level sub-library, followed in order by the second level down to the lowest level. The vocabulary stored in the difference frequency special word library unit is called difference frequency special vocabulary, or difference frequency vocabulary for short.

Preferably, the word stock unit dedicated to the difference frequency includes: the first, second, third and fourth-level sub-library modules are used for storing first, second, third and fourth-level difference frequency vocabularies and difference frequency values thereof, and vocabularies with higher difference frequency values in the same-level sub-library are more front in sub-library queuing;

the first, second, third and fourth level sub-library modules are obtained and updated by a construction unit comprising:

The text data acquisition module is used for acquiring text data comprising local power dispatching logs, work tickets, equipment records and call texts, and for searching for power industry academic articles on the network, wherein the call texts are obtained by the first voice-to-text unit and the second voice-to-text unit and are continuously provided to the text data acquisition module;

The special word frequency dictionary module is used for cleaning and word-segmenting the collected text data to obtain a vocabulary list, and then computing and storing the special word frequency of each vocabulary in the list, wherein special word frequency = (number of times the word is repeated × length of the word) / total word count of all the data;

The universal word frequency dictionary module is used for word-segmenting general data comprising the People's Daily corpus and news from the Sina, Sohu and NetEase websites to obtain a vocabulary list, and then computing and storing the universal word frequency of each vocabulary in the list, wherein universal word frequency = (number of times the word is repeated × length of the word) / total word count of all the data;

The difference frequency operation module is used for performing difference frequency operation on each vocabulary of the special word frequency dictionary, wherein the difference frequency operation is as follows:

difference frequency value = special word frequency of a word - k × its universal word frequency, where k is a fixed coefficient;

The difference frequency distribution module is used for storing the top 25% of the vocabulary ranked by difference frequency value into the first-level sub-library module, the vocabulary ranked from 26% to 50% into the second-level sub-library module, the vocabulary ranked from 51% to 75% into the third-level sub-library module, and the remaining vocabulary whose difference frequency value is greater than 0 into the fourth-level sub-library module; vocabulary whose difference frequency value is less than or equal to 0 is not stored in any sub-library.
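The two word frequency formulas and the difference frequency operation can be illustrated with a minimal Python sketch. It assumes the text has already been segmented into word lists, takes the "total word count" to be the total number of characters, and treats a word missing from the general data as having a universal frequency of 0. The sample words, the function names and the value k = 1.0 are hypothetical and chosen only for illustration.

```python
from collections import Counter


def word_frequencies(words):
    """Frequency = (repetitions x word length) / total character count (assumed reading)."""
    total_chars = sum(len(w) for w in words)
    counts = Counter(words)
    return {w: c * len(w) / total_chars for w, c in counts.items()}


def difference_values(special_freq, general_freq, k=1.0):
    """Difference value = special frequency - k x universal frequency."""
    # Assumption: a word absent from the general data has universal frequency 0.
    return {w: f - k * general_freq.get(w, 0.0) for w, f in special_freq.items()}


# Hypothetical segmented word lists standing in for the real corpora.
special_words = ["投入", "甲乙线", "开关", "甲乙线", "开关", "投入"]
general_words = ["投入", "市场", "投入", "开关", "市场", "市场"]

special = word_frequencies(special_words)
general = word_frequencies(general_words)
diff = difference_values(special, general, k=1.0)

for word, value in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(value, 4))
```

In this toy data the hypothetical line name appears only in the special data and therefore receives the largest difference value, while a word that is common in the general data ends up below 0.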

Preferably, the first voice-to-text unit and the second voice-to-text unit are the same, and each of the first voice-to-text unit and the second voice-to-text unit comprises the following modules:

The level priority matching module: after the speech is converted into pinyin, a pinyin string consisting of letters and tones is obtained and named A. In converting A into text, A is matched first against the pinyin of the vocabulary stored in the first-level sub-library module of the difference frequency special word library unit. If a match succeeds, the matched part of A is converted into text; if it does not, the next level is considered, and so on down to the last-level sub-library module;

And the frequency priority matching module is used for matching the remaining pinyin with the pinyin of the universal vocabulary after the level priority matching module finishes matching, the non-special vocabulary with high frequency in the universal data is preferentially matched, and finally the remaining pinyin is matched with the pinyin of the single Chinese character.

Preferably, the level priority matching module includes:

The reverse word taking module is used for taking, from the vocabulary in the first-level sub-library module that has not yet been matched, the pinyin of the vocabulary with the highest difference frequency value and naming it B; once all the vocabulary in the first-level sub-library module has been matched, the process continues with the next-level sub-library module;

The any-position pinyin conversion module searches A for a substring C similar to B, and if B and C are matched successfully, C is converted into the corresponding Chinese phrase. If A contains several substrings similar to B, the operation is repeated for each of them. The substring C may be located at any position in A.
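The interplay of reverse word taking and any-position conversion can be sketched as follows. This is a simplified illustration, not the claimed implementation: it represents A as a list of toneless pinyin syllables, uses exact equality in place of the similarity-based matching, and assumes each sub-library is already sorted by descending difference frequency value. The function name, the token format and the sample vocabulary are hypothetical.

```python
def level_priority_match(a_syllables, sub_libraries):
    """Convert special-vocabulary pinyin found anywhere in A, highest level first."""
    # Each token is ("py", syllable) while unconverted, or ("txt", word) once converted.
    tokens = [("py", s) for s in a_syllables]
    for level in sub_libraries:              # first-level sub-library first
        for word, b in level:                # highest difference value first
            i = 0
            while i + len(b) <= len(tokens):
                window = tokens[i:i + len(b)]
                still_pinyin = all(kind == "py" for kind, _ in window)
                # Assumption: exact equality stands in for "C is similar to B".
                if still_pinyin and [s for _, s in window] == b:
                    tokens[i:i + len(b)] = [("txt", word)]
                i += 1
    return tokens


# Hypothetical two-level library: (word, pinyin syllables), sorted within each level.
libraries = [
    [("甲乙线", ["jia", "yi", "xian"])],
    [("开关", ["kai", "guan"])],
]
a = ["he", "shang", "jia", "yi", "xian", "kai", "guan"]
result = level_priority_match(a, libraries)
print(result)
```

The syllables left as `("py", ...)` tokens are the remaining pinyin that would be handed on to frequency priority matching.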

Preferably, the multi-channel voice recognition device further comprises a subject word sharing unit, which is used for extracting subject words from the existing dialogue text of the dispatcher and the field personnel and providing them for query by the first voice-to-text unit and the second voice-to-text unit, so as to improve the recognition rate of the subsequent dialogue. The subject word sharing unit comprises the following modules:

The subject word determining module is used for counting the repeated vocabulary, and the number of repetitions, in the text that the first voice-to-text unit and the second voice-to-text unit have obtained from the existing dialogue speech; if a repeated vocabulary is a difference frequency special vocabulary, it is added to the subject word queue, otherwise it is discarded;

The subject word queue ordering module: suppose n dialogue sentences have been recognized from the start of the current voice recognition up to the sentence currently to be recognized, which is therefore the (n+1)-th sentence; the topic value of a repeated vocabulary is as follows:

where i and j indicate that the vocabulary is repeated in the i-th and j-th sentences, the ellipsis represents the other sentences in which it is repeated, i, j < n, and G is the level of the sub-library of the difference frequency special word library to which the vocabulary belongs, an integer from 1 to 4. The topic values of all the subject words in the first n sentences are calculated, and the subject words are queued in descending order of topic value to obtain the subject word queue.
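The subject word determining step can be sketched as below. The sketch stops short of computing the topic value and only gathers the quantities that value depends on, namely the sentence numbers in which a word recurs and its level G. It assumes that "repeated" means the word occurs in at least two recognized sentences; the function name, the data layout and the sample words are hypothetical.

```python
from collections import defaultdict


def subject_word_candidates(recognized_sentences, special_levels):
    """Collect repeated difference frequency special words from the dialogue so far.

    recognized_sentences: list of word lists, one per recognized sentence.
    special_levels: mapping word -> sub-library level G (1 to 4).
    """
    occurrences = defaultdict(list)
    for index, words in enumerate(recognized_sentences, start=1):
        for word in set(words):
            if word in special_levels:       # non-special vocabulary is discarded
                occurrences[word].append(index)
    # Assumption: a word is "repeated" if it occurs in at least two sentences.
    return {
        word: {"sentences": sentences, "level": special_levels[word]}
        for word, sentences in occurrences.items()
        if len(sentences) >= 2
    }


# Hypothetical dialogue of three recognized sentences.
dialogue = [["检查", "甲乙线"], ["甲乙线", "开关"], ["开关", "检查"]]
levels = {"甲乙线": 1, "开关": 3}
candidates = subject_word_candidates(dialogue, levels)
print(candidates)
```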

Preferably, the first voice-to-text unit and the second voice-to-text unit further comprise a subject word matching module, which performs subject word matching before the level priority matching module. A is matched against the subject word queue, starting from the first subject word in the queue. If a match succeeds, the matched part of the pinyin of A is converted into text; if it does not, the next subject word is considered, and so on until the last subject word in the queue. The subject word matching module is started only when a telephone conversation between the dispatcher and the field personnel is being recognized.

Preferably, the matching is implemented by the following two modules, including:

The phoneme edit distance calculating module is used for calculating the phoneme edit distance, which is the minimum number of phoneme editing operations required to convert one of two pinyin strings into the other. A phoneme is an initial or a final of the pinyin. The allowed editing operations are inserting an initial or final, deleting an initial or final, and replacing one initial or final with another; a replacement between fuzzy sounds counts as only 0.5 of an operation. Tones are not considered in these operations;

And the judging and outputting module is used for outputting the phoneme editing distance and a matching success signal when the phoneme editing distance is smaller than a given threshold value if the matched vocabulary is the special vocabulary, or outputting a matching failure signal if the matched vocabulary is the general vocabulary.
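A minimal sketch of the phoneme edit distance follows, using the standard dynamic-programming edit distance over sequences of initials and finals. The list of fuzzy sound pairs is an assumption (commonly confused pairs such as z/zh and an/ang), since the text does not enumerate them, and treating y and w as initials is a simplification. All names are hypothetical.

```python
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Assumed fuzzy sound pairs; the actual set is not specified in the text.
FUZZY = {frozenset(p) for p in [("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l"),
                                ("an", "ang"), ("en", "eng"), ("in", "ing")]}


def to_phonemes(syllables):
    """Split toneless pinyin syllables into initials and finals."""
    phonemes = []
    for syl in syllables:
        initial = next((i for i in INITIALS if syl.startswith(i)), "")
        final = syl[len(initial):]
        phonemes.extend(p for p in (initial, final) if p)
    return phonemes


def substitution_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in FUZZY else 1.0


def phoneme_edit_distance(syllables_a, syllables_b):
    """Minimum cost of insertions, deletions and substitutions between phoneme sequences."""
    p, q = to_phonemes(syllables_a), to_phonemes(syllables_b)
    prev = [float(j) for j in range(len(q) + 1)]
    for i, x in enumerate(p, start=1):
        cur = [float(i)]
        for j, y in enumerate(q, start=1):
            cur.append(min(prev[j] + 1.0,                      # delete x
                           cur[j - 1] + 1.0,                   # insert y
                           prev[j - 1] + substitution_cost(x, y)))
        prev = cur
    return prev[-1]


print(phoneme_edit_distance(["zhan"], ["zan"]))
print(phoneme_edit_distance(["kai"], ["kai", "guan"]))
```

A match against a special vocabulary would then be judged by comparing the returned distance with a chosen threshold.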

The second object of the invention is realized by the following technical scheme that a voice-to-text method of a multi-channel voice recognition device comprises the following steps:

S1, converting voice into pinyin, namely analyzing and recognizing the digitized voice signal to obtain the whole-sentence pinyin A corresponding to the voice;

S2, judging whether the current input is a voice instruction or a telephone conversation; if it is a telephone conversation, entering S3, otherwise entering S4;

S3, subject term matching;

S4, level priority matching;

S5, frequency priority matching;

S6, matching the remaining pinyin with single Chinese characters to obtain the whole-sentence text;

S7, outputting whole sentence characters, and outputting various vocabulary classifications obtained by matching to a subject word sharing unit, a differential frequency special word stock and a general word frequency dictionary so as to refresh a subject word queue, vocabulary frequency, differential frequency value and sequencing;

S8, if voice is continuously input, turning to S1, otherwise, the next step is performed;

S9, ending.
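The control flow of steps S2 to S9 can be sketched as a small Python driver. It assumes that S1 (voice to pinyin) has already produced the pinyin sentences, and that subject word matching (S3) applies only to telephone conversations, as stated for the subject word matching module. The matching stages are passed in as placeholder callables; the function name, the stage names and the tracing stubs are hypothetical.

```python
def transcribe(pinyin_sentences, is_phone_call, stages):
    """Run S2 to S7 for each sentence; `stages` maps stage names to callables."""
    results = []
    for a in pinyin_sentences:                    # S8: repeat while voice keeps arriving
        state = a
        if is_phone_call:                         # S2: instruction or telephone call?
            state = stages["subject"](state)      # S3: subject word matching
        state = stages["level"](state)            # S4: level priority matching
        state = stages["frequency"](state)        # S5: frequency priority matching
        state = stages["single_char"](state)      # S6: single-character matching
        results.append(state)                     # S7: output the whole sentence
    return results                                # S9: end


def tracing_stage(name, trace):
    """Placeholder stage that records its name and passes the state through."""
    def stage(state):
        trace.append(name)
        return state
    return stage


trace = []
stages = {name: tracing_stage(name, trace)
          for name in ("subject", "level", "frequency", "single_char")}
transcribe([["kai", "guan"]], is_phone_call=False, stages=stages)
print(trace)
```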

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. 3 voice recognition functions required by intelligent power dispatching, namely dispatcher call voice recognition, field personnel call voice recognition and dispatcher voice instruction operation can be realized on a common desktop computer, a plurality of voice recognition devices are not required, and the cost is saved.

2. Noise interference between dispatcher call voice and field personnel call voice is avoided, so that the recognition rate is improved.

3. The three voice recognition functions do not need to be manually frequently switched.

4. The invention can automatically distinguish the universal vocabulary and the special vocabulary in the power dispatching field, especially those of local power departments, so that the manual database establishment of the power departments in each region is not required, and the system stores the special vocabulary into a hierarchical difference frequency special word database and continuously refreshes, updates and replaces the special vocabulary, thereby saving a great deal of time and energy for the dispatcher.

5. The level priority matching method of the invention takes the special vocabulary of power dispatching as the key priority matching, reduces the errors caused by the prior voice recognition method because of the priority of popular universal vocabulary, thereby improving the voice recognition accuracy in power dispatching, reducing the errors of dispatching instruction recognition, reducing power dispatching accidents, improving dispatching efficiency and better realizing dispatching intellectualization.

Drawings

Fig. 1 is a block diagram of a multi-channel speech recognition device.

Fig. 2 is a flow chart of a voice to text process.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.

Referring to fig. 1, the present embodiment discloses a multi-channel voice recognition apparatus, including:

The system comprises a telephone monitor, a first impedance matching device K1, a second impedance matching device K2, a sound card, a voice function switching unit, a first voice-to-word unit M1, a second voice-to-word unit M2, a word stock unit special for difference frequency, a construction unit and a subject word sharing unit.

The telephone monitor and the telephone of the dispatcher are connected in parallel with the same telephone line, 2 paths of analog voice signals of the dispatcher and the field personnel are obtained, and the signals are respectively and correspondingly output to the first impedance matching device K1 and the second impedance matching device K2.

The impedance of the first impedance matching device K1 and the second impedance matching device K2 can be adjusted, so that the intensity of an input analog voice signal changes to adapt to the signal intensity requirement of a Line in interface of the sound card, and the first impedance matching device K1 and the second impedance matching device K2 respectively output to the first Line in interface 1 and the second Line in interface 2 of the sound card correspondingly. Of course, if the strength of the analog voice signal is just within the range of the sound card, no impedance matching means may be used.

The sound card comprises a first Line in interface 1, a second Line in interface 2 and a Mic interface, wherein the three input interfaces respectively receive three analog signals of dispatcher call voice, field personnel call voice and dispatcher voice instruction, and the three analog signals are converted into 3 digital voice signals through 3 analog/digital circuits of the sound card, wherein the digital signals of the field personnel voice are output to a second voice word conversion unit M2, and the other two digital signals are output to a voice function switching unit.

The voice function switching unit is responsible for switching between the two digital voice signals, the dispatcher call voice and the dispatcher voice instruction, so that the first voice-to-text unit M1 recognizes only one voice signal at a time. Under normal conditions, the voice function switching unit by default connects the analog/digital circuit behind the Mic interface, so that the digital signal of the instruction voice is passed to the first voice-to-text unit M1. When the dedicated dispatching telephone rings, its special ring tone is picked up by the microphone as a trigger sound signal and enters the sound card through the Mic interface. The voice function switching unit compares it with the stored spectrum waveform of the ring tone, and once the special ring tone is confirmed, the unit switches to the analog/digital circuit behind the first Line in interface 1, so that the digital signal of the dispatcher call voice is input to the first voice-to-text unit M1. When the call is hung up, the hang-up signal triggers the voice function switching unit, which switches back to the analog/digital circuit behind the Mic interface.
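The switching behaviour amounts to a two-state machine, which the following sketch illustrates. Ring detection by spectrum comparison is not modelled; the sketch assumes that some detector calls `on_ring_detected` and `on_hang_up` at the right moments. The class, method and enum names are hypothetical.

```python
from enum import Enum


class Source(Enum):
    MIC = "mic"                # dispatcher voice instruction
    LINE_IN_1 = "line_in_1"    # dispatcher call voice


class VoiceFunctionSwitch:
    """Routes exactly one of the two dispatcher signals to unit M1."""

    def __init__(self):
        self.active = Source.MIC           # default: instruction voice from the Mic interface

    def on_ring_detected(self):
        # Assumed to be called once the special ring tone has been confirmed.
        self.active = Source.LINE_IN_1

    def on_hang_up(self):
        self.active = Source.MIC

    def route(self, mic_frame, line_in_1_frame):
        """Return the frame that should be passed on to M1."""
        return mic_frame if self.active is Source.MIC else line_in_1_frame


switch = VoiceFunctionSwitch()
print(switch.route("instruction", "call"))   # Mic is active by default
switch.on_ring_detected()
print(switch.route("instruction", "call"))   # Line in 1 is active during the call
switch.on_hang_up()
```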

The first voice-to-text unit M1 receives the digital voice signal of a dispatcher call or of a dispatcher voice instruction, recognizes it as the corresponding text and outputs the text. The text serves as input to the construction unit for updating the difference frequency vocabulary and difference frequency values, and as input to the subject word sharing unit for extracting the subject words of the telephone conversation between the dispatcher and the field personnel.

The second voice-to-text unit M2 receives the digitized signal of the call voice of the field personnel, recognizes it as the corresponding text and outputs the text. The text serves as input to the construction unit for updating the difference frequency vocabulary and difference frequency values, and as input to the subject word sharing unit for extracting the subject words in the call text of the dispatcher and the field personnel. Because M1 and M2 share the difference frequency special word library unit and the subject word sharing unit, they are not independent processes; they interact with and complement each other, which improves the recognition accuracy. M1 and M2 are described in further detail below.

The sound card, the voice function switching unit, the first voice word conversion unit M1 and the second voice word conversion unit M2 are all in the same computer and are realized by hardware and software of the computer, and the first voice word conversion unit M1 and the second voice word conversion unit M2 are respectively realized by two cores of a CPU of the computer executing two processes in parallel.

In power dispatching, special vocabulary is more important than common vocabulary, so its recognition rate must be guaranteed first. A special vocabulary library is therefore built, and when the matching similarities are close, higher-level special vocabulary is matched preferentially.

The difference frequency special word library unit is used for storing graded special vocabulary and its pinyin for query by the two voice-to-text units, so as to improve the matching accuracy of the special vocabulary. The level of a vocabulary is determined by its difference frequency value. Vocabulary refers to Chinese phrases, and one Chinese phrase comprises at least 2 Chinese characters. Special vocabulary comprises local special vocabulary and professional terms, where local special vocabulary refers to vocabulary used only on the local machine, in the local area network, or in a specific region, group or department. Special vocabulary of the same level is stored in the same sub-library. The highest sub-library is the first-level sub-library, followed in order by the second-level to fourth-level sub-libraries; the four sub-libraries store the first-, second-, third- and fourth-level difference frequency vocabulary and its difference frequency values, and within a sub-library, vocabulary with a higher difference frequency value is queued nearer the front.

In addition, the device can build the difference frequency special vocabulary library automatically by program. To distinguish special vocabulary from ordinary vocabulary automatically, the difference between them must be exploited. Take the instruction voice "input Yue Tang, Yue Gang, 35, 36" as an example. Special vocabulary, especially local special vocabulary such as "Yue Gang" in this instruction, does not generally appear in ordinary news or articles, but it does appear in local power dispatch logs, work tickets, equipment records and local call text records. By contrast, general vocabulary such as "input" appears frequently in ordinary articles and web texts, while a word such as "switch" may appear in local texts, electric power academic articles and news reports alike. This patent therefore proposes that the level of a vocabulary be determined by the difference between its two frequencies: the higher its frequency in the special data and the lower its frequency in the general data, the higher its level.

The construction unit is used for automatically constructing a word bank special for the difference frequency and updating words and difference frequency values in the word bank special for the difference frequency, and comprises the following steps:

1) The text data acquisition module is used for acquiring text data comprising local power dispatching logs, work tickets, equipment records and call texts, searching power professional academic articles on a network, wherein the call texts are obtained by the first voice text conversion unit M1 and the second voice text conversion unit M2 and are continuously provided for the text data acquisition module;

2) The special word frequency dictionary module is used for cleaning and word-segmenting the collected text data to obtain a vocabulary list, and then computing and storing the special word frequency of each vocabulary in the list, wherein special word frequency = (number of times the word is repeated × length of the word) / total word count of all the data;

3) The universal word frequency dictionary module is used for word-segmenting general data comprising the People's Daily corpus and news from the Sina, Sohu and NetEase websites to obtain a vocabulary list, and then computing and storing the universal word frequency of each vocabulary in the list, wherein universal word frequency = (number of times the word is repeated × length of the word) / total word count of all the data;

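4) The difference frequency operation module is used for performing the difference frequency operation on each vocabulary of the special word frequency dictionary, wherein the difference frequency operation is as follows:

difference frequency value = special word frequency of a word - k × its universal word frequency, where k is a fixed coefficient;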

5) The difference frequency distribution module is used for storing the top 25% of the vocabulary ranked by difference frequency value into the first-level sub-library module, the vocabulary ranked from 26% to 50% into the second-level sub-library module, the vocabulary ranked from 51% to 75% into the third-level sub-library module, and the remaining vocabulary whose difference frequency value is greater than 0 into the fourth-level sub-library module; vocabulary whose difference frequency value is less than or equal to 0 is not stored in any sub-library.
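The distribution into four sub-libraries can be sketched as follows. The sketch assumes that the ranking percentages are taken over all words of the special word frequency dictionary and that words with a value of 0 or less are skipped, consistent with the fourth level accepting only positive values. The function name and the sample values are hypothetical.

```python
def assign_levels(diff_values):
    """Rank words by difference frequency value and split them into four levels."""
    ranked = sorted(diff_values.items(), key=lambda kv: kv[1], reverse=True)
    n = len(ranked)
    levels = {1: [], 2: [], 3: [], 4: []}
    for rank, (word, value) in enumerate(ranked, start=1):
        if value <= 0:
            continue                       # assumption: non-positive values are left out
        if rank * 4 <= n:                  # top 25%
            level = 1
        elif rank * 2 <= n:                # 26% to 50%
            level = 2
        elif rank * 4 <= 3 * n:            # 51% to 75%
            level = 3
        else:                              # remaining words with a positive value
            level = 4
        levels[level].append((word, value))
    return levels


# Hypothetical difference frequency values for eight placeholder words.
sample = {"w1": 0.9, "w2": 0.8, "w3": 0.7, "w4": 0.6,
          "w5": 0.5, "w6": 0.4, "w7": 0.3, "w8": -0.1}
levels = assign_levels(sample)
print(levels)
```

Because the list is sorted before it is split, the words inside each level are already ordered from the highest to the lowest difference frequency value, which is the queue order the sub-libraries call for.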

When a dispatcher and a field maintenance worker talk by voice, there is often loud construction noise in the background, which severely reduces the accuracy of speech recognition. In a noisy environment some words cannot be heard clearly. People can often guess such words from the context of the conversation, but current speech recognition algorithms only consider recognizing single sentences and cannot exploit the consistent topic semantics of the conversation context, which is a weakness of these algorithms. A preferred scheme is to add subject word matching before level priority matching, so that the topic of the dialogue is established and the recognition rate of the whole dialogue is improved.

The subject word sharing unit extracts subject words from the existing dialogue text of the dispatcher and the field personnel and makes them available for query by the first speech-to-text unit M1 and the second speech-to-text unit M2, so as to improve the recognition rate of the subsequent dialogue. It comprises the following modules:

1) The subject word determining module counts the repeated words and their repetition counts in the text that the first speech-to-text unit M1 and the second speech-to-text unit M2 have obtained by converting the existing dialogue speech. If a repeated word is a difference frequency special word, it is added to the subject word queue; otherwise it is discarded;

2) The subject word queue ordering module: suppose that n dialogue sentences have been recognized from the start of the current speech recognition up to the sentence currently to be recognized, which is therefore sentence n+1. The topic value of a repeated word is:

where i and j indicate that the word is repeated in the i-th and j-th sentences, the ellipsis stands for the other sentences in which the word is repeated, i, j < n, and G is the level of the sub-library of the difference frequency special word library to which the word belongs, an integer from 1 to 4. The topic values of all subject words in the first n sentences are calculated, and the subject words are queued in descending order of topic value to obtain the subject word queue;
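The two modules above can be sketched as follows, as a hedged illustration rather than the patented implementation. The name `build_subject_queue` and its parameters are hypothetical. A word is treated as repeated when it occurs at least twice in the recognized sentences, which is an assumption. The topic value formula is not reproduced in this text, so the sketch accepts it as a caller-supplied function `topic_value(positions, level, n)`; the scoring used in the example is only a placeholder and is not the formula of the invention.

```python
from collections import defaultdict


def build_subject_queue(sentences, special_levels, topic_value):
    """Return subject words ordered by descending topic value.

    sentences:      list of segmented sentences (lists of words), sentences 1..n
    special_levels: dict word -> sub-library level G (1 to 4) for special words
    topic_value:    caller-supplied function (positions, level, n) -> number;
                    the real formula is not given in this text
    """
    positions = defaultdict(list)  # word -> sentence numbers where it occurs
    for number, words in enumerate(sentences, start=1):
        for word in words:
            positions[word].append(number)

    n = len(sentences)
    scored = []
    for word, where in positions.items():
        if len(where) < 2:
            continue  # not a repeated word
        if word not in special_levels:
            continue  # repeated, but not a special word: discard
        scored.append((topic_value(where, special_levels[word], n), word))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored]


# Placeholder scoring for demonstration only (NOT the formula of the invention):
# it simply counts how often the word has occurred so far.
def placeholder_topic_value(positions, level, n):
    return len(positions)
```

Passing the scoring function in as a parameter keeps the counting and filtering logic separate from the formula, so the actual topic value can be substituted without changing the rest of the sketch.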

The first speech-to-text unit M1 is identical to the second speech-to-text unit M2; each includes the following modules:

1) The subject word matching module: the pinyin string consisting of letters and tones obtained after speech-to-pinyin conversion is named A. Before the level priority matching module converts pinyin into characters, this module matches A against the subject word queue, starting from the first subject word of the queue. When a match succeeds, the matched part of the pinyin of A is converted into characters; when a match fails, the next subject word is considered, until the last subject word of the queue.

2) The level priority matching module comprises a reverse word-taking module and an any-position pinyin conversion module. The reverse word-taking module takes, from the words in the first-level sub-library module that have not yet been matched, the pinyin of the word with the highest difference frequency value and names it B; once the words in the first-level sub-library module have all been matched, it moves on to the next-level sub-library module. The any-position pinyin conversion module searches A for a substring C similar to B; if B and C match successfully, C is converted into the corresponding Chinese phrase. If A contains several substrings similar to B, the operation is repeated; the substring C may be located at any position of A.

3) The frequency priority matching module matches the remaining pinyin against the pinyin of general words after the level priority matching module has finished, giving priority to the non-special words with a high frequency in the general data; finally, any pinyin still remaining is matched against the pinyin of single Chinese characters.
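Reverse word-taking and any-position conversion can be illustrated with the following minimal sketch. The names `strip_tones` and `level_priority_match` are hypothetical, the English glosses stand in for the Chinese words, and tones are dropped. The default comparison is exact equality of syllables, which is a simplification; a similarity test such as a phoneme edit distance check can be passed in as `is_match` instead.

```python
def strip_tones(pinyin):
    """Split a pinyin string into syllables and drop the trailing tone digits."""
    return [syllable.rstrip("012345") for syllable in pinyin.split()]


def level_priority_match(syllables, tiers, is_match=lambda b, c: list(b) == list(c)):
    """Convert special words in pinyin string A, taking words from the lexicon.

    syllables: toneless syllables of A
    tiers:     {level: [(word, syllable list B), ...]}, each list in descending
               order of difference frequency value
    is_match:  similarity test between B and a candidate substring C
               (exact equality here, as a simplification)
    Returns a list in which unconverted syllables are str and converted
    words are wrapped in a 1-tuple.
    """
    tokens = list(syllables)
    for level in sorted(tiers):                # level 1 first, then 2, 3, 4
        for word, b in tiers[level]:           # reverse word-taking: lexicon drives the search
            i = 0
            while i + len(b) <= len(tokens):   # C may start at any position of A
                c = tokens[i:i + len(b)]
                if all(isinstance(t, str) for t in c) and is_match(b, c):
                    tokens[i:i + len(b)] = [(word,)]
                i += 1
    return tokens


# Example built from the command sentence used in this description.
A = strip_tones(
    "tou2 ru4 yue4 tang2 zhan4 yue4 gang1 xiang1 shi2 xian4 san1 wu3 "
    "jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2"
)
TIERS = {
    1: [("Yue Tang station", ["yue", "tang", "zhan"])],
    2: [("Yue Gang Xiang Shi line", ["yue", "gang", "xiang", "shi", "xian"])],
    3: [("grounding switch", ["jie", "di", "dao", "zha"])],
}
RESULT = level_priority_match(A, TIERS)
```

In this example both occurrences of the grounding switch are converted, although neither is at the start of A, and the syllables tou, ru, san, wu, he, and liu are left for the frequency priority matching module.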

The matching used in the speech-to-text units is performed by a matching module. Pinyin can be matched to words and characters by well-known methods; the invention provides a preferred matching scheme implemented by the following two modules:

1) The phoneme edit distance calculating module: the phoneme edit distance between two pinyin strings is the minimum number of phoneme editing operations needed to convert one into the other, where a phoneme is an initial or a final of the pinyin. The allowed editing operations are inserting an initial/final, deleting an initial/final, and replacing one initial/final with another; replacing a phoneme with its fuzzy sound counts as only 0.5 operations. For example, suppose the speaker's Mandarin is not standard and Yue Tang station, 'yue4 tang2 zhan4', is spoken as 'yue4 tan2 zhan4'. The correct pinyin is obtained by replacing one final, an with ang, and since an and ang are fuzzy sounds of each other, the phoneme edit distance is 0.5.

2) The judging and outputting module: if the matched word is a special word, it outputs the phoneme edit distance and a match-success signal when the phoneme edit distance is smaller than a given threshold; or, if the matched word is a general word, it outputs a match-failure signal.

The tone of the pinyin is not considered here: Chinese has many dialects, pronunciation differs greatly from place to place, many people have difficulty distinguishing tones, and tones are also affected by changes in intonation and mood.
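A minimal sketch of the phoneme edit distance described above is given here for illustration. The names `phonemes`, `phoneme_edit_distance`, and `judge` are hypothetical. Only the an/ang pair is taken from the text; the other fuzzy pairs in `FUZZY` are assumptions added for the example, and treating y and w as initials is a simplification. The `judge` function covers only the threshold comparison.

```python
# Initials, with the two-letter ones first so that "zh" wins over "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Fuzzy sound pairs. Only ("an", "ang") comes from the text; the rest are assumed.
FUZZY = {frozenset(pair) for pair in [
    ("an", "ang"), ("en", "eng"), ("in", "ing"),
    ("z", "zh"), ("c", "ch"), ("s", "sh"),
]}


def phonemes(pinyin):
    """Split a pinyin string into initials and finals, ignoring tone digits."""
    result = []
    for syllable in pinyin.split():
        syllable = syllable.lower().rstrip("012345")
        initial = next((i for i in INITIALS if syllable.startswith(i)), "")
        if initial:
            result.append(initial)
        final = syllable[len(initial):]
        if final:
            result.append(final)
    return result


def substitution_cost(a, b):
    """0 for identical phonemes, 0.5 for a fuzzy pair, 1 otherwise."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in FUZZY else 1.0


def phoneme_edit_distance(pinyin_1, pinyin_2):
    """Weighted edit distance over phonemes (insert = delete = 1)."""
    a, b = phonemes(pinyin_1), phonemes(pinyin_2)
    previous = [float(j) for j in range(len(b) + 1)]
    for i, pa in enumerate(a, start=1):
        current = [float(i)]
        for j, pb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1.0,                           # delete pa
                current[j - 1] + 1.0,                        # insert pb
                previous[j - 1] + substitution_cost(pa, pb)  # replace or keep
            ))
        previous = current
    return previous[-1]


def judge(candidate, target, threshold):
    """Return (success, distance); success when distance is below the threshold."""
    distance = phoneme_edit_distance(candidate, target)
    return distance < threshold, distance
```

With these definitions, `phoneme_edit_distance("yue4 tang2 zhan4", "yue4 tan2 zhan4")` evaluates to 0.5, in agreement with the Yue Tang station example.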

As shown in fig. 2, the method for converting voice into text in the multi-channel voice recognition device according to the present embodiment includes the following steps:

S1, converting speech into pinyin. The digitized speech signal is analyzed and recognized with a well-known deep learning speech recognition algorithm to obtain the whole-sentence pinyin corresponding to the speech. For example, when the user's command speech is 'input Yue Tang station Yue Gang Xiang Shi line 35 grounding switch and 36 grounding switch', the conversion in this step yields [tou2 ru4 yue4 tang2 zhan4 yue4 gang1 xiang1 shi2 xian4 san1 wu3 jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2], which is called pinyin string A;

S2, judging the dialogue. According to the state of the voice function switching unit, judge whether the current input is a voice command or a telephone conversation; if it is a telephone conversation between the dispatcher and the field personnel, go to S3, otherwise go to S4;

S3, subject word matching. The subject word matching module queries the subject word sharing unit and matches A against the subject word queue in that unit, starting from the first subject word of the queue. If a match succeeds, the matched part of the pinyin of A is converted into characters; if it fails, the next subject word is considered, until the last subject word of the queue;

S4, level priority matching. The level priority matching module matches the whole-sentence pinyin to Chinese text by querying the difference frequency special word library. For example, Yue Tang station, Yue Gang Xiang Shi line, and grounding switch are special words, and their order by difference frequency value is Yue Tang station (level 1) > Yue Gang Xiang Shi line (level 2) > grounding switch (level 3).

1) Reverse word-taking: words are taken from the first-level sub-library one by one in descending order of difference frequency value, and for each word taken, pinyin string A is searched for a matching substring. Current matching methods take pinyin from string A and look it up in the word library; the method of this patent does the opposite, so it is called reverse word-taking.

2) Any-position conversion: current methods convert characters starting from the first letter; this method differs in that a substring can be converted at any position of string A. If the matching distance is larger than the given threshold, the word is abandoned and the next word is taken, until a word such as Yue Tang station, whose pinyin is 'yue4 tang2 zhan4', matches the corresponding part of pinyin string A. Pinyin string A then becomes [tou2 ru4 Yue Tang station yue4 gang1 xiang1 shi2 xian4 san1 wu3 jie1 di4 dao1 zha2 he2 san1 liu4 jie1 di4 dao1 zha2].

Reverse word-taking and any-position conversion are designed specifically for difference frequency special words and differ from the currently known methods. The remaining special words of string A are converted in the same way, giving [tou2 ru4 Yue Tang station Yue Gang Xiang Shi line san1 wu3 grounding switch he2 san1 liu4 grounding switch];

S5, frequency priority matching. The frequency priority matching module matches pinyin against general words. When all special words in string A have been converted, general words are matched by the known frequency priority method: working from front to back, tou2 ru4 is taken and looked up in the general dictionary to obtain 'input', and string A becomes [input Yue Tang station Yue Gang Xiang Shi line san1 wu3 grounding switch he2 san1 liu4 grounding switch];

S6, matching the remaining pinyin against single Chinese characters to obtain the whole-sentence text [input Yue Tang station Yue Gang Xiang Shi line 35 grounding switch and 36 grounding switch];

S7, outputting the whole-sentence text, and outputting the classified words obtained by matching to the subject word sharing unit, the difference frequency special word library, and the general word frequency dictionary, so as to refresh the subject word queue, the word frequencies, the difference frequency values, and the ordering. For example, the difference frequency values of the difference frequency words Yue Tang station and Yue Gang Xiang Shi line are refreshed and their order in the difference frequency special word library is updated; difference frequency words that did not appear need not be refreshed frequently. If these words already appeared in previous sentences, the subject word queue is refreshed; if they did not, they are added as new entries at the end of the queue.

S8, if speech continues to be input, go to S1; otherwise go to the next step;

S9, ending.
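Steps S5 and S6 can be sketched as follows, starting from the state of string A after S4. This is an illustrative sketch under stated assumptions, not the implementation of the invention. The names `frequency_priority_match`, `single_character_match`, and `render` are hypothetical, the dictionaries and their frequency figures are made up for the example, and the English glosses stand in for Chinese words and characters. Unconverted syllables are plain strings and converted words are wrapped in a 1-tuple.

```python
def frequency_priority_match(tokens, general_words):
    """S5: scan A from front to back and convert general words.

    general_words maps a tuple of toneless syllables to a list of
    (word, frequency) candidates; the highest frequency wins.
    """
    out, i = [], 0
    while i < len(tokens):
        best = None  # (frequency, word, number of syllables consumed)
        for pinyin, candidates in general_words.items():
            if tuple(tokens[i:i + len(pinyin)]) == pinyin:
                for word, frequency in candidates:
                    if best is None or frequency > best[0]:
                        best = (frequency, word, len(pinyin))
        if best is not None:
            out.append((best[1],))
            i += best[2]
        else:
            out.append(tokens[i])  # leave for S6, or already converted
            i += 1
    return out


def single_character_match(tokens, characters):
    """S6: convert each remaining syllable through a single-character table."""
    return [(characters[t],) if isinstance(t, str) and t in characters else t
            for t in tokens]


def render(tokens):
    """Join converted words and any leftover syllables into one string."""
    return " ".join(t[0] if isinstance(t, tuple) else t for t in tokens)


# State of A after S4 in the running example (tones dropped).
after_s4 = ["tou", "ru", ("Yue Tang station",), ("Yue Gang Xiang Shi line",),
            "san", "wu", ("grounding switch",), "he", "san", "liu",
            ("grounding switch",)]

# Illustrative dictionaries; the entries and frequencies are invented.
GENERAL = {("tou", "ru"): [("input", 5.0), ("rarer homophone", 1.0)]}
CHARACTERS = {"san": "3", "wu": "5", "he": "and", "liu": "6"}

after_s5 = frequency_priority_match(after_s4, GENERAL)
after_s6 = single_character_match(after_s5, CHARACTERS)
sentence = render(after_s6)
```

Because special words were already converted in S4, the general dictionary is consulted only for the stretches of pinyin that remain, which is what keeps frequent general words from overriding local power terms.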

The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims

1. A multi-channel voice recognition device is characterized by comprising a telephone monitor, a sound card, a voice function switching unit, a first voice-to-word unit and a second voice-to-word unit;

2. The multi-channel speech recognition device of claim 1, wherein the sound card, the speech function switching unit, the first speech-to-text unit, and the second speech-to-text unit are all built in the same computer, and the first speech-to-text unit and the second speech-to-text unit are respectively implemented in parallel by two cores of a CPU of the computer.

3. The multi-channel speech recognition device of claim 1, further comprising:

4. The multi-channel speech recognition device according to claim 3, wherein the differential frequency special word stock unit comprises a first-level, a second-level, a third-level and a fourth-level sub-stock module, wherein the first-level, the second-level, the third-level and the fourth-level differential frequency words and the differential frequency values thereof are stored, and words with higher differential frequency values in the same-level sub-stock are queued in front of the sub-stock;

5. The multi-channel speech recognition device of claim 1, wherein the first speech to text unit and the second speech to text unit are identical, each comprising the following modules:

6. The multi-channel speech recognition device of claim 5, wherein the level-first matching module comprises:

an any-position pinyin conversion module, used for searching A for a substring C similar to B and, if B and C match successfully, converting C into the corresponding Chinese phrase, wherein if A contains several substrings similar to B the operation is repeated, and the substring C may be located at any position of A.

7. The multi-channel speech recognition apparatus according to claim 1, further comprising a subject word sharing unit for extracting subject words in the existing dialogue text of the dispatcher and the on-site staff, and providing the extracted subject words to the first speech-to-text unit and the second speech-to-text unit for query to improve the subsequent dialogue recognition rate, comprising:

wherein i and j indicate that the word is repeated in the i-th and j-th sentences, the ellipsis stands for the other sentences in which the word is repeated, i, j < n, and G is the level of the sub-library of the difference frequency special word library to which the word belongs, an integer from 1 to 4; the topic values of all subject words in the first n sentences are calculated, and the subject words are then queued in descending order of topic value to obtain the subject word queue.

8. The multi-channel speech recognition device of claim 5, further comprising a subject word matching module for matching A against the subject word queue before the level priority matching module, starting from the first subject word of the queue, wherein when a match succeeds part of the pinyin of A is converted into characters, and when a match fails the next subject word is considered, until the last subject word of the queue, and wherein the module is started only when it is identified that the dispatcher and the field personnel are having a telephone conversation.

9. The multi-channel speech recognition device of claim 8, wherein the matching is accomplished by two modules comprising:

10. A speech to text method for a multi-channel speech recognition device as claimed in any one of claims 1 to 9, comprising the steps of:

S3, subject term matching;

S4, level priority matching;

S5, frequency priority matching;

S9, ending.