CN114627896B - Voice evaluation method, device, equipment and storage medium - Google Patents

Voice evaluation method, device, equipment and storage medium

Info

Publication number
CN114627896B
Authority
CN
China
Prior art keywords
audio signal
dialect
speech
phonemes
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210325744.5A
Other languages
Chinese (zh)
Other versions
CN114627896A (en)
Inventor
何梦中
李秀林
吴本谷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Databaker (Beijing) Technology Co., Ltd.
Original Assignee
Databaker (Beijing) Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Databaker (Beijing) Technology Co., Ltd.
Priority to CN202210325744.5A
Publication of CN114627896A
Application granted
Publication of CN114627896B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention provides a speech evaluation method, apparatus, device and storage medium. The method comprises: obtaining an audio signal to be evaluated; extracting acoustic features of each speech frame in the audio signal; determining, using the acoustic features, the probability that each speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary, so as to obtain pronunciation information of the speech frame; determining phonemes corresponding to standard text information based on a dialect dictionary, wherein the phonemes of words in the dialect dictionary comprise standard phonemes and dialect phonemes; aligning the speech frames of the audio signal with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame in the audio signal, so as to obtain alignment information of the speech frames; and determining an evaluation result of the audio signal relative to the standard text information according to the pronunciation information and alignment information of the speech frames. Therefore, the application range of speech evaluation technology can be effectively expanded, the varied requirements of users can be better met, and the user experience is improved.

Description

Voice evaluation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech evaluation method, apparatus, device, and storage medium.
Background
In recent years, with the continuous progress of technology, speech technology has been applied in various industries. For example, with the boom in online education over the past two years, speech evaluation technology has been used by numerous online education platforms to score a user's pronunciation and judge whether the pronunciation is standard.
Most current speech evaluation technologies evaluate Mandarin Chinese speech, standard British or American English speech, and the like. This greatly limits the application range of speech evaluation technology, makes it difficult to meet the varied requirements of users, and results in a poor application experience for users.
Therefore, a new speech evaluation technology is needed to solve the above-mentioned problems.
Disclosure of Invention
The present invention has been made in view of the above-described problems. According to one aspect of the invention, a speech evaluation method is provided, which comprises: obtaining an audio signal to be evaluated; extracting acoustic features of each speech frame in the audio signal; determining the probability that a speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary using the acoustic features, so as to obtain pronunciation information of the speech frame, wherein the phoneme dictionary comprises dialect phonemes; obtaining standard text information corresponding to the audio signal to be evaluated; determining phonemes corresponding to the standard text information based on a dialect dictionary, wherein the phonemes of words in the dialect dictionary comprise standard phonemes and dialect phonemes; aligning the speech frames of the audio signal with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame in the audio signal, so as to obtain alignment information of the speech frames; and determining an evaluation result of the audio signal relative to the standard text information according to the pronunciation information and alignment information of the speech frames.
The method further comprises the steps of conducting dialect classification on the audio signal to obtain a dialect scaling factor of the audio signal, determining that the dialect scaling factor is a first value for the case that the audio signal is dialect speech, and determining that the dialect scaling factor is a second value for the case that the audio signal is non-dialect speech, wherein the first value is larger than the second value, and wherein the evaluation result of the audio signal relative to the text information is determined according to the dialect scaling factor.
Illustratively, dialect classifying the audio signal to obtain a dialect scaling factor of the audio signal includes determining probabilities of the audio signal being dialect speech or non-dialect speech, respectively, using a classification model, and determining the dialect scaling factor based on the probabilities of the audio signal being dialect speech and/or the probabilities of the audio signal being non-dialect speech.
Illustratively, determining the dialect scaling factor based on the probability that the audio signal is dialect speech and/or the probability that the audio signal is non-dialect speech includes dividing the probability that the audio signal is dialect speech by the probability that the audio signal is non-dialect speech to obtain a quotient, and determining that the quotient is the dialect scaling factor.
Illustratively, determining the evaluation result of the audio signal relative to the text information comprises: determining the accuracy A, fluency B and completeness C of an audio sentence in the audio signal according to the pronunciation information and alignment information of the speech frames; calculating the score S of the audio sentence in the audio signal using the formula S = δ(a·A + b·B + c·C), where δ represents the dialect scaling factor, and a, b and c represent the weights of the accuracy A, fluency B and completeness C, respectively; and determining the evaluation result based on the score S of the audio sentence in the audio signal.
Illustratively, the method further comprises: extracting acoustic features of each speech frame in a training audio signal; and training an acoustic model using the acoustic features of the speech frames of the training audio signal, wherein the value of the loss function of the acoustic model is determined based on the computation result of the acoustic model and the dialect identifications carried by the dialect phonemes. Determining the probability that a speech frame of the audio signal is pronounced as each phoneme in the phoneme dictionary using the acoustic features to obtain the pronunciation information of the speech frame then comprises inputting the acoustic features into the acoustic model, so that the acoustic model outputs the probability that the speech frame of the audio signal is pronounced as each phoneme in the phoneme dictionary.
Illustratively, aligning the speech frames of the audio signal with phonemes corresponding to the text information based on acoustic features and pronunciation information of each speech frame in the audio signal to obtain alignment information of the speech frames includes generating a search space corresponding to standard text information based on phonemes corresponding to standard text information, wherein the search space includes a dialect phoneme path formed by the dialect phonemes, and determining the phonemes corresponding to each speech frame of the audio signal based on the acoustic features and pronunciation information of each speech frame and the search space.
According to another aspect of the present invention, there is also provided a voice evaluating apparatus, including:
a data acquisition module for acquiring an audio signal to be evaluated and standard text information corresponding to the audio signal;
a feature extraction module for extracting acoustic features of the audio signal;
a calculation module for determining the probability that a speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary using the acoustic features, so as to obtain pronunciation information of the speech frame, wherein the phoneme dictionary comprises dialect phonemes;
a phoneme determining module for determining phonemes corresponding to the standard text information based on a dialect dictionary, wherein the phonemes of words in the dialect dictionary comprise standard phonemes and dialect phonemes;
an alignment module for aligning the speech frames of the audio signal with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame in the audio signal, so as to obtain alignment information of the speech frames; and
an evaluation result determining module for determining an evaluation result of the audio signal relative to the standard text information according to the pronunciation information and alignment information of the speech frames.
According to still another aspect of the present invention, there is further provided a speech evaluation apparatus, including a sound collecting device, an input device, a processor and a memory, where the sound collecting device is configured to obtain an audio signal to be evaluated and send the audio signal to the processor, the input device is configured to input standard text information corresponding to the audio signal to be evaluated and send the standard text information to the processor, and the memory stores computer program instructions, where the computer program instructions are configured to execute the speech evaluation method as described above when the processor runs the computer program instructions.
According to still another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the speech evaluation method as described above when running.
In the technical scheme, the voice evaluation can be performed on the audio signal of the pronunciation of the dialect. Therefore, the application range of the voice evaluation technology can be effectively expanded, various requirements of users can be further met, and the experience of the users is improved.
The foregoing is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following more particular description of embodiments of the present invention, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the invention, are incorporated in and constitute a part of this specification, and serve to explain the invention together with the embodiments of the invention, without constituting a limitation of the invention. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 shows a schematic flow chart of a speech evaluation method according to one embodiment of the invention;
FIG. 2 shows a schematic flow chart of training an acoustic model for implementing forward computation according to one embodiment of the invention;
FIG. 3 shows a schematic flow chart of aligning a speech frame of an audio signal with a phoneme corresponding to standard text information to obtain alignment information of the speech frame, according to one embodiment of the invention;
FIG. 4a shows a schematic diagram of a search space according to one example of the prior art;
FIG. 4b shows a schematic diagram of a search space according to one embodiment of the invention;
FIG. 5 shows a schematic flow chart diagram of dialect classification of an audio signal to obtain a dialect scaling factor of the audio signal in accordance with one embodiment of the invention;
FIG. 6 shows a schematic flow chart of determining a dialect scaling factor based on a probability that an audio signal is dialect speech and/or a probability that an audio signal is non-dialect speech, according to one embodiment of the invention;
FIG. 7 shows a schematic flow chart of a speech evaluation method according to yet another embodiment of the invention;
FIG. 8 shows a schematic block diagram of a speech evaluation apparatus according to one embodiment of the invention, and
FIG. 9 shows a schematic block diagram of a speech evaluation device according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some, not all, embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. Based on the embodiments of the invention described in the present application, all other embodiments obtained by a person skilled in the art without inventive effort shall fall within the scope of the invention.
Due to differences in regional culture, the language used in different regions can differ to some extent in pronunciation, vocabulary or grammar. These differences are particularly pronounced in pronunciation, giving rise to local dialects, such as the Sichuan dialect or Tianjin dialect of Chinese, which are speech forms that circulate only within one region. As described above, existing speech evaluation technologies evaluate only the user's Mandarin Chinese pronunciation or standard British or American English pronunciation. However, in practical application scenarios, there is a need to evaluate dialects. For example, data support companies may wish to perform speech evaluation on dialect speech in order to quality-check corpus material. The prior art cannot satisfy such dialect evaluation requirements of users. In order to solve the above technical problems, the present application proposes a new speech evaluation method.
It will be appreciated that, in the following, examples are described with Chinese or English as the speech, but embodiments of the invention are not limited to these two languages and may be applied to any language having dialects, such as German.
FIG. 1 shows a schematic flow chart of a speech evaluation method 100 according to one embodiment of the invention. As shown in fig. 1, the speech evaluation method 100 may include the following steps.
Step S110, an audio signal to be evaluated is obtained.
Illustratively, the subject may utter the speech to be evaluated towards the electronic device. The speech to be evaluated may be any speech suitable for speech evaluation. The speech to be evaluated may be received by a sound collection device (e.g. a microphone) of the electronic device and converted by an analog-to-digital conversion circuit, so that the analog signal, i.e. the speech to be evaluated, is converted into a digital signal, i.e. an audio signal, which can be recognized and processed by the electronic device. The audio signal thus corresponds to the speech to be evaluated uttered by the subject. Alternatively, the audio signal to be evaluated may also be acquired from another device or a storage medium, in which it is stored in advance, by means of data transmission technology.
Step S120, extracting the acoustic feature of each speech frame in the audio signal acquired in step S110.
Preferably, the audio signal may be pre-processed. The preprocessing may include filtering, framing, and the like. For example, the audio signal is first filtered and sampled, thereby reducing interference from signals at frequencies outside the human voice range and/or at the 50 Hz mains frequency. In addition, the audio signal may be subjected to framing processing. Framing refers to the operation of slicing the audio signal into a plurality of small segments, thereby obtaining a plurality of speech frames, where each sliced small segment is referred to as a speech frame. The audio signal of each frame after the framing process has the characteristic of short-time stationarity. Optionally, the frame length of each speech frame may be set to a reasonable value between 20 milliseconds and 40 milliseconds, such as 25 milliseconds.
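As an illustration of the framing operation described above, the following Python sketch slices an audio signal into 25-millisecond speech frames; the 16 kHz sample rate and 10-millisecond frame shift are assumptions made for the example and are not specified in this application:

    import numpy as np

    def frame_signal(audio, sample_rate=16000, frame_ms=25.0, shift_ms=10.0):
        # Number of samples per speech frame and per frame shift.
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        n_frames = 1 + max(0, (len(audio) - frame_len) // shift)
        # Each row of the returned array is one short-time stationary speech frame.
        return np.stack([audio[i * shift:i * shift + frame_len] for i in range(n_frames)])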
Feature extraction operations may be performed on the preprocessed audio signal. For each speech frame, the acoustic features of that speech frame are extracted. The acoustic features may be represented by multi-dimensional vectors including content information corresponding to the speech frames. The acoustic features may include mel-frequency cepstral coefficient (MFCC) features, mel-scale Filter bank (Filter Banks) acoustic features, perceptual linear prediction coefficient (PLP) features, and the like. The method for extracting the acoustic features is not limited in the application, and any existing or future technology capable of extracting the acoustic features is within the protection scope of the application. By way of example and not limitation, code for extracting the corresponding acoustic features may be invoked directly from KALDI open source code.
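As a sketch of the feature extraction step, the following example computes MFCC features with the librosa library; librosa is used here only as an illustrative substitute for the KALDI code mentioned above, and the 13-dimensional setting and the file name are assumptions:

    import librosa

    # Load the speech to be evaluated (hypothetical file name) at 16 kHz.
    audio, sr = librosa.load("utterance.wav", sr=16000)
    # 25 ms analysis window with a 10 ms shift, 13 MFCCs per speech frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    features = mfcc.T  # shape: (number of speech frames, 13)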
Step S130, determining the probability of pronunciation of the voice frame of the audio signal as each phoneme in the phoneme dictionary by utilizing the acoustic features to obtain pronunciation information of the voice frame. Wherein the phonemic dictionary includes dialect phonemes.
The probability P(q|o), i.e. the probability that a speech frame of the audio signal is pronounced as a given phoneme in the phoneme dictionary, may be obtained, for example, by the forward computation of a neural network, where o represents the acoustic feature extracted from one speech frame in the audio signal and q represents a phoneme in the phoneme dictionary. In Chinese speech evaluation, the phoneme dictionary can be obtained by modeling the initials, finals and tones. In English speech evaluation, the phoneme dictionary may be obtained by modeling English phonemes. According to an embodiment of the present application, the phoneme dictionary includes dialect phonemes in addition to the standard phonemes corresponding to, for example, standard Chinese pronunciations. In other words, in the embodiment of the application, dialect phonemes are added to the existing phoneme dictionary so as to enable evaluation of speech to be evaluated that is pronounced as a dialect. It will be appreciated that dialect phonemes may differ in pronunciation from standard phonemes.
In a specific embodiment, if the audio signal is divided into 400 speech frames after framing, and the phoneme dictionary comprises 38 phonemes, a 400 x 38 matrix may be obtained by forward calculation of acoustic features for the speech frames. The number of rows of the matrix represents the number of frames of the speech frames in the audio signal, and the number of columns of the matrix represents the number of phonemes in the phoneme dictionary. Each element in the matrix represents a probability that a corresponding speech frame in the audio signal pronounces as a corresponding phoneme. This matrix is thus used to represent the voicing information of the speech frames of the audio signal.
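A minimal sketch of the pronunciation information described above, in which a hypothetical acoustic-model output is converted into a frame-by-phoneme probability matrix with a softmax (the 400 frames and 38 phonemes follow the example in this paragraph):

    import numpy as np

    n_frames, n_phonemes = 400, 38
    logits = np.random.randn(n_frames, n_phonemes)  # stand-in for acoustic model outputs
    # Softmax over each row: element (t, q) is the probability that frame t is pronounced as phoneme q.
    posteriors = np.exp(logits - logits.max(axis=1, keepdims=True))
    posteriors /= posteriors.sum(axis=1, keepdims=True)
    assert posteriors.shape == (400, 38)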
And step S140, obtaining standard text information corresponding to the audio signal to be evaluated.
The standard text information is illustratively the correct text information for the audio signal to be evaluated. Alternatively, the user may input standard text information using an input means (e.g., a keyboard) of the electronic device or acquire pre-stored standard text information from other electronic devices or storage means. It is understood that the execution order between step S140 and the aforementioned steps S110, S120 and S130 may be arbitrary. For example, step S140 may be performed before step S110, may be performed after step S110, and may be performed in the course of performing steps S110 to S130.
Step S150, determining phonemes corresponding to the standard text information based on the dialect dictionary. Wherein the phonemes of words in the dialect dictionary include standard phonemes and dialect phonemes.
It can be understood that the standard text information acquired in step S140 has corresponding correct phonemes. Based on the dialect dictionary, the phonemes corresponding to the standard text information can be determined. The dialect dictionary comprises words and the phonemes corresponding to the words, where the phonemes of a word include standard phonemes and dialect phonemes. For ease of understanding and description, the following description takes as an example the case in which the dialect phonemes in the dialect dictionary include Tianjin dialect phonemes. In one embodiment, the obtained standard text information is "camphor tree". Correspondingly, from the standard phonemes and dialect phonemes of this word in the dialect dictionary, the phonemes corresponding to the standard text can be determined as zh ang_1 sh u_4 (standard) and z ang_1 s u_4 (Tianjin dialect).
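The lookup below is a minimal sketch of how such a dialect dictionary might be represented; the data structure and function are hypothetical, and only the "camphor tree" entry with its standard and Tianjin dialect phonemes comes from the example above:

    dialect_dictionary = {
        "camphor tree": {
            "standard": ["zh", "ang_1", "sh", "u_4"],
            "dialect": ["z", "ang_1", "s", "u_4"],  # Tianjin dialect pronunciation
        },
    }

    def phonemes_for_text(words):
        # For each word, return both candidate pronunciations for building the search space.
        return [dialect_dictionary[w] for w in words]

    print(phonemes_for_text(["camphor tree"]))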
Step S160, aligning the speech frame of the audio signal with the phoneme corresponding to the standard text information based on the acoustic feature and pronunciation information of each speech frame in the audio signal to obtain the alignment information of the speech frame.
Illustratively, forced alignment is used to align the speech frames of the audio signal with the phonemes corresponding to the text information. Forced alignment may be achieved using models such as the Gaussian mixture model (GMM)-hidden Markov model (HMM), long short-term memory (LSTM)-connectionist temporal classification (CTC), convolutional neural network (CNN) and recurrent neural network (RNN). The inputs to the forced alignment are the acoustic features and pronunciation information of the speech frames and the phonemes corresponding to the standard text information. An output vector is obtained through forced alignment. The length of the output vector equals the number of speech frames of the audio signal. The output vector indicates the correct phoneme for each speech frame, i.e. the alignment information of the speech frames. Illustratively, the dialect dictionary may include 32 standard phonemes and 20 dialect phonemes. Each phoneme in the dialect dictionary, whether a standard phoneme or a dialect phoneme, is assigned an index number. Each element in the obtained output vector may represent the index number, in the dialect dictionary, of the phoneme in the standard text information that is aligned with the corresponding frame. For example, if the 51st element of the output vector has a value of 2, the phoneme aligned with the 51st frame of the audio signal is the phoneme with index number 2 in the dialect dictionary.
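The following sketch illustrates how the alignment output vector described above can be read; the index numbering and the short phoneme table are hypothetical:

    # Index number -> phoneme, as assigned in the dialect dictionary (hypothetical values).
    phoneme_table = {0: "sil", 1: "zh", 2: "e_4", 3: "sh", 4: "i_4"}

    # One index number per speech frame of the audio signal.
    alignment = [0, 1, 1, 1, 2, 2, 3, 3, 4, 4]
    aligned_phonemes = [phoneme_table[i] for i in alignment]
    # e.g. the phoneme aligned with frame 3 is phoneme_table[alignment[3]].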
Illustratively, forced alignment may be performed by an optimal path search on the HCLG graph. The grammar weighted finite-state transducer (the G graph) in the HCLG graph is generated based on the standard text information, and therefore always corresponds to the path corresponding to the standard text information, regardless of the audio signal. Thereby, the phoneme that each speech frame in the audio signal should correspond to is determined, i.e. forced alignment is achieved.
Step S170, determining the evaluation result of the audio signal relative to the standard text information according to the pronunciation information and the alignment information of the voice frame. The more ideal the evaluation result is, the higher the matching degree of the audio signal and the standard text information is.
As described above, the pronunciation information of the speech frame of the audio signal indicates the probability that the pronunciation of the speech frame is each possible phoneme, and the alignment information of the speech frame indicates which phoneme among phonemes corresponding to the standard text information the speech frame is aligned with. Based on the two, the evaluation result of the audio signal can be determined. The greater the probability that the pronunciation of the speech frame is the aligned phonemes, the more ideal the evaluation result is, otherwise, the reverse is true. For example, the evaluation result may be expressed as a specific score. The size of the score represents the degree of matching of the audio signal to be evaluated with the standard text information. The greater the score, the higher the degree of matching of the two.
In the technical scheme, the voice evaluation can be performed on the audio signal of the pronunciation of the dialect. Therefore, the application range of the voice evaluation technology can be effectively expanded, various requirements of users can be further met, and the experience of the users is improved.
Illustratively, the dialect phonemes may be provided with dialect identifications. For example, "#dlt" in the phonemes "z ang_1s u_4#dlt" is a dialect identification based on which it can be recognized whether the phonemes are dialect phonemes. FIG. 2 shows a schematic flow diagram of training an acoustic model for implementing forward computation according to one embodiment of the invention. For the aforementioned step S130, determining the probability that a speech frame of an audio signal is uttered as each phoneme in the phoneme dictionary using acoustic features to obtain the utterance information of the speech frame may be implemented using a trained acoustic model. As shown in fig. 2, training the acoustic model for implementing step S130 may include the following steps.
In step S210, acoustic features of each speech frame in the training audio signal are extracted.
The training audio signal may be, for example, any audio signal suitable for acoustic feature extraction training. Similar to step S120, the training audio signal may be preprocessed prior to extracting the acoustic features of each speech frame in the training audio signal. The preprocessing operation is already described in the previous step S120, and is not described here again for brevity. Feature extraction operations may be performed on the preprocessed training audio signal. Likewise, the feature extraction operation is also described in step S120, and is not described here again for brevity.
Step S220, training an acoustic model using acoustic features of a speech frame of the training audio signal. Wherein the value of the loss function of the acoustic model is determined based on the calculation of the acoustic model and the dialect identification.
Illustratively, the probabilities of the speech frames of the training audio signal being pronounced as individual phones in the phone dictionary are determined using the acoustic model, and the values of the loss function of the acoustic model may be determined based on the determined probabilities and dialect identifications. For the case where the training audio signal is a dialect audio signal, the value of the loss function may be decreased if a dialect identification occurs, and vice versa the value of the loss function may be increased.
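A minimal sketch of the dialect-aware loss adjustment described above, assuming a per-frame cross-entropy loss that is scaled down when a dialect training utterance yields a dialect-identified phoneme and scaled up otherwise; the scaling values 0.8 and 1.2 are assumptions, not values taken from this application:

    import numpy as np

    def frame_loss(posteriors, target_index, phoneme_labels, is_dialect_audio,
                   down=0.8, up=1.2):
        # Cross-entropy of the target phoneme for this frame.
        ce = -np.log(posteriors[target_index] + 1e-12)
        # Phoneme with the highest probability in the acoustic model's output.
        predicted = phoneme_labels[int(np.argmax(posteriors))]
        if is_dialect_audio:
            # Decrease the loss when a dialect identification ("#dlt") occurs, otherwise increase it.
            ce = ce * (down if predicted.endswith("#dlt") else up)
        return ce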
The acoustic model trained as described above is used for the forward computation, in other words, the step S130 of determining probabilities of pronunciation of the speech frame of the audio signal as each phoneme in the phoneme dictionary using the acoustic features may include inputting the acoustic features into the acoustic model to output the probabilities of pronunciation of the speech frame of the audio signal as each phoneme in the phoneme dictionary by the acoustic model.
It will be appreciated that, taking Chinese as an example, some Chinese characters are polyphones. According to the technical scheme, dialect phonemes are marked by using dialect identifiers. Therefore, the dialect phonemes and the phonemes of the polyphones are effectively distinguished, the training effect of the acoustic model is guaranteed, and further, a more accurate dialect evaluation result can be obtained based on the trained acoustic model in the subsequent voice evaluation process.
Fig. 3 shows a schematic flow chart of aligning a speech frame of an audio signal with a phoneme corresponding to standard text information to obtain alignment information of the speech frame according to an embodiment of the invention. As shown in fig. 3, step S160 may include the following steps.
Step S161, generating a search space corresponding to the standard text information based on the phonemes corresponding to the standard text information, wherein the search space includes a dialect phoneme path formed by the dialect phonemes.
Fig. 4a shows a schematic diagram of a search space according to one example of the prior art. As shown in fig. 4a, the input of the search space is standard phonemes and the output is words, and there is only one output path per phoneme. The text information corresponding to the search space shown in fig. 4a is "this is camphor tree", and the corresponding phonemes are zh e_4 sh i_4 zh ang_1 sh u_4.
As described above, the dialect dictionary includes the standard phonemes and dialect phonemes of each word. For example, the dialect dictionary includes Tianjin dialect phonemes. Correspondingly, the Tianjin dialect pronunciation of "this is camphor tree" is zh e_4 sh i_4 z ang_1 s u_4. Based on this, the phonemes of the text information and the corresponding search space may be determined from the dialect dictionary. FIG. 4b shows a schematic diagram of a search space according to one embodiment of the application. As shown in fig. 4b, the input of the search space contains standard phonemes and dialect phonemes, and its output is also words. Unlike the prior art, however, the search space of this embodiment of the present application includes both a standard phoneme path and a dialect phoneme path.
Step S162, determining a phoneme corresponding to each speech frame based on the acoustic feature and pronunciation information of each speech frame of the audio signal and the search space.
For example, if the obtained audio signal is Tianjin dialect speech, the optimal path selected in the search space generated in step S161, based on the acoustic features and pronunciation information of each speech frame obtained as described above, is the dialect phoneme path. Optionally, referring to fig. 4b, in the dialect phoneme path, the dialect phonemes are marked with the dialect identification "#dlt" appended after them. Conversely, if the obtained audio signal is Mandarin Chinese speech, the standard phoneme path is selected. Based on the selected path, the phoneme corresponding to each speech frame may be determined.
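As an illustration of selecting the optimal path in the search space, the sketch below scores the standard phoneme path and the dialect phoneme path by the average probability of their phonemes over the frames aligned to them and keeps the better one; this greedy scoring is an assumption used for illustration and is not the HCLG path search itself:

    import numpy as np

    def path_score(posteriors, frame_to_slot, path_indices):
        # posteriors: (n_frames, n_phonemes) pronunciation information from step S130.
        # frame_to_slot: for each frame, the position in the candidate path it covers.
        # path_indices: index numbers of the phonemes along one candidate path.
        return float(np.mean([posteriors[t, path_indices[s]]
                              for t, s in enumerate(frame_to_slot)]))

    def choose_path(posteriors, frame_to_slot, standard_path, dialect_path):
        s_std = path_score(posteriors, frame_to_slot, standard_path)
        s_dlt = path_score(posteriors, frame_to_slot, dialect_path)
        return ("dialect", dialect_path) if s_dlt > s_std else ("standard", standard_path)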
Therefore, through the search space comprising the dialect phoneme path, the correct phonemes corresponding to each voice frame can be determined, and the accuracy of the dialect voice evaluation result is ensured.
Illustratively, the method 100 may further comprise a step S180 of dialect classifying the audio signal to obtain a dialect scaling factor of the audio signal. The dialect scaling factor is determined to be a first value for the case where the audio signal is dialect speech. The dialect scaling factor is determined to be a second value for the case where the audio signal is non-dialect speech. Wherein the first value is greater than the second value. It will be appreciated that in this embodiment, step S180 is performed prior to step S170. Also, in this embodiment, the evaluation result of determining the audio signal with respect to the text information is based not only on the pronunciation information and the alignment information of the speech frame but also on the dialect scale factor.
Illustratively, the dialect scaling factor may be calculated using a dialect classification module. The input of the dialect classification module is an audio signal, and the output is a dialect scaling factor. The evaluation result is a specific evaluation score as an example. The dialect scaling factor may be applied as a scaling factor to the evaluation score to enable further calculation of the evaluation score. Specifically, when the audio signal is a dialect voice, the dialect scaling factor output by the dialect classification calculation module is a first value. For example, the first value may be any reasonable value greater than 1. The dialect scaling factor of the first value is then applied to the evaluation score, which can be amplified since it is greater than 1. In other words, when the audio signal is a dialect speech, the evaluation score may be amplified with a corresponding dialect scaling factor to increase its evaluation score. Otherwise, when the audio signal is non-dialect voice, the dialect scaling factor output by the dialect classification calculating module is a second value. The second value is less than the first value, which may be any reasonable value less than or equal to 1. When the dialect scaling factor is the second value, the evaluation score may be scaled down. That is, when the audio signal is non-dialect speech, the evaluation score may be scaled down or left unchanged with the corresponding dialect scaling factor.
Therefore, the dialect voice and the non-dialect voice are guaranteed to obtain more reasonable evaluation scores in the voice evaluation process by using the dialect scaling factors. The accuracy of dialect voice evaluation is effectively improved.
Fig. 5 shows a schematic flow chart of step S180 of dialect classification of an audio signal to obtain a dialect scaling factor of the audio signal according to one embodiment of the invention. As shown in fig. 5, step S180 may include the following steps.
In step S181, the probability that the audio signal is dialect speech or non-dialect speech is determined by using the classification model.
Illustratively, the classification model may be a trained LSTM-CNN neural network model. According to the above, the obtained audio signal to be evaluated can be input into the classification model. The classification model may calculate a corresponding probability that the audio signal is dialect speech or non-dialect speech. For example, P1 represents the probability that the audio signal is dialect speech and P2 represents the probability that the audio signal is non-dialect speech.
In step S182, a dialect scaling factor is determined based on the probability of the audio signal being dialect speech and/or the probability of the audio signal being non-dialect speech.
According to the above step S181, a probability P1 that the audio signal is dialect speech and/or a probability P2 that the audio signal is non-dialect speech are obtained. Illustratively, the numerical size of the dialect scaling factor may be determined from the numerical sizes of P1 and/or P2. For example, a numerical relation comparison table of the probability P1 that the audio signal is dialect voice and the dialect scaling factor is preset. When P1 is any value between 70% -80%, correspondingly, a dialect scaling factor of 1.5 can be found in the lookup table. When P1 is any value between 60% and 70%, correspondingly, a dialect scaling factor of 1.2 can be found in the lookup table, and so on. It will be appreciated that the above-described relationships are merely exemplary and are not meant to be limiting with respect to scaling factors.
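A minimal sketch of the preset comparison table described above; the two probability bands come from the example in the text, while the remaining bands and the fallback value are assumptions:

    def dialect_scaling_factor(p1):
        # p1: probability that the audio signal is dialect speech.
        table = [
            (0.70, 0.80, 1.5),  # 70%-80% -> factor 1.5 (example from the text)
            (0.60, 0.70, 1.2),  # 60%-70% -> factor 1.2 (example from the text)
        ]
        for low, high, factor in table:
            if low <= p1 < high:
                return factor
        return 1.0  # assumed placeholder for probabilities outside the listed bands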
It can be appreciated that the use of the classification model to determine the dialect scaling factor greatly reduces the workload of the system, improves the computing efficiency, and further improves the overall speed of speech evaluation.
Fig. 6 shows a schematic flow chart of step S182 of determining a dialect scaling factor based on the probability of the audio signal being dialect speech and/or the probability of the audio signal being non-dialect speech, according to one embodiment of the invention. As shown in fig. 6, step S182 may include the following steps.
In step S182a, the probability that the audio signal is dialect speech is divided by the probability that the audio signal is non-dialect speech to obtain a quotient.
According to the above step S181, the probability P1 that the audio signal is dialect speech and the probability P2 that the audio signal is non-dialect speech can be obtained. Dividing P1 by P2 may obtain a quotient.
In step S182b, the quotient is determined as the dialect scaling factor.
For example, the quotient value may be determined directly as a dialect scaling factor. Wherein if the audio signal is dialect speech, i.e. P1> P2, the value of the dialect scaling factor is larger than 1. Otherwise the value of the dialect scaling factor is less than or equal to 1.
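The quotient-based variant of steps S182a and S182b can be sketched as follows (assuming the probability P2 is non-zero):

    def scaling_factor_from_quotient(p1, p2):
        # p1: probability of dialect speech; p2: probability of non-dialect speech.
        # The quotient is greater than 1 when P1 > P2, i.e. when the audio is dialect speech.
        return p1 / p2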
The algorithm of the technical scheme is simple and easy to realize. And the calculated amount is small, and the calculation load of the system is reduced. Most importantly, the technical scheme comprehensively considers the probability that the audio signal is the dialect voice and the probability that the audio signal is the non-dialect voice to obtain the dialect scaling factor, and the dialect voice evaluation can be accurately performed by utilizing the dialect scaling factor.
Illustratively, the determining of the evaluation result of the audio signal with respect to the text information at step S170 may include the following steps.
First, the accuracy a, fluency B, and completeness C of an audio sentence in an audio signal are determined according to pronunciation information and alignment information of a speech frame.
The accuracy of the phonemes, words and sentences corresponding to the standard text information can be determined in turn from the probability of each phoneme for each speech frame in the pronunciation information and from the correct phoneme aligned with the speech frame in the alignment information. Accuracy indicates whether a phoneme, word or sentence in the audio signal is pronounced accurately. Specifically, the index number of the correct phoneme with which a speech frame is aligned may be determined according to the alignment information. For example, the audio signal is divided into 20 frames in total, and the phonemes of the standard text information corresponding to the audio signal are zh e_4 sh i_4 z ang_1 s u_4#dlt. According to the index numbers in the alignment information, frames 1-5 are aligned with the phoneme "zh". Correspondingly, the probability that each of these speech frames is pronounced as the phoneme "zh" can be found in the pronunciation information. Multiplying this probability by 100 yields a percentage score, and the accuracy of the phoneme "zh" can be determined based on this score; the respective scores of frames 1-5 may be averaged as the accuracy of the phoneme "zh". It will be appreciated that the respective accuracies of the phonemes zh e_4 sh i_4 z ang_1 s u_4#dlt may be obtained in the same way. Averaging the accuracy scores of the phonemes corresponding to a word, for example "camphor tree", gives the accuracy score of that word. Further, the accuracy score of a sentence may be obtained by averaging the accuracy scores of the words in the sentence. A corresponding accuracy threshold is set for each phoneme, word or sentence. Taking an accuracy threshold of 80 as an example, when the accuracy score exceeds 80, the corresponding phoneme, word or sentence can be considered to be pronounced correctly.
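A minimal sketch of the phoneme and word accuracy computation described above; the frame range and phoneme index in the example call are hypothetical, and posteriors is the pronunciation-information matrix from step S130:

    import numpy as np

    def phoneme_accuracy(posteriors, phoneme_index, frames):
        # Average, over the frames aligned with this phoneme, of the probability
        # that the frame is pronounced as this phoneme, scaled to a 0-100 score.
        return float(np.mean([posteriors[t, phoneme_index] * 100 for t in frames]))

    def word_accuracy(phoneme_scores):
        # Word accuracy is the mean of the accuracy scores of its phonemes.
        return float(np.mean(phoneme_scores))

    # e.g. accuracy of "zh", aligned with frames 1-5 (indices 0-4 here):
    # acc_zh = phoneme_accuracy(posteriors, phoneme_index=7, frames=range(0, 5))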
Fluency may be expressed as how many correctly pronounced words are read per second, which represents the fluency of the audio signal, and is related to the speed of speech, and the number of pauses. Optionally, a fluency threshold is set for fluency, e.g., 3 words/second. When the fluency does not reach the fluency threshold, the fluency score may be reduced appropriately. The degree of reduction is not limited in the present application.
Completeness may represent how many words in a sentence have scores exceeding a threshold, and represents the degree to which the audio signal reads out the standard text information completely. The threshold here includes the accuracy threshold and the fluency threshold. Preferably, a word in a sentence is considered a qualified word when its score exceeds both the accuracy threshold and the fluency threshold. The number of qualified words in the sentence is then determined, and the completeness score is determined according to the ratio of the number of qualified words to the number of all words in the sentence. The relationship between this ratio and the completeness score is not limited in the present application.
Then, the score S of the audio sentence in the audio signal is calculated using the formula S = δ(a·A + b·B + c·C), where δ represents the dialect scaling factor, and a, b and c represent the weights of the accuracy A, fluency B and completeness C, respectively. Optionally, a, b and c may be parameters obtained through machine learning, or may be set appropriately based on experience. Specifically, according to the above formula, the three scores are multiplied by their corresponding weights and summed to obtain an initial score. The initial score is then multiplied by the dialect scaling factor, so that it is scaled to a corresponding degree according to the value of the dialect scaling factor. The scaled value is the score S of the audio sentence.
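A minimal sketch of the sentence-score formula S = δ(a·A + b·B + c·C); the weights and the values in the example call are placeholders rather than values from this application:

    def sentence_score(accuracy, fluency, completeness, delta, a=0.5, b=0.3, c=0.2):
        # Weighted combination of the three scores, then scaled by the dialect scaling factor.
        initial = a * accuracy + b * fluency + c * completeness
        return delta * initial

    # e.g. sentence_score(accuracy=85, fluency=75, completeness=90, delta=1.2)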
Finally, an evaluation result is determined based on the score S of the audio sentence in the audio signal.
Alternatively, the score S of the audio sentence may be directly taken as the evaluation result. Alternatively, the score S of the audio sentence may be classified by grade, e.g., a score between 0 and 60 is rated as poor, a score between 61 and 80 as good, and a score between 81 and 100 as excellent. The classified result, i.e. excellent, good or poor, is then taken as the evaluation result.
According to the scheme, the final evaluation result can be obtained by integrating the scores of the three aspects, and the reliability of the evaluation result is ensured. In addition, the evaluation result is added with the consideration of dialect scaling factors, so that the accuracy of the evaluation result is further ensured.
Fig. 7 shows a schematic flow chart of a speech evaluation method according to a further embodiment of the invention. As shown in fig. 7, first, an audio signal to be evaluated and standard text information corresponding to the audio signal may be acquired simultaneously or at different times. Feature extraction is performed on the audio signal to obtain the acoustic features of each speech frame. The probability that each speech frame of the audio signal is pronounced as each phoneme in the phoneme dictionary can be determined using the extracted acoustic features. This probability can be represented by a matrix and is referred to as pronunciation information. For the standard text information, its corresponding phonemes may be determined based on a dialect dictionary, where the phonemes of words in the dialect dictionary include standard phonemes and dialect phonemes. Then, the speech frames of the audio signal are aligned with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame, thereby obtaining the alignment information of the speech frames. The alignment information indicates, for each speech frame of the audio signal, which phoneme among the phonemes corresponding to the standard text information the frame is aligned with. In addition, the audio signal may also be dialect-classified to obtain a dialect scaling factor. The evaluation result of the audio signal relative to the standard text information can then be determined according to the pronunciation information, alignment information and dialect scaling factor of the speech frames of the audio signal.
According to another aspect of the present invention, a speech evaluation apparatus is provided. Fig. 8 shows a schematic block diagram of a speech evaluation apparatus 800 according to an embodiment of the invention. As shown in fig. 8, the speech evaluation apparatus 800 may include the following modules.
The data acquisition module 810 is configured to acquire an audio signal to be evaluated and standard text information corresponding to the audio signal.
The feature extraction module 820 is configured to extract an acoustic feature of each speech frame of the audio signal.
A calculation module 830, configured to determine probabilities that a speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary using acoustic features, so as to obtain pronunciation information of the speech frame, where the phoneme dictionary includes dialect phonemes.
The phoneme determining module 840 is configured to determine phonemes corresponding to the standard text information based on the dialect dictionary, where phonemes of words in the dialect dictionary include standard phonemes and dialect phonemes.
An alignment module 850, configured to align, based on the acoustic feature and pronunciation information of each speech frame in the audio signal, the speech frame of the audio signal with a phoneme corresponding to the standard text information, so as to obtain alignment information of the speech frame.
The evaluation result determining module 860 is configured to determine an evaluation result of the audio signal with respect to the standard text information according to the pronunciation information and the alignment information of the speech frame.
It should be noted that, each component of the apparatus should be understood as a functional module established by implementing each step of the program flow or each step of the method, and each functional module is not limited by actual functional division or separation. The means defined by such a set of functional modules should be understood as a functional module architecture for implementing the solution mainly by means of the computer program described in the specification, and should not be understood as physical means for implementing the solution mainly by means of hardware.
According to still another aspect of the present invention, a speech evaluation apparatus is also provided. Fig. 9 shows a schematic block diagram of a speech evaluation device 900 according to one embodiment of the invention. As shown in fig. 9, the speech evaluation apparatus 900 may include a sound collection device 910, an input device 920, a processor 930, and a memory 940. The sound collection device 910 is configured to obtain an audio signal to be evaluated, and send the audio signal to the processor 930. The input device 920 is configured to input standard text information corresponding to the audio signal to be evaluated, and send the standard text information to the processor 930. The memory 940 has stored therein computer program instructions that, when executed by the processor, are adapted to carry out the speech evaluation method as described hereinbefore.
According to still another aspect of the present invention, there is also provided a storage medium. Program instructions are stored on the storage medium, which, when executed by a computer or processor, cause the computer or processor to perform the respective steps of the speech evaluation method of the embodiments of the present invention and to implement the respective modules of the speech evaluation apparatus and device according to the embodiments of the present invention. The storage medium may include, for example, a storage component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
It will be appreciated that those skilled in the art can understand the specific embodiments of the speech evaluation device, the speech evaluation apparatus and the storage medium and the advantages thereof by reading the above description of the speech evaluation method, and the details are not repeated herein for brevity.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above illustrative embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Various changes and modifications may be made therein by one of ordinary skill in the art without departing from the scope and spirit of the invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another device, or some features may be omitted or not performed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in order to streamline the invention and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some of the modules in a speech evaluation device according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention and the scope of the present invention is not limited thereto, and any person skilled in the art can easily think about variations or substitutions within the scope of the present invention. The protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

CN202210325744.5A (priority and filing date 2022-03-29): Voice evaluation method, device, equipment and storage medium. Status: Active. Granted as CN114627896B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210325744.5A (granted as CN114627896B (en)) | 2022-03-29 | 2022-03-29 | Voice evaluation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210325744.5A (granted as CN114627896B (en)) | 2022-03-29 | 2022-03-29 | Voice evaluation method, device, equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN114627896A (en) | 2022-06-14
CN114627896B (en) | 2025-06-13

Family

ID=81904597

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210325744.5A (Active, granted as CN114627896B (en)) | Voice evaluation method, device, equipment and storage medium | 2022-03-29 | 2022-03-29

Country Status (1)

Country | Link
CN (1) | CN114627896B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115273897B (en)* | 2022-08-05 | 2025-03-07 | 北京有竹居网络技术有限公司 | Method, device, equipment and storage medium for processing voice data
CN115359808B (en)* | 2022-08-22 | 2025-08-05 | 北京有竹居网络技术有限公司 | Method for processing voice data, model generation method, device, and electronic device
CN119832928B (en)* | 2025-02-24 | 2025-10-03 | 平安科技(深圳)有限公司 | Phoneme alignment method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109410914A (en)* | 2018-08-28 | 2019-03-01 | 江西师范大学 | A kind of Jiangxi dialect phonetic and dialect point recognition methods
CN114203201A (en)* | 2021-12-16 | 2022-03-18 | 深圳前海微众银行股份有限公司 | Oral evaluation method, apparatus, equipment, storage medium and program product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10943601B2 (en)* | 2017-05-31 | 2021-03-09 | Lenovo (Singapore) Pte. Ltd. | Provide output associated with a dialect
CN109545244A (en)* | 2019-01-29 | 2019-03-29 | 北京猎户星空科技有限公司 | Speech evaluating method, device, electronic equipment and storage medium
KR102442020B1 (en)* | 2019-11-15 | 2022-09-08 | 한국电子通信研究院 (ETRI) | Automatic fluency assessment method and device for speaking
CN111915940A (en)* | 2020-06-29 | 2020-11-10 | 厦门快商通科技股份有限公司 | Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109410914A (en)* | 2018-08-28 | 2019-03-01 | 江西师范大学 | A kind of Jiangxi dialect phonetic and dialect point recognition methods
CN114203201A (en)* | 2021-12-16 | 2022-03-18 | 深圳前海微众银行股份有限公司 | Oral evaluation method, apparatus, equipment, storage medium and program product

Also Published As

Publication number | Publication date
CN114627896A (en) | 2022-06-14


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information

Country or region after: China
Address after: Room c601, 6/F, building C, North Territory B-6, Dongsheng Science Park, Zhongguancun, No. 66, Xixiaokou Road, Haidian District, Beijing 100192
Applicant after: Beibei (Qingdao) Technology Co., Ltd.
Address before: Room c601, 6/F, building C, North Territory B-6, Dongsheng Science Park, Zhongguancun, No. 66, Xixiaokou Road, Haidian District, Beijing 100192
Applicant before: DATABAKER (BEIJING) TECHNOLOGY Co., Ltd.
Country or region before: China
TA01: Transfer of patent application right
Effective date of registration: 2025-05-02
Address after: No. 66 Xixiaokou Road, Zhongguancun Dongsheng Science and Technology Park, Haidian District, Beijing 100192, Unit 417, 4th Floor, Building B-1, Northern Territory
Applicant after: DATABAKER (BEIJING) TECHNOLOGY Co., Ltd.
Country or region after: China
Address before: Room c601, 6/F, building C, North Territory B-6, Dongsheng Science Park, Zhongguancun, No. 66, Xixiaokou Road, Haidian District, Beijing 100192
Applicant before: Beibei (Qingdao) Technology Co., Ltd.
Country or region before: China
GR01: Patent grant
