CN114627896B - Voice evaluation method, device, equipment and storage medium - Google Patents

Voice evaluation method, device, equipment and storage medium

Info

Publication number
CN114627896B
Authority
CN
China
Prior art keywords
audio signal
dialect
speech
phonemes
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210325744.5A
Other languages
Chinese (zh)
Other versions
CN114627896A (en)
Inventor
何梦中
李秀林
吴本谷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Databaker (Beijing) Technology Co., Ltd.
Original Assignee
Databaker (Beijing) Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Databaker (Beijing) Technology Co., Ltd.
Priority to CN202210325744.5A
Publication of CN114627896A
Application granted
Publication of CN114627896B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention provides a speech evaluation method, apparatus, device and storage medium. The method comprises: obtaining an audio signal to be evaluated; extracting acoustic features of each speech frame in the audio signal; determining, using the acoustic features, the probability that each speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary, so as to obtain pronunciation information of the speech frame; determining phonemes corresponding to standard text information based on a dialect dictionary, wherein the phonemes of words in the dialect dictionary comprise standard phonemes and dialect phonemes; aligning the speech frames of the audio signal with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame in the audio signal, so as to obtain alignment information of the speech frames; and determining an evaluation result of the audio signal relative to the standard text information according to the pronunciation information and alignment information of the speech frames. Therefore, the application range of speech evaluation technology can be effectively expanded, the varied requirements of users can be better met, and the user experience is improved.

Description

Voice evaluation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech evaluation method, apparatus, device, and storage medium.
Background
In recent years, with the continuous progress of technology, speech technology has been applied in various industries. For example, with the boom in online education over the past two years, speech evaluation technology has been used by numerous online education platforms to score a user's pronunciation and judge whether the pronunciation is standard.
Most current speech evaluation technologies evaluate Mandarin Chinese speech, standard British or American English speech, and the like. This greatly limits the application range of speech evaluation technology, makes it difficult to meet the varied requirements of users, and results in a poor application experience for users.
Therefore, a new speech evaluation technology is needed to solve the above-mentioned problems.
Disclosure of Invention
The present invention has been made in view of the above-described problems. According to one aspect of the invention, a speech evaluation method is provided, which comprises: obtaining an audio signal to be evaluated; extracting acoustic features of each speech frame in the audio signal; determining the probability that a speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary using the acoustic features, so as to obtain pronunciation information of the speech frame, wherein the phoneme dictionary comprises dialect phonemes; obtaining standard text information corresponding to the audio signal to be evaluated; determining phonemes corresponding to the standard text information based on a dialect dictionary, wherein the phonemes of words in the dialect dictionary comprise standard phonemes and dialect phonemes; aligning the speech frames of the audio signal with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame in the audio signal, so as to obtain alignment information of the speech frames; and determining an evaluation result of the audio signal relative to the standard text information according to the pronunciation information and alignment information of the speech frames.
The method further comprises the steps of conducting dialect classification on the audio signal to obtain a dialect scaling factor of the audio signal, determining that the dialect scaling factor is a first value for the case that the audio signal is dialect speech, and determining that the dialect scaling factor is a second value for the case that the audio signal is non-dialect speech, wherein the first value is larger than the second value, and wherein the evaluation result of the audio signal relative to the text information is determined according to the dialect scaling factor.
Illustratively, dialect classifying the audio signal to obtain a dialect scaling factor of the audio signal includes determining probabilities of the audio signal being dialect speech or non-dialect speech, respectively, using a classification model, and determining the dialect scaling factor based on the probabilities of the audio signal being dialect speech and/or the probabilities of the audio signal being non-dialect speech.
Illustratively, determining the dialect scaling factor based on the probability that the audio signal is dialect speech and/or the probability that the audio signal is non-dialect speech includes dividing the probability that the audio signal is dialect speech by the probability that the audio signal is non-dialect speech to obtain a quotient, and determining that the quotient is the dialect scaling factor.
Illustratively, determining the evaluation result of the audio signal relative to the text information comprises: determining the accuracy A, fluency B and completeness C of an audio sentence in the audio signal according to the pronunciation information and alignment information of the speech frames; calculating the score S of the audio sentence in the audio signal using the formula S = δ(a·A + b·B + c·C), where δ represents the dialect scaling factor, and a, b and c represent the weights of the accuracy A, fluency B and completeness C, respectively; and determining the evaluation result based on the score S of the audio sentence in the audio signal.
Illustratively, the method further comprises: extracting acoustic features of each speech frame in a training audio signal; and training an acoustic model using the acoustic features of the speech frames of the training audio signal, wherein the value of the loss function of the acoustic model is determined based on the computation result of the acoustic model and the dialect identifications carried by the dialect phonemes. Determining the probability that a speech frame of the audio signal is pronounced as each phoneme in the phoneme dictionary using the acoustic features to obtain the pronunciation information of the speech frame then comprises inputting the acoustic features into the acoustic model, so that the acoustic model outputs the probability that the speech frame of the audio signal is pronounced as each phoneme in the phoneme dictionary.
Illustratively, aligning the speech frames of the audio signal with phonemes corresponding to the text information based on acoustic features and pronunciation information of each speech frame in the audio signal to obtain alignment information of the speech frames includes generating a search space corresponding to standard text information based on phonemes corresponding to standard text information, wherein the search space includes a dialect phoneme path formed by the dialect phonemes, and determining the phonemes corresponding to each speech frame of the audio signal based on the acoustic features and pronunciation information of each speech frame and the search space.
According to another aspect of the present invention, there is also provided a voice evaluating apparatus, including:
a data acquisition module for acquiring an audio signal to be evaluated and standard text information corresponding to the audio signal;
a feature extraction module for extracting acoustic features of the audio signal;
a calculation module for determining the probability that a speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary using the acoustic features, so as to obtain pronunciation information of the speech frame, wherein the phoneme dictionary comprises dialect phonemes;
a phoneme determining module for determining phonemes corresponding to the standard text information based on a dialect dictionary, wherein the phonemes of words in the dialect dictionary comprise standard phonemes and dialect phonemes;
an alignment module for aligning the speech frames of the audio signal with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame in the audio signal, so as to obtain alignment information of the speech frames; and
an evaluation result determining module for determining an evaluation result of the audio signal relative to the standard text information according to the pronunciation information and alignment information of the speech frames.
According to still another aspect of the present invention, there is further provided a speech evaluation apparatus, including a sound collecting device, an input device, a processor and a memory, where the sound collecting device is configured to obtain an audio signal to be evaluated and send the audio signal to the processor, the input device is configured to input standard text information corresponding to the audio signal to be evaluated and send the standard text information to the processor, and the memory stores computer program instructions, where the computer program instructions are configured to execute the speech evaluation method as described above when the processor runs the computer program instructions.
According to still another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the speech evaluation method as described above when running.
In the technical scheme, the voice evaluation can be performed on the audio signal of the pronunciation of the dialect. Therefore, the application range of the voice evaluation technology can be effectively expanded, various requirements of users can be further met, and the experience of the users is improved.
The foregoing is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following more particular description of embodiments of the present invention, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the invention, are incorporated in and constitute a part of this specification, and serve to explain the invention together with the embodiments of the invention, without constituting a limitation of the invention. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 shows a schematic flow chart of a speech evaluation method according to one embodiment of the invention;
FIG. 2 shows a schematic flow chart of training an acoustic model for implementing forward computation according to one embodiment of the invention;
FIG. 3 shows a schematic flow chart of aligning a speech frame of an audio signal with a phoneme corresponding to standard text information to obtain alignment information of the speech frame, according to one embodiment of the invention;
FIG. 4a shows a schematic diagram of a search space according to one example of the prior art;
FIG. 4b shows a schematic diagram of a search space according to one embodiment of the invention;
FIG. 5 shows a schematic flow chart diagram of dialect classification of an audio signal to obtain a dialect scaling factor of the audio signal in accordance with one embodiment of the invention;
FIG. 6 shows a schematic flow chart of determining a dialect scaling factor based on a probability that an audio signal is dialect speech and/or a probability that an audio signal is non-dialect speech, according to one embodiment of the invention;
FIG. 7 shows a schematic flow chart of a speech evaluation method according to yet another embodiment of the invention;
FIG. 8 shows a schematic block diagram of a speech evaluation apparatus according to one embodiment of the invention, and
FIG. 9 shows a schematic block diagram of a speech evaluation device according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some, not all, embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. Based on the embodiments of the invention described in the present application, all other embodiments obtained by a person skilled in the art without inventive effort shall fall within the scope of the invention.
Due to differences in regional culture, the language used in different regions can differ to some extent in pronunciation, vocabulary or grammar. These differences are particularly pronounced in pronunciation, giving rise to local dialects, such as the Sichuan dialect or Tianjin dialect of Chinese, which are speech forms that circulate only within one region. As described above, existing speech evaluation technologies evaluate only the user's Mandarin Chinese pronunciation or standard British or American English pronunciation. However, in practical application scenarios, there is a need to evaluate dialects. For example, data support companies may wish to perform speech evaluation on dialect speech in order to quality-check corpus material. The prior art cannot satisfy such dialect evaluation requirements of users. In order to solve the above technical problems, the present application proposes a new speech evaluation method.
It will be appreciated that, in the following, examples are described with Chinese or English as the speech, but embodiments of the invention are not limited to these two languages and may be applied to any language having dialects, such as German.
FIG. 1 shows a schematic flow chart of a speech evaluation method 100 according to one embodiment of the invention. As shown in fig. 1, the speech evaluation method 100 may include the following steps.
Step S110, an audio signal to be evaluated is obtained.
Illustratively, the subject may utter the speech to be evaluated towards the electronic device. The speech to be evaluated may be any speech suitable for speech evaluation. The speech to be evaluated may be received by a sound collection device (e.g. a microphone) of the electronic device and converted by an analog-to-digital conversion circuit, so that the analog signal, i.e. the speech to be evaluated, is converted into a digital signal, i.e. an audio signal, which can be recognized and processed by the electronic device. The audio signal thus corresponds to the speech to be evaluated uttered by the subject. Alternatively, the audio signal to be evaluated may also be acquired from another device or a storage medium, in which it is stored in advance, by means of data transmission technology.
Step S120, extracting the acoustic feature of each speech frame in the audio signal acquired in step S110.
Preferably, the audio signal may be pre-processed. The preprocessing may include filtering, framing, and the like. For example, the audio signal is first filtered and sampled, thereby reducing interference from signals at frequencies outside the human voice range and/or at the 50 Hz mains frequency. In addition, the audio signal may be subjected to framing processing. Framing refers to the operation of slicing the audio signal into a plurality of small segments, thereby obtaining a plurality of speech frames, where each sliced small segment is referred to as a speech frame. The audio signal of each frame after the framing process has the characteristic of short-time stationarity. Optionally, the frame length of each speech frame may be set to a reasonable value between 20 milliseconds and 40 milliseconds, such as 25 milliseconds.
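As an illustration of the framing operation described above, the following Python sketch slices an audio signal into 25-millisecond speech frames; the 16 kHz sample rate and 10-millisecond frame shift are assumptions made for the example and are not specified in this application:

    import numpy as np

    def frame_signal(audio, sample_rate=16000, frame_ms=25.0, shift_ms=10.0):
        # Number of samples per speech frame and per frame shift.
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        n_frames = 1 + max(0, (len(audio) - frame_len) // shift)
        # Each row of the returned array is one short-time stationary speech frame.
        return np.stack([audio[i * shift:i * shift + frame_len] for i in range(n_frames)])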
Feature extraction operations may be performed on the preprocessed audio signal. For each speech frame, the acoustic features of that speech frame are extracted. The acoustic features may be represented by multi-dimensional vectors including content information corresponding to the speech frames. The acoustic features may include mel-frequency cepstral coefficient (MFCC) features, mel-scale Filter bank (Filter Banks) acoustic features, perceptual linear prediction coefficient (PLP) features, and the like. The method for extracting the acoustic features is not limited in the application, and any existing or future technology capable of extracting the acoustic features is within the protection scope of the application. By way of example and not limitation, code for extracting the corresponding acoustic features may be invoked directly from KALDI open source code.
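As a sketch of the feature extraction step, the following example computes MFCC features with the librosa library; librosa is used here only as an illustrative substitute for the KALDI code mentioned above, and the 13-dimensional setting and the file name are assumptions:

    import librosa

    # Load the speech to be evaluated (hypothetical file name) at 16 kHz.
    audio, sr = librosa.load("utterance.wav", sr=16000)
    # 25 ms analysis window with a 10 ms shift, 13 MFCCs per speech frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    features = mfcc.T  # shape: (number of speech frames, 13)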
Step S130, determining the probability of pronunciation of the voice frame of the audio signal as each phoneme in the phoneme dictionary by utilizing the acoustic features to obtain pronunciation information of the voice frame. Wherein the phonemic dictionary includes dialect phonemes.
The probability P(q|o), i.e. the probability that a speech frame of the audio signal is pronounced as a given phoneme in the phoneme dictionary, may be obtained, for example, by the forward computation of a neural network, where o represents the acoustic feature extracted from one speech frame in the audio signal and q represents a phoneme in the phoneme dictionary. In Chinese speech evaluation, the phoneme dictionary can be obtained by modeling the initials, finals and tones. In English speech evaluation, the phoneme dictionary may be obtained by modeling English phonemes. According to an embodiment of the present application, the phoneme dictionary includes dialect phonemes in addition to the standard phonemes corresponding to, for example, standard Chinese pronunciations. In other words, in the embodiment of the application, dialect phonemes are added to the existing phoneme dictionary so as to enable evaluation of speech to be evaluated that is pronounced as a dialect. It will be appreciated that dialect phonemes may differ in pronunciation from standard phonemes.
In a specific embodiment, if the audio signal is divided into 400 speech frames after framing, and the phoneme dictionary comprises 38 phonemes, a 400 x 38 matrix may be obtained by forward calculation of acoustic features for the speech frames. The number of rows of the matrix represents the number of frames of the speech frames in the audio signal, and the number of columns of the matrix represents the number of phonemes in the phoneme dictionary. Each element in the matrix represents a probability that a corresponding speech frame in the audio signal pronounces as a corresponding phoneme. This matrix is thus used to represent the voicing information of the speech frames of the audio signal.
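A minimal sketch of the pronunciation information described above, in which a hypothetical acoustic-model output is converted into a frame-by-phoneme probability matrix with a softmax (the 400 frames and 38 phonemes follow the example in this paragraph):

    import numpy as np

    n_frames, n_phonemes = 400, 38
    logits = np.random.randn(n_frames, n_phonemes)  # stand-in for acoustic model outputs
    # Softmax over each row: element (t, q) is the probability that frame t is pronounced as phoneme q.
    posteriors = np.exp(logits - logits.max(axis=1, keepdims=True))
    posteriors /= posteriors.sum(axis=1, keepdims=True)
    assert posteriors.shape == (400, 38)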
And step S140, obtaining standard text information corresponding to the audio signal to be evaluated.
The standard text information is illustratively the correct text information for the audio signal to be evaluated. Alternatively, the user may input standard text information using an input means (e.g., a keyboard) of the electronic device or acquire pre-stored standard text information from other electronic devices or storage means. It is understood that the execution order between step S140 and the aforementioned steps S110, S120 and S130 may be arbitrary. For example, step S140 may be performed before step S110, may be performed after step S110, and may be performed in the course of performing steps S110 to S130.
Step S150, determining phonemes corresponding to the standard text information based on the dialect dictionary. Wherein the phonemes of words in the dialect dictionary include standard phonemes and dialect phonemes.
It can be understood that the standard text information acquired in step S140 has corresponding correct phonemes. Based on the dialect dictionary, the phonemes corresponding to the standard text information can be determined. The dialect dictionary comprises words and the phonemes corresponding to the words, where the phonemes of a word include standard phonemes and dialect phonemes. For ease of understanding and description, the following description takes as an example the case in which the dialect phonemes in the dialect dictionary include Tianjin dialect phonemes. In one embodiment, the obtained standard text information is "camphor tree". Correspondingly, from the standard phonemes and dialect phonemes of this word in the dialect dictionary, the phonemes corresponding to the standard text can be determined as zh ang_1 sh u_4 (standard) and z ang_1 s u_4 (Tianjin dialect).
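The lookup below is a minimal sketch of how such a dialect dictionary might be represented; the data structure and function are hypothetical, and only the "camphor tree" entry with its standard and Tianjin dialect phonemes comes from the example above:

    dialect_dictionary = {
        "camphor tree": {
            "standard": ["zh", "ang_1", "sh", "u_4"],
            "dialect": ["z", "ang_1", "s", "u_4"],  # Tianjin dialect pronunciation
        },
    }

    def phonemes_for_text(words):
        # For each word, return both candidate pronunciations for building the search space.
        return [dialect_dictionary[w] for w in words]

    print(phonemes_for_text(["camphor tree"]))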
Step S160, aligning the speech frame of the audio signal with the phoneme corresponding to the standard text information based on the acoustic feature and pronunciation information of each speech frame in the audio signal to obtain the alignment information of the speech frame.
Illustratively, forced alignment is used to align the speech frames of the audio signal with the phonemes corresponding to the text information. Forced alignment may be achieved using models such as the Gaussian mixture model (GMM)-hidden Markov model (HMM), long short-term memory (LSTM)-connectionist temporal classification (CTC), convolutional neural network (CNN) and recurrent neural network (RNN). The inputs to the forced alignment are the acoustic features and pronunciation information of the speech frames and the phonemes corresponding to the standard text information. An output vector is obtained through forced alignment. The length of the output vector equals the number of speech frames of the audio signal. The output vector indicates the correct phoneme for each speech frame, i.e. the alignment information of the speech frames. Illustratively, the dialect dictionary may include 32 standard phonemes and 20 dialect phonemes. Each phoneme in the dialect dictionary, whether a standard phoneme or a dialect phoneme, is assigned an index number. Each element in the obtained output vector may represent the index number, in the dialect dictionary, of the phoneme in the standard text information that is aligned with the corresponding frame. For example, if the 51st element of the output vector has a value of 2, the phoneme aligned with the 51st frame of the audio signal is the phoneme with index number 2 in the dialect dictionary.
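The following sketch illustrates how the alignment output vector described above can be read; the index numbering and the short phoneme table are hypothetical:

    # Index number -> phoneme, as assigned in the dialect dictionary (hypothetical values).
    phoneme_table = {0: "sil", 1: "zh", 2: "e_4", 3: "sh", 4: "i_4"}

    # One index number per speech frame of the audio signal.
    alignment = [0, 1, 1, 1, 2, 2, 3, 3, 4, 4]
    aligned_phonemes = [phoneme_table[i] for i in alignment]
    # e.g. the phoneme aligned with frame 3 is phoneme_table[alignment[3]].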
Illustratively, forced alignment may be performed by an optimal path search on the HCLG graph. The grammar weighted finite-state transducer (the G graph) in the HCLG graph is generated based on the standard text information, and therefore always corresponds to the path corresponding to the standard text information, regardless of the audio signal. Thereby, the phoneme that each speech frame in the audio signal should correspond to is determined, i.e. forced alignment is achieved.
Step S170, determining the evaluation result of the audio signal relative to the standard text information according to the pronunciation information and the alignment information of the voice frame. The more ideal the evaluation result is, the higher the matching degree of the audio signal and the standard text information is.
As described above, the pronunciation information of the speech frame of the audio signal indicates the probability that the pronunciation of the speech frame is each possible phoneme, and the alignment information of the speech frame indicates which phoneme among phonemes corresponding to the standard text information the speech frame is aligned with. Based on the two, the evaluation result of the audio signal can be determined. The greater the probability that the pronunciation of the speech frame is the aligned phonemes, the more ideal the evaluation result is, otherwise, the reverse is true. For example, the evaluation result may be expressed as a specific score. The size of the score represents the degree of matching of the audio signal to be evaluated with the standard text information. The greater the score, the higher the degree of matching of the two.
In the technical scheme, the voice evaluation can be performed on the audio signal of the pronunciation of the dialect. Therefore, the application range of the voice evaluation technology can be effectively expanded, various requirements of users can be further met, and the experience of the users is improved.
Illustratively, the dialect phonemes may be provided with dialect identifications. For example, "#dlt" in the phonemes "z ang_1s u_4#dlt" is a dialect identification based on which it can be recognized whether the phonemes are dialect phonemes. FIG. 2 shows a schematic flow diagram of training an acoustic model for implementing forward computation according to one embodiment of the invention. For the aforementioned step S130, determining the probability that a speech frame of an audio signal is uttered as each phoneme in the phoneme dictionary using acoustic features to obtain the utterance information of the speech frame may be implemented using a trained acoustic model. As shown in fig. 2, training the acoustic model for implementing step S130 may include the following steps.
In step S210, acoustic features of each speech frame in the training audio signal are extracted.
The training audio signal may be, for example, any audio signal suitable for acoustic feature extraction training. Similar to step S120, the training audio signal may be preprocessed prior to extracting the acoustic features of each speech frame in the training audio signal. The preprocessing operation is already described in the previous step S120, and is not described here again for brevity. Feature extraction operations may be performed on the preprocessed training audio signal. Likewise, the feature extraction operation is also described in step S120, and is not described here again for brevity.
Step S220, training an acoustic model using acoustic features of a speech frame of the training audio signal. Wherein the value of the loss function of the acoustic model is determined based on the calculation of the acoustic model and the dialect identification.
Illustratively, the probabilities of the speech frames of the training audio signal being pronounced as individual phones in the phone dictionary are determined using the acoustic model, and the values of the loss function of the acoustic model may be determined based on the determined probabilities and dialect identifications. For the case where the training audio signal is a dialect audio signal, the value of the loss function may be decreased if a dialect identification occurs, and vice versa the value of the loss function may be increased.
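A minimal sketch of the dialect-aware loss adjustment described above, assuming a per-frame cross-entropy loss that is scaled down when a dialect training utterance yields a dialect-identified phoneme and scaled up otherwise; the scaling values 0.8 and 1.2 are assumptions, not values taken from this application:

    import numpy as np

    def frame_loss(posteriors, target_index, phoneme_labels, is_dialect_audio,
                   down=0.8, up=1.2):
        # Cross-entropy of the target phoneme for this frame.
        ce = -np.log(posteriors[target_index] + 1e-12)
        # Phoneme with the highest probability in the acoustic model's output.
        predicted = phoneme_labels[int(np.argmax(posteriors))]
        if is_dialect_audio:
            # Decrease the loss when a dialect identification ("#dlt") occurs, otherwise increase it.
            ce = ce * (down if predicted.endswith("#dlt") else up)
        return ce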
The acoustic model trained as described above is used for the forward computation, in other words, the step S130 of determining probabilities of pronunciation of the speech frame of the audio signal as each phoneme in the phoneme dictionary using the acoustic features may include inputting the acoustic features into the acoustic model to output the probabilities of pronunciation of the speech frame of the audio signal as each phoneme in the phoneme dictionary by the acoustic model.
It will be appreciated that, taking Chinese as an example, some Chinese characters are polyphones. According to the technical scheme, dialect phonemes are marked by using dialect identifiers. Therefore, the dialect phonemes and the phonemes of the polyphones are effectively distinguished, the training effect of the acoustic model is guaranteed, and further, a more accurate dialect evaluation result can be obtained based on the trained acoustic model in the subsequent voice evaluation process.
Fig. 3 shows a schematic flow chart of aligning a speech frame of an audio signal with a phoneme corresponding to standard text information to obtain alignment information of the speech frame according to an embodiment of the invention. As shown in fig. 3, step S160 may include the following steps.
Step S161, generating a search space corresponding to the standard text information based on the phonemes corresponding to the standard text information, wherein the search space includes a dialect phoneme path formed by the dialect phonemes.
Fig. 4a shows a schematic diagram of a search space according to one example of the prior art. As shown in fig. 4a, the input of the search space is standard phonemes and the output is words, and there is only one output path per phoneme. The text information corresponding to the search space shown in fig. 4a is "this is camphor tree", and the corresponding phonemes are zh e_4 sh i_4 zh ang_1 sh u_4.
As described above, the dialect dictionary includes the standard phonemes and dialect phonemes of each word. For example, the dialect dictionary includes Tianjin dialect phonemes. Correspondingly, the Tianjin dialect pronunciation of "this is camphor tree" is zh e_4 sh i_4 z ang_1 s u_4. Based on this, the phonemes of the text information and the corresponding search space may be determined from the dialect dictionary. FIG. 4b shows a schematic diagram of a search space according to one embodiment of the application. As shown in fig. 4b, the input of the search space contains standard phonemes and dialect phonemes, and its output is also words. Unlike the prior art, however, the search space of this embodiment of the present application includes both a standard phoneme path and a dialect phoneme path.
Step S162, determining a phoneme corresponding to each speech frame based on the acoustic feature and pronunciation information of each speech frame of the audio signal and the search space.
For example, if the obtained audio signal is Tianjin dialect speech, the optimal path selected in the search space generated in step S161, based on the acoustic features and pronunciation information of each speech frame obtained as described above, is the dialect phoneme path. Optionally, referring to fig. 4b, in the dialect phoneme path, the dialect phonemes are marked with the dialect identification "#dlt" appended after them. Conversely, if the obtained audio signal is Mandarin Chinese speech, the standard phoneme path is selected. Based on the selected path, the phoneme corresponding to each speech frame may be determined.
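As an illustration of selecting the optimal path in the search space, the sketch below scores the standard phoneme path and the dialect phoneme path by the average probability of their phonemes over the frames aligned to them and keeps the better one; this greedy scoring is an assumption used for illustration and is not the HCLG path search itself:

    import numpy as np

    def path_score(posteriors, frame_to_slot, path_indices):
        # posteriors: (n_frames, n_phonemes) pronunciation information from step S130.
        # frame_to_slot: for each frame, the position in the candidate path it covers.
        # path_indices: index numbers of the phonemes along one candidate path.
        return float(np.mean([posteriors[t, path_indices[s]]
                              for t, s in enumerate(frame_to_slot)]))

    def choose_path(posteriors, frame_to_slot, standard_path, dialect_path):
        s_std = path_score(posteriors, frame_to_slot, standard_path)
        s_dlt = path_score(posteriors, frame_to_slot, dialect_path)
        return ("dialect", dialect_path) if s_dlt > s_std else ("standard", standard_path)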
Therefore, through the search space comprising the dialect phoneme path, the correct phonemes corresponding to each voice frame can be determined, and the accuracy of the dialect voice evaluation result is ensured.
Illustratively, the method 100 may further comprise a step S180 of dialect classifying the audio signal to obtain a dialect scaling factor of the audio signal. The dialect scaling factor is determined to be a first value for the case where the audio signal is dialect speech. The dialect scaling factor is determined to be a second value for the case where the audio signal is non-dialect speech. Wherein the first value is greater than the second value. It will be appreciated that in this embodiment, step S180 is performed prior to step S170. Also, in this embodiment, the evaluation result of determining the audio signal with respect to the text information is based not only on the pronunciation information and the alignment information of the speech frame but also on the dialect scale factor.
Illustratively, the dialect scaling factor may be calculated using a dialect classification module. The input of the dialect classification module is an audio signal, and the output is a dialect scaling factor. The evaluation result is a specific evaluation score as an example. The dialect scaling factor may be applied as a scaling factor to the evaluation score to enable further calculation of the evaluation score. Specifically, when the audio signal is a dialect voice, the dialect scaling factor output by the dialect classification calculation module is a first value. For example, the first value may be any reasonable value greater than 1. The dialect scaling factor of the first value is then applied to the evaluation score, which can be amplified since it is greater than 1. In other words, when the audio signal is a dialect speech, the evaluation score may be amplified with a corresponding dialect scaling factor to increase its evaluation score. Otherwise, when the audio signal is non-dialect voice, the dialect scaling factor output by the dialect classification calculating module is a second value. The second value is less than the first value, which may be any reasonable value less than or equal to 1. When the dialect scaling factor is the second value, the evaluation score may be scaled down. That is, when the audio signal is non-dialect speech, the evaluation score may be scaled down or left unchanged with the corresponding dialect scaling factor.
Therefore, the dialect voice and the non-dialect voice are guaranteed to obtain more reasonable evaluation scores in the voice evaluation process by using the dialect scaling factors. The accuracy of dialect voice evaluation is effectively improved.
Fig. 5 shows a schematic flow chart of step S180 of dialect classification of an audio signal to obtain a dialect scaling factor of the audio signal according to one embodiment of the invention. As shown in fig. 5, step S180 may include the following steps.
In step S181, the probability that the audio signal is dialect speech or non-dialect speech is determined by using the classification model.
Illustratively, the classification model may be a trained LSTM-CNN neural network model. According to the above, the obtained audio signal to be evaluated can be input into the classification model. The classification model may calculate a corresponding probability that the audio signal is dialect speech or non-dialect speech. For example, P1 represents the probability that the audio signal is dialect speech and P2 represents the probability that the audio signal is non-dialect speech.
In step S182, a dialect scaling factor is determined based on the probability of the audio signal being dialect speech and/or the probability of the audio signal being non-dialect speech.
According to the above step S181, a probability P1 that the audio signal is dialect speech and/or a probability P2 that the audio signal is non-dialect speech are obtained. Illustratively, the numerical size of the dialect scaling factor may be determined from the numerical sizes of P1 and/or P2. For example, a numerical relation comparison table of the probability P1 that the audio signal is dialect voice and the dialect scaling factor is preset. When P1 is any value between 70% -80%, correspondingly, a dialect scaling factor of 1.5 can be found in the lookup table. When P1 is any value between 60% and 70%, correspondingly, a dialect scaling factor of 1.2 can be found in the lookup table, and so on. It will be appreciated that the above-described relationships are merely exemplary and are not meant to be limiting with respect to scaling factors.
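A minimal sketch of the preset comparison table described above; the two probability bands come from the example in the text, while the remaining bands and the fallback value are assumptions:

    def dialect_scaling_factor(p1):
        # p1: probability that the audio signal is dialect speech.
        table = [
            (0.70, 0.80, 1.5),  # 70%-80% -> factor 1.5 (example from the text)
            (0.60, 0.70, 1.2),  # 60%-70% -> factor 1.2 (example from the text)
        ]
        for low, high, factor in table:
            if low <= p1 < high:
                return factor
        return 1.0  # assumed placeholder for probabilities outside the listed bands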
It can be appreciated that the use of the classification model to determine the dialect scaling factor greatly reduces the workload of the system, improves the computing efficiency, and further improves the overall speed of speech evaluation.
Fig. 6 shows a schematic flow chart of step S182 of determining a dialect scaling factor based on the probability of the audio signal being dialect speech and/or the probability of the audio signal being non-dialect speech, according to one embodiment of the invention. As shown in fig. 6, step S182 may include the following steps.
In step S182a, the probability that the audio signal is dialect speech is divided by the probability that the audio signal is non-dialect speech to obtain a quotient.
According to the above step S181, the probability P1 that the audio signal is dialect speech and the probability P2 that the audio signal is non-dialect speech can be obtained. Dividing P1 by P2 may obtain a quotient.
In step S182b, the quotient is determined as the dialect scaling factor.
For example, the quotient value may be determined directly as a dialect scaling factor. Wherein if the audio signal is dialect speech, i.e. P1> P2, the value of the dialect scaling factor is larger than 1. Otherwise the value of the dialect scaling factor is less than or equal to 1.
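The quotient-based variant of steps S182a and S182b can be sketched as follows (assuming the probability P2 is non-zero):

    def scaling_factor_from_quotient(p1, p2):
        # p1: probability of dialect speech; p2: probability of non-dialect speech.
        # The quotient is greater than 1 when P1 > P2, i.e. when the audio is dialect speech.
        return p1 / p2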
The algorithm of the technical scheme is simple and easy to realize. And the calculated amount is small, and the calculation load of the system is reduced. Most importantly, the technical scheme comprehensively considers the probability that the audio signal is the dialect voice and the probability that the audio signal is the non-dialect voice to obtain the dialect scaling factor, and the dialect voice evaluation can be accurately performed by utilizing the dialect scaling factor.
Illustratively, the determining of the evaluation result of the audio signal with respect to the text information at step S170 may include the following steps.
First, the accuracy a, fluency B, and completeness C of an audio sentence in an audio signal are determined according to pronunciation information and alignment information of a speech frame.
The accuracy of the phonemes, words and sentences corresponding to the standard text information can be determined in turn from the probability of each phoneme for each speech frame in the pronunciation information and from the correct phoneme aligned with the speech frame in the alignment information. Accuracy indicates whether a phoneme, word or sentence in the audio signal is pronounced accurately. Specifically, the index number of the correct phoneme with which a speech frame is aligned may be determined according to the alignment information. For example, the audio signal is divided into 20 frames in total, and the phonemes of the standard text information corresponding to the audio signal are zh e_4 sh i_4 z ang_1 s u_4#dlt. According to the index numbers in the alignment information, frames 1-5 are aligned with the phoneme "zh". Correspondingly, the probability that each of these speech frames is pronounced as the phoneme "zh" can be found in the pronunciation information. Multiplying this probability by 100 yields a percentage score, and the accuracy of the phoneme "zh" can be determined based on this score; the respective scores of frames 1-5 may be averaged as the accuracy of the phoneme "zh". It will be appreciated that the respective accuracies of the phonemes zh e_4 sh i_4 z ang_1 s u_4#dlt may be obtained in the same way. Averaging the accuracy scores of the phonemes corresponding to a word, for example "camphor tree", gives the accuracy score of that word. Further, the accuracy score of a sentence may be obtained by averaging the accuracy scores of the words in the sentence. A corresponding accuracy threshold is set for each phoneme, word or sentence. Taking an accuracy threshold of 80 as an example, when the accuracy score exceeds 80, the corresponding phoneme, word or sentence can be considered to be pronounced correctly.
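A minimal sketch of the phoneme and word accuracy computation described above; the frame range and phoneme index in the example call are hypothetical, and posteriors is the pronunciation-information matrix from step S130:

    import numpy as np

    def phoneme_accuracy(posteriors, phoneme_index, frames):
        # Average, over the frames aligned with this phoneme, of the probability
        # that the frame is pronounced as this phoneme, scaled to a 0-100 score.
        return float(np.mean([posteriors[t, phoneme_index] * 100 for t in frames]))

    def word_accuracy(phoneme_scores):
        # Word accuracy is the mean of the accuracy scores of its phonemes.
        return float(np.mean(phoneme_scores))

    # e.g. accuracy of "zh", aligned with frames 1-5 (indices 0-4 here):
    # acc_zh = phoneme_accuracy(posteriors, phoneme_index=7, frames=range(0, 5))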
Fluency may be expressed as how many correctly pronounced words are read per second, which represents the fluency of the audio signal, and is related to the speed of speech, and the number of pauses. Optionally, a fluency threshold is set for fluency, e.g., 3 words/second. When the fluency does not reach the fluency threshold, the fluency score may be reduced appropriately. The degree of reduction is not limited in the present application.
Completeness may represent how many words in a sentence have scores exceeding a threshold, and represents the degree to which the audio signal reads out the standard text information completely. The threshold here includes the accuracy threshold and the fluency threshold. Preferably, a word in a sentence is considered a qualified word when its score exceeds both the accuracy threshold and the fluency threshold. The number of qualified words in the sentence is then determined, and the completeness score is determined according to the ratio of the number of qualified words to the number of all words in the sentence. The relationship between this ratio and the completeness score is not limited in the present application.
Then, the score S of the audio sentence in the audio signal is calculated using the formula S = δ(a·A + b·B + c·C), where δ represents the dialect scaling factor, and a, b and c represent the weights of the accuracy A, fluency B and completeness C, respectively. Optionally, a, b and c may be parameters obtained through machine learning, or may be set appropriately based on experience. Specifically, according to the above formula, the three scores are multiplied by their corresponding weights and summed to obtain an initial score. The initial score is then multiplied by the dialect scaling factor, so that it is scaled to a corresponding degree according to the value of the dialect scaling factor. The scaled value is the score S of the audio sentence.
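A minimal sketch of the sentence-score formula S = δ(a·A + b·B + c·C); the weights and the values in the example call are placeholders rather than values from this application:

    def sentence_score(accuracy, fluency, completeness, delta, a=0.5, b=0.3, c=0.2):
        # Weighted combination of the three scores, then scaled by the dialect scaling factor.
        initial = a * accuracy + b * fluency + c * completeness
        return delta * initial

    # e.g. sentence_score(accuracy=85, fluency=75, completeness=90, delta=1.2)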
Finally, an evaluation result is determined based on the score S of the audio sentence in the audio signal.
Alternatively, the score S of the audio sentence may be directly taken as the evaluation result. Alternatively, the score S of the audio sentence may be classified by grade, e.g., a score between 0 and 60 is rated as poor, a score between 61 and 80 as good, and a score between 81 and 100 as excellent. The classified result, i.e. excellent, good or poor, is then taken as the evaluation result.
According to the scheme, the final evaluation result can be obtained by integrating the scores of the three aspects, and the reliability of the evaluation result is ensured. In addition, the evaluation result is added with the consideration of dialect scaling factors, so that the accuracy of the evaluation result is further ensured.
Fig. 7 shows a schematic flow chart of a speech evaluation method according to a further embodiment of the invention. As shown in fig. 7, first, an audio signal to be evaluated and standard text information corresponding to the audio signal may be acquired simultaneously or at different times. Feature extraction is performed on the audio signal to obtain the acoustic features of each speech frame. The probability that each speech frame of the audio signal is pronounced as each phoneme in the phoneme dictionary can be determined using the extracted acoustic features. This probability can be represented by a matrix and is referred to as pronunciation information. For the standard text information, its corresponding phonemes may be determined based on a dialect dictionary, where the phonemes of words in the dialect dictionary include standard phonemes and dialect phonemes. Then, the speech frames of the audio signal are aligned with the phonemes corresponding to the standard text information based on the acoustic features and pronunciation information of each speech frame, thereby obtaining the alignment information of the speech frames. The alignment information indicates, for each speech frame of the audio signal, which phoneme among the phonemes corresponding to the standard text information the frame is aligned with. In addition, the audio signal may also be dialect-classified to obtain a dialect scaling factor. The evaluation result of the audio signal relative to the standard text information can then be determined according to the pronunciation information, alignment information and dialect scaling factor of the speech frames of the audio signal.
According to another aspect of the present invention, a speech evaluation apparatus is provided. Fig. 8 shows a schematic block diagram of a speech evaluation apparatus 800 according to an embodiment of the invention. As shown in fig. 8, the speech evaluation apparatus 800 may include the following modules.
The data acquisition module 810 is configured to acquire an audio signal to be evaluated and standard text information corresponding to the audio signal.
The feature extraction module 820 is configured to extract an acoustic feature of each speech frame of the audio signal.
A calculation module 830, configured to determine probabilities that a speech frame of the audio signal is pronounced as each phoneme in a phoneme dictionary using acoustic features, so as to obtain pronunciation information of the speech frame, where the phoneme dictionary includes dialect phonemes.
The phoneme determining module 840 is configured to determine phonemes corresponding to the standard text information based on the dialect dictionary, where phonemes of words in the dialect dictionary include standard phonemes and dialect phonemes.
An alignment module 850, configured to align, based on the acoustic feature and pronunciation information of each speech frame in the audio signal, the speech frame of the audio signal with a phoneme corresponding to the standard text information, so as to obtain alignment information of the speech frame.
The evaluation result determining module 860 is configured to determine an evaluation result of the audio signal with respect to the standard text information according to the pronunciation information and the alignment information of the speech frame.
It should be noted that, each component of the apparatus should be understood as a functional module established by implementing each step of the program flow or each step of the method, and each functional module is not limited by actual functional division or separation. The means defined by such a set of functional modules should be understood as a functional module architecture for implementing the solution mainly by means of the computer program described in the specification, and should not be understood as physical means for implementing the solution mainly by means of hardware.
According to still another aspect of the present invention, a speech evaluation apparatus is also provided. Fig. 9 shows a schematic block diagram of a speech evaluation device 900 according to one embodiment of the invention. As shown in fig. 9, the speech evaluation apparatus 900 may include a sound collection device 910, an input device 920, a processor 930, and a memory 940. The sound collection device 910 is configured to obtain an audio signal to be evaluated, and send the audio signal to the processor 930. The input device 920 is configured to input standard text information corresponding to the audio signal to be evaluated, and send the standard text information to the processor 930. The memory 940 has stored therein computer program instructions that, when executed by the processor, are adapted to carry out the speech evaluation method as described hereinbefore.
According to still another aspect of the present invention, there is also provided a storage medium. Program instructions are stored on the storage medium, which, when executed by a computer or processor, cause the computer or processor to perform the respective steps of the speech evaluation method of the embodiments of the present invention and to implement the respective modules of the speech evaluation apparatus and device according to the embodiments of the present invention. The storage medium may include, for example, a storage component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
It will be appreciated that those skilled in the art can understand the specific embodiments of the speech evaluation device, the speech evaluation apparatus and the storage medium and the advantages thereof by reading the above description of the speech evaluation method, and the details are not repeated herein for brevity.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above illustrative embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Various changes and modifications may be made therein by one of ordinary skill in the art without departing from the scope and spirit of the invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another device, or some features may be omitted or not performed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in order to streamline the invention and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some of the modules in a speech evaluation device according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention and the scope of the present invention is not limited thereto, and any person skilled in the art can easily think about variations or substitutions within the scope of the present invention. The protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

CN202210325744.5A (priority and filing date 2022-03-29): Voice evaluation method, device, equipment and storage medium. Status: Active. Granted as CN114627896B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210325744.5A (granted as CN114627896B (en)) | 2022-03-29 | 2022-03-29 | Voice evaluation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210325744.5A (granted as CN114627896B (en)) | 2022-03-29 | 2022-03-29 | Voice evaluation method, device, equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN114627896A (en) | 2022-06-14
CN114627896B (en) | 2025-06-13

Family

ID=81904597

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210325744.5A (Active, granted as CN114627896B (en)) | Voice evaluation method, device, equipment and storage medium | 2022-03-29 | 2022-03-29

Country Status (1)

Country | Link
CN (1) | CN114627896B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115273897B (en)* | 2022-08-05 | 2025-03-07 | 北京有竹居网络技术有限公司 | Method, device, equipment and storage medium for processing voice data
CN115359808B (en)* | 2022-08-22 | 2025-08-05 | 北京有竹居网络技术有限公司 | Method for processing voice data, model generation method, device, and electronic device
CN119832928B (en)* | 2025-02-24 | 2025-10-03 | 平安科技(深圳)有限公司 | Phoneme alignment method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109410914A (en)* | 2018-08-28 | 2019-03-01 | 江西师范大学 | A kind of Jiangxi dialect phonetic and dialect point recognition methods
CN114203201A (en)* | 2021-12-16 | 2022-03-18 | 深圳前海微众银行股份有限公司 | Oral evaluation method, apparatus, equipment, storage medium and program product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10943601B2 (en)* | 2017-05-31 | 2021-03-09 | Lenovo (Singapore) Pte. Ltd. | Provide output associated with a dialect
CN109545244A (en)* | 2019-01-29 | 2019-03-29 | 北京猎户星空科技有限公司 | Speech evaluating method, device, electronic equipment and storage medium
KR102442020B1 (en)* | 2019-11-15 | 2022-09-08 | 한국电子通信研究院 (ETRI) | Automatic fluency assessment method and device for speaking
CN111915940A (en)* | 2020-06-29 | 2020-11-10 | 厦门快商通科技股份有限公司 | Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109410914A (en)* | 2018-08-28 | 2019-03-01 | 江西师范大学 | A kind of Jiangxi dialect phonetic and dialect point recognition methods
CN114203201A (en)* | 2021-12-16 | 2022-03-18 | 深圳前海微众银行股份有限公司 | Oral evaluation method, apparatus, equipment, storage medium and program product

Also Published As

Publication number | Publication date
CN114627896A (en) | 2022-06-14


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information

Country or region after: China
Address after: Room c601, 6/F, building C, North Territory B-6, Dongsheng Science Park, Zhongguancun, No. 66, Xixiaokou Road, Haidian District, Beijing 100192
Applicant after: Beibei (Qingdao) Technology Co., Ltd.
Address before: Room c601, 6/F, building C, North Territory B-6, Dongsheng Science Park, Zhongguancun, No. 66, Xixiaokou Road, Haidian District, Beijing 100192
Applicant before: DATABAKER (BEIJING) TECHNOLOGY Co., Ltd.
Country or region before: China
TA01: Transfer of patent application right
Effective date of registration: 2025-05-02
Address after: No. 66 Xixiaokou Road, Zhongguancun Dongsheng Science and Technology Park, Haidian District, Beijing 100192, Unit 417, 4th Floor, Building B-1, Northern Territory
Applicant after: DATABAKER (BEIJING) TECHNOLOGY Co., Ltd.
Country or region after: China
Address before: Room c601, 6/F, building C, North Territory B-6, Dongsheng Science Park, Zhongguancun, No. 66, Xixiaokou Road, Haidian District, Beijing 100192
Applicant before: Beibei (Qingdao) Technology Co., Ltd.
Country or region before: China
GR01: Patent grant
