Disclosure of Invention
In order to solve the above technical problems, the invention provides a fully automatic segmentation method for long speech: a label-free automatic sentence segmentation algorithm with high accuracy. The algorithm fuses an unsupervised forced-alignment algorithm based on HMMs (hidden Markov models) with a semi-supervised learning method. By establishing an iteration mechanism based on a shared time axis, the semi-supervised minimal-label sentence segmentation algorithm automatically expands the small, accurate label set provided by the label-free sentence segmentation algorithm, so as to maximize the accurate label set; the original long speech is then segmented into smaller paragraphs or sentence sets according to the correct sentence boundaries obtained.
The technical scheme is as follows:
A fully automatic segmentation method for long speech comprises the following steps:
(1) providing accurate time data of labeled sentence boundaries by the label-free sentence segmentation system (ZLSS) method, and mapping the time data to the input of the minimal-label sentence classification system (MLSS) algorithm through a HashMap tracking and lookup mechanism, according to the correspondence on the time axis;
(2) using a boundary feature extraction program, extracting the corresponding data-frame features from the original file according to the mapped time data, in preparation for the classification iterations of co-training (Co_training). It should be noted here that the boundary feature extraction program is embedded in the MLSS algorithm, and its extraction object is the original long, multi-paragraph audio; the time information of the corresponding sentence boundary points is likewise relative to the initial long-passage speech. Before the subsequent steps are performed, the extraction program extracts feature information for all candidate sentence boundaries (Sentence Boundary candidates).
(3) adding the boundary feature information of the correct period positions extracted in the previous step into the training set of MLSS for Co_training, and classifying further to obtain more new periods; the MLSS algorithm is in essence a binary classifier based on Maximum Entropy and the Co_training algorithm.
(4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program further maps these onto the time axis consistent with ZLSS, and then the next step is carried out;
(5) making a judgment as to whether a new period has been found in the current iteration; if not, the whole iteration process ends, and if a new period has been found, the next step is carried out;
(6) after the time point information output by the conversion program is obtained, the segmentation method provided by ZLSS is used to further segment the current long-passage speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration;
(7) one point to note here: in the ZLSS method, during each iteration the original text and speech are segmented into relatively smaller paragraphs or sentences according to the sentence positions found; the time information of the original text and speech is discarded, while only the new sentence information found in the current iteration (times relative to the smaller paragraphs) is retained. Therefore, a HashMap-based tracking and lookup mechanism is adopted to map all the correct sentence time information found back onto the initial time axis, in preparation for the next classification iteration;
(8) The above steps are repeatedly performed.
Further preferably, a HashMap-based tracking and lookup mechanism is adopted to map all the correct sentence time information found onto the initial time axis, in preparation for the next classification iteration.
Further preferably, the ZLSS method treats the silence at a sentence boundary as an independent phoneme sil. First, the hidden Markov model of each phoneme in the speech is trained through an HMM-based unsupervised method and the Flat-start training algorithm, and the phoneme sequence of the passage is aligned with the text of the passage through Viterbi forced alignment. Finally, according to the sentence-ending symbols in the text, a strict checking mechanism judges whether each segmented sentence is correct, thereby obtaining a small correct boundary label set.
Further preferably, the ZLSS method introduces an iterative algorithm: firstly, the passage speech is segmented into paragraph speech and sentence speech according to the correct, verified sil given by the above checking mechanism; then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the result of the previous iteration, i.e., whether a new correct sil has been found; if so, the previous speech and text are replaced with the new result, the HMMs are retrained, and the iteration continues; if there is no increase, the iteration process is over.
Further preferably, the method further comprises automatically expanding the precise annotation set:
firstly, classification of vowel/consonant/pause (V/C/P) on audio frame segments (Frame-Clips) is studied using prosodic features; secondly, minimal-label sentence boundary detection is realized according to the Co_training and Active Learning frameworks, searching for sentence boundaries within pauses; finally, a strict error detection mechanism based on prosodic features is studied to determine the exact sentence boundaries.
Further preferably, the V/C/P classification proceeds as follows: firstly, the original audio data is divided into non-overlapping frames of 20 ms; then the Energy, zero-crossing rate (ZCR) and pitch frequency (Pitch) of each frame are calculated; and then the energy curve and the pitch frequency curve are smoothed;
after the above features are extracted, the vowel/consonant/pause classification is performed on each frame of data according to these three features; the classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining threshold values for energy
An appropriate energy threshold is needed to detect pauses, so statistics are gathered from the labeled data provided by ZLSS according to the following principle:
I. for the labeled data provided by the labeling system, each labeled item comprises a number of data frames; after the data frames are obtained, the average of their energies is calculated;
II. the energy threshold is set to the maximum of all the energy averages;
after this statistical analysis, the energy threshold was set to 0.005.
2) The mean and variance of the zero-crossing rates (MZCR and VZCR) are calculated, with the threshold TZCR for ZCR being defined as:
TZCR=MZCR+0.005VZCR
3) the data frames are V/C/P classified according to the following criteria, FrameType being used to indicate the type of frame:
If ZCR > TZCR, then FrameType is Consonant;
otherwise, if Energy < 0.005, then FrameType is Pause;
otherwise, FrameType is Vowel.
4) performing same-category merging on the V/C/P-classified frames, i.e., treating consecutive data frames of the same category as one variable-length segment of that category; if a short consonant segment lies between two adjacent pause segments, the segments are merged and the middle C is replaced with P;
5) performing vowel segmentation: if a vowel segment lasts too long, it is split at its energy valleys; where the energy curve has no valley, it is split into equal-length parts.
A consonant segment is considered short if its duration is less than two frames.
Further preferably, an error detection mechanism is introduced, and the workflow thereof is briefly as follows:
A) taking periods, question marks and semicolons as sentence boundary identifiers in the text, with a, e, i, o and u representing vowels in the text, the total number of vowels in the text is calculated and recorded as TV; for each boundary in the text, the numbers of vowels from the start of the text to the boundary and from the boundary to the end of the text are calculated and recorded as TP and TS respectively; several consecutive vowels are treated as one vowel;
B) for each candidate boundary found by the classifier, the total number of vowels AV is calculated based on the V/C/P classification result, and the numbers of vowels from the start of the V/C/P classification result to the candidate boundary and from the candidate boundary to the end of the audio are calculated, recorded as AP and AS respectively; C) if either |AP/AV − TP/TV| or |AS/AV − TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary. Note that the value 0.015 is obtained by a statistical method: the above formula is evaluated on a number of original text paragraphs and speech, and the average is taken as the threshold for period boundary determination.
Further preferably, the Co_training-based minimal-label sentence segmentation algorithm is divided into four steps: firstly, V/C/P classification is performed on the audio; then feature extraction is performed on the data frames; next, classifiers are trained and used for classification; finally, the classification result is passed to the checking mechanism to further ensure its correctness.
Preferably, the labeling result automatically obtained from the ZLSS iteration process is used as the input of the MLSS algorithm; the MLSS algorithm expands this automatic labeling result, and the expanded result is in turn used as the input of the ZLSS algorithm. The two algorithms thus adopt a rolling iteration in which each serves as the other's input and output, continuously expanding the effective labels until a maximized accurate label set is formed; the entire result is obtained automatically.
Compared with the prior art, the beneficial effects of the invention lie in its good performance on multi-section long speech segmentation, embodied mainly in two aspects. Firstly, the method avoids generating sentence boundaries (Sentence Boundary) containing human errors, as may occur with SFA-1; such errors adversely affect boundary detection and prosodic parameter extraction, and thus directly affect the quality of the final synthesized speech. Secondly, although the SFA-2 method reduces the memory consumption and computational cost of the whole process, it still combines multiple clauses into one large paragraph, which prevents the Viterbi forced-alignment algorithm from performing optimally: the combinatorial explosion of search paths causes misplacement of the decoded state sequence and degrades the performance of the whole system to a great extent. The present invention avoids this disadvantage of the SFA-2 method.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings and embodiments.
Method for automatically segmenting sentences based on HMM (hidden Markov model) of spectral parameters and prosodic parameters
Full-automatic sentence segmentation algorithm introduction
Firstly, the label-free automatic sentence segmentation algorithm and the semi-supervised minimal-label sentence segmentation algorithm are regarded as two subsystems of the overall automatic sentence segmentation system framework. The former, the label-free sentence segmentation algorithm based on forced alignment, serves as a labeling system providing a small amount of trainable precise data, and is named the Sub-Labeling System (ZLSS). The latter, the minimal-label automatic sentence segmentation algorithm, serves as a classification system, named the Sub-Classification System (MLSS), for automatically expanding the small, accurately labeled data set provided by ZLSS. A systematic description of the entire algorithm is given below, as shown in fig. 2:
describing algorithm steps:
1) providing accurate labeled data (period time information) by the ZLSS method, and mapping it to the input of the MLSS algorithm through the HashMap tracking and lookup mechanism, according to the correspondence on the time axis.
2) using the boundary feature extraction program, extracting the corresponding data-frame features from the original file according to the mapped time data, in preparation for the Co_training classification iterations below.
3) adding the boundary feature information of the correct period positions extracted in the previous step into the MLSS training set for Co_training, and classifying further to obtain new periods.
4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program further maps these onto the time axis consistent with ZLSS before proceeding to the next step.
5) a judgment is made as to whether a new period has been found in the current iteration. If not, the whole iteration process ends; if a new period has been found, the next step is carried out.
6) after the time point information output by the conversion program is obtained, the segmentation method provided by ZLSS is used to further segment the current long-passage speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration.
7) The above steps are repeatedly performed.
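The iteration loop of steps 1) to 7) can be sketched as a short Python loop. This is a minimal sketch with hypothetical stand-in interfaces: `zlss`, `mlss` and `extract_boundary_features` are toy placeholders for the actual subsystems, not the invention's implementation.

```python
def extract_boundary_features(audio, times):
    # Toy stand-in for the boundary feature extraction program:
    # one "feature" per known boundary time, read from the original audio.
    return {t: audio[int(t)] if int(t) < len(audio) else 0.0
            for t in sorted(times)}

def rolling_iteration(audio, text, zlss, mlss):
    """Sketch of the ZLSS/MLSS rolling iteration (hypothetical interfaces).

    zlss(audio, text) -> initial list of certain boundary times (step 1);
    mlss(features, known) -> boundary times proposed by Co_training (step 3).
    """
    known = set(zlss(audio, text))                       # step 1: seed labels
    while True:
        feats = extract_boundary_features(audio, known)  # step 2: features
        new = set(mlss(feats, known)) - known            # steps 3-4: classify
        if not new:                                      # step 5: no new period
            break
        known |= new                                     # step 6: re-segment
    return sorted(known)                                 # final boundary set
```

The loop terminates exactly as step 5) prescribes: the first iteration in which MLSS proposes no period beyond the already known ones ends the process.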
The HashMap tracking and lookup mechanism. One point to note here: in the ZLSS method, during each iteration the original text and speech are segmented into relatively smaller paragraphs or sentences according to the sentence positions found; the time information of the original text and speech is discarded, while only the new sentence information found in the current iteration (times relative to the smaller paragraphs) is retained. Therefore, a HashMap-based tracking and lookup mechanism is adopted to map all the correct sentence time information found back onto the initial time axis, in preparation for the next classification iteration. As shown in fig. 3:
Fig. 3 shows three iterative track-and-lookup passes performed on a given piece of audio data. A total of three iterations are performed, each finding a new period, corresponding in sequence to sentence points I, II and III. At each iteration, however, the system replaces the current corpus with the paragraphs or sentences segmented by the newly found period; in other words, a period found in the current iteration loses its position in the original audio file, which makes the later classification detection troublesome. Therefore, we need to extract the feature parameters of the data frames at the original audio positions corresponding to the newly found sentences, to add to the training set for classification detection.
As shown in the figure, when the third iteration finds a new period, how is it located in the original audio? The position of the previous iteration's sentence point relative to the original audio is looked up via the hash table, and in the same way the positions of earlier sentence points are traced back, so that the new sentence points found in each iteration are mapped in sequence to their corresponding positions in the original audio file. The shaded portions in the diagram represent sentences that have been correctly segmented and removed from the current corpus. Fig. 2 shows a schematic diagram of the fully automatic sentence segmentation:
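The back-tracking idea can be sketched in a few lines, assuming each segment records its parent segment and its start offset within that parent; the dictionary plays the role of the HashMap, and the segment identifiers below are hypothetical.

```python
def track_back(boundary, parents):
    """Map a segment-local boundary time back to the original time axis.

    `parents` maps a segment id to (parent_segment_id, offset), where
    `offset` is the segment's start time inside its parent; the original
    audio has parent None. Following the chain accumulates the offsets,
    just as the HashMap lookup traces a period found in iteration N back
    through iterations N-1, N-2, ... to the initial time axis.
    """
    seg, t = boundary
    while seg is not None:
        parent, offset = parents[seg]
        t += offset
        seg = parent
    return t
```

For example, a period found 1.5 s into a segment that starts 3.0 s into a parent segment, which itself starts 10.0 s into the original audio, maps back to 14.5 s on the initial time axis.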
practical feasibility of maximized accurate full-automatic labeling algorithm
Because the speech studied by the invention all has well-matched text, the spectral parameters of the speech signal can be forcibly aligned with the corresponding text through a Viterbi algorithm based on the HTK toolkit, yielding the boundary information of speech segments (HMM sequences) on the basis of automatic phoneme segmentation. However, phonetic phenomena are very rich, especially in Chinese speech, and cannot all be represented by modeling with spectral parameters alone. The segmentation result may therefore exhibit the following problems: (1) for some boundary types, the automatic segmentation result may be offset from the natural manual segmentation; (2) actual speech may contain mis-pronunciation. Current methods address these problems by manual adjustment and proofreading, which obviously contradicts the aims of the present invention.
To address these problems, the idea of combining the spectral-parameter-based HMM sentence segmentation algorithm with the prosodic-parameter-based sentence segmentation method is adopted to maximize the accurate label set. The fusion of the two methods is feasible because they share the same time scale: a mutual iteration scheme can be established on the same time axis, in which the two methods serve as each other's input and output in rolling iteration.
HMM (hidden Markov model) label-free automatic sentence segmentation algorithm based on spectrum parameters
Introduction of segmentation principle of ZLSS algorithm
This algorithm automatically forms an accurate initial label set in the system. In this algorithm, we treat the silence at a sentence boundary as an independent phoneme (sil). First, a hidden Markov model of each phoneme in the speech is trained through an HMM-based unsupervised method and the Flat-start training algorithm, and the phoneme sequence of the passage is aligned with its text through Viterbi forced alignment. The sentences are then segmented according to sentence-ending symbols in the text (e.g., periods, question marks, exclamation marks, semicolons). Finally, a strict checking mechanism judges whether each segmented sentence is correct, yielding a small correct boundary label set.
Error detection mechanism
The checking mechanism is introduced because, when long speech data is forcibly aligned with its phoneme sequence, the computational complexity of the Viterbi algorithm and its storage consumption grow with the data length, while the alignment quality tends to degrade and alignment errors may even occur. To solve this problem, we add this checking mechanism to further ensure that all found periods are correct.
After the passage speech is Viterbi-aligned with the passage's phoneme sequence (obtained by dictionary lookup), the position of each phoneme in the speech is determined. First, the position of a period in the audio file can be preliminarily determined from the alignment result, together with the adjacent phoneme information and time information. After the phonemes preliminarily determined to precede and follow the period's sil phoneme are obtained, the system's monophone recognizer is used to recognize these adjacent phonemes. The recognition result is compared with the phonemes at the corresponding text positions: if the recognition results of both phonemes match the text, the period is confirmed as correct; otherwise, it is considered an erroneous period or not a period at all. Fig. 4 gives a sample illustration of the checking mechanism:
In the example of fig. 4A, the last phoneme of the previous sentence and the first phoneme of the next sentence are ang and er respectively, and the recognition results are likewise ang and er, indicating a correct period. In fig. 4B, the last phoneme of the previous sentence is recognized incorrectly, so we consider the Viterbi decoding to contain errors and do not accept the period as correct.
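The acceptance rule of fig. 4 can be stated compactly. The sketch below assumes the recognizer's outputs for the two phonemes flanking the sil are already available as strings:

```python
def check_period(text_phones, recognized_phones):
    """A candidate sil is accepted as a correct period only when the
    recognized phonemes on both sides of it match the phonemes at the
    corresponding positions in the text (fig. 4A); any mismatch is
    treated as a Viterbi decoding error (fig. 4B)."""
    (t_prev, t_next) = text_phones
    (r_prev, r_next) = recognized_phones
    return t_prev == r_prev and t_next == r_next
```

With the fig. 4 example, `check_period(("ang", "er"), ("ang", "er"))` accepts the period, while a mismatched previous phoneme rejects it.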
Automatic sentence segmentation iterative algorithm
In this algorithm, we introduce an iterative procedure. First, the passage speech is segmented into paragraph speech and sentence speech according to the correct, verified sil given by the above checking mechanism. Then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the previous iteration, i.e., whether a new correct sil has been found; if so, the previous speech and text are replaced with the new result, the HMMs are retrained, and the iteration continues; if there is no increase, the iteration process is over.
The rationale for this iterative process is that, as iteration proceeds, new periods are continuously found while the original long speech and text are segmented into smaller paragraphs and sentences at the found periods, so that more accurate Chinese phoneme HMMs can be trained, which in turn find more new periods, until no more can be found and the iteration ends.
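The stopping criterion (iterate while the sentence/paragraph count grows) can be sketched as follows; `train_hmms`, `align_and_check` and `segment` are hypothetical stand-ins for HMM retraining, Viterbi alignment plus the checking mechanism, and the splitting step:

```python
def zlss_iterate(corpus, train_hmms, align_and_check, segment):
    """Sketch of the ZLSS iteration: retrain HMMs on the current corpus,
    find verified sils, re-segment, and stop once the total number of
    sentences and paragraphs no longer increases."""
    count = len(corpus)
    while True:
        hmms = train_hmms(corpus)             # retrain phoneme HMMs
        sils = align_and_check(corpus, hmms)  # align + error check
        corpus = segment(corpus, sils)        # split at verified sils
        if len(corpus) <= count:              # no new correct sil found
            break
        count = len(corpus)
    return corpus
```

Each pass can only split existing pieces further, so the count is non-decreasing and the loop is guaranteed to terminate when no new sil is verified.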
In addition, it should be noted that in the forced alignment algorithm, the segmentation accuracy with boundary error within 20 ms to 50 ms exceeds 95%. The sil phonemes therefore lie essentially near the phoneme segmentation boundaries, and a Sliding Mechanism is added to search for the phonemes before and after each sil near the boundary, so as to improve the detection rate of correct sils. Fig. 5 gives a detailed flow chart of this algorithm.
Method for segmenting minimized labeled sentences based on prosodic features
Principle of minimizing sentence segmentation
The accurate label set provided by the above ZLSS labeling system is not by itself sufficient to construct a speech corpus for synthesizing natural speech. In this section, a sentence segmentation method based on prosodic features and minimal labeling is further studied, following semi-supervised and active learning theory, and is used to automatically expand the accurate label set.
First, classification of vowel/consonant/pause (V/C/P) on audio frame segments (Frame-Clips) is studied using prosodic features. Then, minimal-label sentence boundary detection (finding sentence boundaries within pauses) is realized according to the Co_training and Active Learning frameworks. Finally, exact sentence boundaries are determined through a strict error detection mechanism based on prosodic features (e.g., comparing vowel/consonant/pause ratios between text and audio).
Classification of V/C/P
First, we divide the original audio data into non-overlapping 20 ms frames, then calculate the Energy, Zero-Crossing Rate (ZCR) and pitch frequency (Pitch) of each frame, and then smooth the energy curve and pitch frequency curve.
After the features are extracted, the vowel/consonant/pause classification is carried out on each frame of data according to the three features. The Classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining threshold values for energy
Pauses are an important cue for sentence boundary detection, and the Chinese news broadcast speech environment we use is relatively stable, so an appropriate energy threshold suffices to detect pauses. Because different speech has different characteristics, to determine the energy threshold for news speech, statistics are gathered from the labeled data provided by ZLSS according to the following principle:
I. for the labeled data provided by the labeling system, each labeled item comprises a number of data frames; after the data frames are obtained, the average of their energies is calculated.
II. then, the energy threshold is set to the maximum of all the energy averages.
After this statistical analysis, the energy threshold was set to 0.005.
2) The mean and variance of the zero-crossing rates (MZCR and VZCR) are calculated, with the threshold TZCR for ZCR being defined as:
TZCR=MZCR+0.005VZCR
3) the data frames are V/C/P classified according to the following criteria, FrameType being used to indicate the type of frame:
If ZCR > TZCR, then FrameType is Consonant;
otherwise, if Energy < 0.005, then FrameType is Pause;
otherwise, FrameType is Vowel.
4) performing same-category merging on the V/C/P-classified frames, i.e., treating consecutive data frames of the same category as one variable-length segment of that category. If a short consonant segment (duration less than two frames) lies between two adjacent pause segments, the segments are merged and the middle C is replaced with P.
5) because several vowels may be detected as one large vowel, some vowel segments can be too long; vowels must therefore be segmented to avoid errors in feature calculation. If a vowel segment lasts too long, it is split at its energy valleys; where the energy curve has no valley, we split it into equal-length parts.
Statistics over the V/C/P classification results show that 15 frames is the most frequent vowel duration and the average vowel duration is 16 frames, so 15 frames is taken as the threshold. The division rule is shown in FIG. 6.
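Steps 1) to 4) of the classification can be sketched directly in Python. The threshold values follow the text (energy threshold 0.005, TZCR = MZCR + 0.005·VZCR, consonant runs shorter than two frames absorbed between pauses); the per-frame energy and ZCR arrays are hypothetical inputs, assumed already smoothed:

```python
E_THRESH = 0.005  # energy threshold obtained from the ZLSS statistics

def zcr_threshold(zcrs):
    # TZCR = MZCR + 0.005 * VZCR (mean and variance of the zero-crossing rate)
    mzcr = sum(zcrs) / len(zcrs)
    vzcr = sum((z - mzcr) ** 2 for z in zcrs) / len(zcrs)
    return mzcr + 0.005 * vzcr

def classify_frames(energies, zcrs):
    # Per-frame V/C/P decision from step 3).
    tzcr = zcr_threshold(zcrs)
    types = []
    for e, z in zip(energies, zcrs):
        if z > tzcr:
            types.append("C")        # consonant
        elif e < E_THRESH:
            types.append("P")        # pause
        else:
            types.append("V")        # vowel
    return types

def merge_runs(types):
    # Step 4): merge consecutive frames of the same category into
    # variable-length runs, then absorb a short consonant run (< 2 frames)
    # sitting between two pause runs, replacing the middle C with P.
    runs = []
    for t in types:
        if runs and runs[-1][0] == t:
            runs[-1][1] += 1
        else:
            runs.append([t, 1])
    i = 1
    while i < len(runs) - 1:
        if (runs[i][0] == "C" and runs[i][1] < 2
                and runs[i - 1][0] == "P" and runs[i + 1][0] == "P"):
            runs[i - 1][1] += runs[i][1] + runs[i + 1][1]
            del runs[i:i + 2]
        else:
            i += 1
    return [tuple(r) for r in runs]
```

For instance, the frame sequence P P C P P merges into a single pause run of five frames, because the one-frame consonant between the two pauses is shorter than two frames.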
Feature extraction
FIG. 7 illustrates the features used to describe the context information of a candidate boundary. In FIG. 7, the pause feature of the candidate boundary is combined with the speech-rate feature, and the prosodic features of the vowels adjacent to the candidate boundary are used to represent prosodic changes near the boundary point.
Pause is one of the most important indicators for sentence boundary detection, and prosodic information also plays an important role in sentence segmentation. The Rate of Speech (ROS), which affects the duration of pauses between sentences, is also incorporated into the sentence boundary feature set. We therefore use the pause features, speech rate and prosodic features as three feature sets for distinguishing and detecting sentence boundaries.
We define the ROS as follows:
ROS = n / ∑ di
where n is the number of vowels and di is the duration of the i-th vowel. Pauses and consonants are not included in the calculation of speech rate.
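As a worked check of the definition, ROS is simply the vowel count divided by the total vowel duration:

```python
def rate_of_speech(vowel_durations):
    """ROS = n / sum(d_i): number of vowels divided by their total duration
    (in seconds). Pauses and consonants are excluded from the input."""
    n = len(vowel_durations)
    return n / sum(vowel_durations)
```

For example, three vowels lasting 0.1 s, 0.1 s and 0.2 s give ROS = 3 / 0.4 = 7.5 vowels per second.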
Error detection mechanism
Since the above Co_training-based classification uses maximum entropy classifiers, the correctness of the classification result cannot be fully guaranteed. To avoid unnecessary manual proofreading, the algorithm introduces another error detection mechanism, whose aim is to further ensure the correctness of the binary classifier's results by filtering the true sentence boundaries out of the preliminary classification result set. Its workflow is briefly as follows:
1) periods, question marks and semicolons are taken as sentence boundary identifiers in the text, with a, e, i, o and u representing vowels in the text; the total number of vowels in the text is calculated and recorded as TV. For each boundary in the text, the numbers of vowels from the start of the text to the boundary and from the boundary to the end of the text are calculated and recorded as TP and TS respectively. Several consecutive vowels are treated as one vowel.
2) for each candidate boundary found by the classifier, the total number of vowels AV is calculated based on the V/C/P classification result, and the numbers of vowels from the start of the V/C/P classification result (the audio) to the candidate boundary and from the candidate boundary to the end of the audio are calculated and recorded as AP and AS respectively. Since vowel splitting may affect the vowel counts, the V/C/P classification result used here has category merging applied but does not split variable-length vowel segments.
3) if either |AP/AV − TP/TV| or |AS/AV − TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary. The reason is that although the numbers of vowels may differ greatly, the relative "position" of the same boundary in the text and the audio should be the same, i.e., AP/AV and TP/TV should differ very little; the method thus sorts out real sentence boundaries by this position judgment. As shown in fig. 8 below:
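The position check can be sketched as a small predicate. The 0.015 tolerance comes from the statistics described above, and the second comparison is assumed here to use TS/TV, symmetric to the first:

```python
def is_correct_boundary(tp, ts, tv, ap, as_, av, eps=0.015):
    """Accept a candidate boundary when its relative vowel position in the
    audio matches its relative position in the text on either side:
    |AP/AV - TP/TV| < eps  or  |AS/AV - TS/TV| < eps."""
    return (abs(ap / av - tp / tv) < eps
            or abs(as_ / av - ts / tv) < eps)
```

A boundary at the midpoint of the text (TP/TV = 0.5) is accepted when the audio puts it at AP/AV = 0.49, but a boundary at 0.4 in the text and 0.6 in the audio is rejected.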
Co_training-based minimal-label sentence segmentation algorithm
The algorithm proceeds in four steps: firstly, V/C/P classification is performed on the audio; then, feature extraction is performed on the data frames; next, classifiers are trained and used for classification; finally, the classification results are passed to the checking mechanism to further ensure their correctness. The detailed steps are shown in figure 9 below:
experimental results and data analysis
Experiments on Standard data set (Benchmark) and analysis of results
First, to compare with the automatic sentence segmentation algorithms proposed in the prior art, we performed experiments on the same standard data set (Benchmark) using the same evaluation measures. It should be noted that we cannot guarantee that the total number of sentences obtained after segmenting the selected training corpus equals the total number of sentences in the reference; we only control the total duration to match. A 42-minute corpus is selected as the training set and compared with the prior-art automatic sentence segmentation methods for long corpora, FA-0, SFA-1 and SFA-2. The same synthesis tool, Clustergen, and training corpus (42 minutes of speech and text provided on LibriVox) are used to build the speech synthesis engine; this parameter-statistical speech synthesis method has small model space-time overhead and high flexibility. Then, the long corpus is divided into separate sentences, totalling 653 sentences, using the method provided by the invention (HAZ-SAS).
The quality of synthesized speech can be measured using Mel-Cepstral Distortion (MCD). The segmented sentences are divided into a training set and test sets, which are then fed to the Clustergen synthesis engine for training and synthesis. Finally, the corresponding MCD value is calculated from the synthesis result according to formula (1). By comparing the influence of different segmentation methods on the synthesis result under the same training set, and the differences between synthesis results obtained with the same method under different test sets, it can be clearly seen that the sentence segmentation method adopted by the invention brings an appreciable improvement in speech synthesis quality. The MCD values under different test sets are calculated and compared with the two methods adopted in the prior art, as shown in table 1 below:
TABLE 1 Comparison of MCD values and experimental data for different test sets
Φe in the above table represents the EMMA e-book audio and corresponding text set provided by LibriVox, from which we extracted the corresponding time segments. It should be noted that FA-0 corresponds to the experimental result without any modification of the Viterbi algorithm, while SFA-1 and SFA-2 are experimental results obtained after corresponding modifications to the Viterbi algorithm. The MCD calculation formula is as follows:
MCD = (10 / ln 10) · sqrt( 2 · Σ_{d=1}^{D} (mc_d^(e) - mc_d^(t))^2 )    (1)

wherein mc_d^(e) and mc_d^(t) represent the d-th mel-cepstral feature vector values of the synthesized audio and the original audio, respectively. From the above experimental data we can easily find that, under the same training set, using the sentence segmentation algorithm adopted by the invention, when the test set is 9 sentences with a duration of 4 min, the MCD value is reduced by 0.08 compared with SFA-1. This indicates that the method adopted by the invention locates sentence boundaries more accurately, gives a good improvement in final synthesis quality, and can be applied to the automatic construction of a speech corpus.
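A minimal implementation of the MCD measure used above; this is a sketch of the standard Mel-Cepstral Distortion formula, and the assumptions that the energy term c0 is excluded and that frames are already time-aligned follow common practice rather than the text:

```python
import math

ALPHA = 10.0 / math.log(10.0)   # scaling constant in the standard MCD formula

def mcd_frame(mc_orig, mc_synth):
    """MCD (dB) between one frame of original and synthesized
    mel-cepstral coefficients (energy term c0 excluded beforehand)."""
    sq = sum((o - s) ** 2 for o, s in zip(mc_orig, mc_synth))
    return ALPHA * math.sqrt(2.0 * sq)

def mcd(frames_orig, frames_synth):
    """Average frame-level MCD over an utterance; frames are assumed
    to be already time-aligned (e.g. by the forced alignment step)."""
    vals = [mcd_frame(o, s) for o, s in zip(frames_orig, frames_synth)]
    return sum(vals) / len(vals)

# Identical cepstra give zero distortion.
print(mcd([[1.0, 2.0], [0.5, 0.3]], [[1.0, 2.0], [0.5, 0.3]]))  # 0.0
```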
HAZ-SAS system performance evaluation and experimental analysis
The above experiments were performed on a standard data set, whose experimental data come with relatively clean manual recordings and accurate text correspondence. To test the segmentation performance of the algorithm provided by the invention on an ordinary data set, we also performed the following experiment: Chinese news broadcast speech and corresponding text downloaded from the Internet were used, with a total of 70 paragraphs of speech and 9447 seconds (approximately 2.6 hours). The experiment still used the HTK toolkit for training the HMMs and for the forced alignment step.
Before and after the two sentence segmentation methods are fused, we give, respectively, the sentence boundary detection performance of the ZLSS subsystem and the MLSS subsystem, and the number of correctly segmented sentences output by the complete full-automatic sentence segmentation system. First, table 2 shows the ability of the ZLSS method alone to provide labels after one complete iteration.
We define the sentence segmentation accuracy as:

sentence segmentation accuracy = (number of correctly segmented sentences / total number of sentences) × 100%
TABLE 2 Ability of the ZLSS method alone to provide labels after one complete iteration
Obviously, the ZLSS method alone falls short of the ideal requirement: it cannot provide enough accurately labeled data to quickly construct a Chinese speech corpus for application in the field of speech synthesis. Next, we performed sentence boundary detection performance experiments on the MLSS method described herein; the data are as follows:
TABLE 3 statistical results of classification performance of MLSS on sentence boundaries
For ease of analysis, the boundary classification performance can be plotted as a histogram based on the above results. It is easy to see that the classification performance of the system improves continuously as the labeling system's ability to provide label sets increases. In addition, under the same training set, the buffer size affects the classification performance of the system; as can be seen from fig. 10, the larger the buffer, the higher the classification performance.
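The time-axis correspondence that carries ZLSS boundary times into the MLSS feature extraction (the HashMap tracking and searching mechanism) can be sketched as follows. The 0.05 s quantization tolerance and all identifier names here are illustrative assumptions, not values from the invention:

```python
def build_boundary_index(boundary_times, tol=0.05):
    """Hash the ZLSS sentence-boundary times onto a quantized time grid
    so that a boundary can later be found by approximate timestamp."""
    return {round(t / tol): t for t in boundary_times}

def find_boundary(index, t, tol=0.05):
    """Look up the stored boundary time near t (within about one grid
    step); return None when no ZLSS boundary matches."""
    key = round(t / tol)
    for k in (key - 1, key, key + 1):   # also check adjacent grid cells
        if k in index:
            return index[k]
    return None

# Boundary times (in seconds) produced by an earlier ZLSS pass.
index = build_boundary_index([1.23, 7.80, 15.02])
print(find_boundary(index, 1.25))   # 1.23 (matches within tolerance)
print(find_boundary(index, 5.00))   # None (no boundary near 5 s)
```

The dict lookup makes each time-axis query O(1), which matters when every candidate boundary of a long recording must be matched against the ZLSS output on each iteration.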
The MLSS is then applied to the 42.2% of labeled data provided by ZLSS: the feature information of the corresponding sentence points is extracted, collaborative training (Co-training) is performed, and further iterative classification yields additional correct sentence points. The results are shown in table 4 below:
TABLE 4 Iterative classification results
It can be seen that the MLSS adopted by the invention has good classification performance, and the sentence segmentation accuracy is greatly improved once an error detection mechanism is added. Meanwhile, as the above data also show, the classification performance of the classifier clearly increases with the amount of training data. Finally, we give the complete full-automatic sentence segmentation system; after four iterations, the output results are shown in table 5 below:
TABLE 5 full-automatic sentence segmentation System results
From all the experimental data above it follows that:
1) The improved full-automatic sentence segmentation system has good segmentation accuracy, and, more importantly, the number of correct sentences obtained is greatly increased compared with the original subsystems, with no manual participation anywhere in the process.
2) Meanwhile, in the flat-start training algorithm, shorter speech input can be used to train better HMMs; in the forced alignment process, the Viterbi decoding space is reduced and the alignment accuracy is improved.
3) The labeling data of the whole system are generated automatically, so the system can be regarded as label-free (Zero-labeling), which greatly improves classification efficiency and saves cost.
4) The automatic sentence segmentation algorithm provided by the invention still has room for improvement. For example, when Viterbi forced alignment is performed on a long corpus for the first time, the overall requirements on computer performance are significantly higher than in the prior art, because the iterative algorithm used in the invention performs Viterbi decoding on the entire speech during the first iteration. As such, relatively high demands are placed on both processor performance and memory size.
The invention provides a full-automatic sentence segmentation algorithm based on spectral parameters (Mel-Cepstral Parameters) and prosodic parameters (Prosodic Parameters), which fuses forced alignment under an HMM with the Co-training method from semi-supervised learning, thereby achieving high segmentation accuracy on long speech without any manual intervention. The method can be applied to the rapid automatic construction of a speech corpus.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.