Teaching mode analysis method and system
Technical Field
The invention belongs to the field where teaching activities and artificial intelligence are combined, and particularly relates to a teaching mode analysis method and system.
Background
With the rapid development of education informatization, teaching activities and artificial intelligence technology are becoming ever more closely integrated, yet the teaching evaluation link is still at the stage of traditional manual labeling and counting: it lacks an intelligent strategy as well as convenience, effectiveness and objectivity. Real-time teaching mode analysis can help the teacher giving the lesson reflect promptly on teaching behaviors and teaching methods, and summarize and correct the problems and deficiencies in the teaching link, so that deeper, more direct and more effective teaching activities are carried out, the professional development of the teacher is supported, and teaching quality is improved. Against the background of the education informatization era, combining teaching analysis with artificial intelligence technology solves the problems of the traditional teaching mode analysis method and at the same time serves the aims of improving teachers' professional ability and promoting teaching quality.
Teaching mode analysis is of great significance in the teaching evaluation link. Although many researchers at home and abroad have proposed a series of mature quantitative analysis methods for teacher and student behavior, such as the S-T analysis method, the teacher-student verbal interaction behaviors that must be distinguished in the data processing stage still have to be identified and labeled manually from the classroom teaching audio, and a system tool capable of automatically analyzing the teaching mode is lacking.
In summary, the defects of the existing teaching mode analysis method mainly include the following points:
1) Teaching mode analysis based on traditional measurement and evaluation methods has a rich theoretical basis, but it cannot be popularized because data processing is complex, subjective and time-consuming; as a result, related authoritative scales and evaluation indexes are few and the research is difficult to advance.
2) Teaching mode analysis is costly and highly subjective, and it lacks objective, automatic data processing models and analysis tools. Traditional teaching mode analysis requires comprehensive manual judgment of classroom audio and, in particular, the processing of a large amount of data, which reflects the complexity and difficulty of existing teaching mode analysis.
3) Traditional teaching behavior analysis systems need to call third-party API tools at the speech detection and cutting stage to convert the audio into a JSON file of timestamp nodes, from which the audio time breakpoints are obtained for cutting. This approach cannot guarantee cutting accuracy, is typically charged per use, and is therefore neither economical nor sustainable.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a teaching mode analysis method and a teaching mode analysis system, so as to solve the problems that the conventional teaching mode analysis method requires comprehensive manual judgment of classroom audio, in particular the processing of a large amount of data, and cannot ensure the accuracy and economy of cutting in the voice detection and cutting stage.
In order to achieve the above object, in a first aspect, the present invention provides a teaching mode analysis method, including the following steps:
detecting active sounds in classroom audio, and marking the starting time and the ending time of each section of active sound; cutting the teaching audio according to the starting time and the ending time of each section of active sound to obtain a plurality of sections of active audio; an active sound refers to non-silent audio;
extracting different speaker characteristics and the time lengths of different speakers in each section of active audio based on the combined Mel frequency cepstrum characteristic MFCC vector; the combined MFCC vector is obtained by transversely splicing an MFCC, a first-order differential MFCC vector and a second-order differential MFCC vector;
respectively judging the characteristics of different speakers as teacher speaking and student speaking based on a pre-trained universal background model UBM, and determining the corresponding speaking time of the teacher and the speaking time of the students; the pre-trained UBM can fit the characteristics of different speakers, including teachers and students;
judging whether the teaching mode of the classroom is an exercise classroom, a lecture classroom or a mixed classroom according to the proportion of the speaking time of the teacher to the total classroom time; when the ratio of the speaking time of the teacher to the total classroom time is lower than a first threshold, the teaching mode is an exercise classroom; when the ratio of the speaking time of the teacher to the total classroom time is greater than a second threshold, the teaching mode is a lecture classroom; otherwise, the teaching mode is considered as a mixed classroom; the first threshold is less than a second threshold.
In an alternative embodiment, a Gaussian mixture model GMM is used to detect both spoken and non-spoken portions of classroom audio; wherein, the speaking part is active sound, and the non-speaking part is mute.
In an optional embodiment, the classifying of the characteristics of different speakers as teacher speech or student speech based on the pre-trained universal background model UBM specifically comprises:
extracting different speaker characteristics of a plurality of real classroom audios based on the combined MFCC vector, and training a UBM based on the different speaker characteristics of the real classroom audios; the UBM can fit the characteristics of a large number of speakers, the characteristic data of a target speaker is scattered around the Gaussian distribution of the UBM, and each Gaussian distribution in the UBM is shifted to the characteristic data of the target speaker through a MAP adaptive algorithm;
extracting collected teacher voice fragments of a plurality of real classroom audios based on the combined MFCC vector, and training a corresponding teacher GMM model on the basis of UBM; the teacher GMM model is a Gaussian mixture model trained by features extracted from audio of a teacher and used for simulating continuous probability distribution of voice vector features of the teacher;
and scoring the different speaker characteristics against the teacher GMM model through the scoring method provided with the GMM and the UBM; if the score is higher than a preset threshold, the corresponding speaker characteristic is judged to be a teacher utterance, otherwise it is judged to be a student utterance.
In an optional embodiment, the feature data of the target speaker is scattered around the gaussian distribution of UBM, and each gaussian distribution in UBM is shifted to the feature data of the target speaker by a MAP adaptive algorithm, specifically:
calculating the similarity Pr(i|x_t) between each feature vector x_t in the target speaker feature data vector set X = (x_1, x_2, …, x_T) and the i-th Gaussian component:

Pr(i|x_t) = ω_i p_i(x_t) / Σ_{j=1}^{M} ω_j p_j(x_t)

wherein x_t represents the feature vector of the target speaker at time t, ω_i represents the weight corresponding to the i-th Gaussian mixture component, p_i(x_t) represents the probability score of the vector x_t with respect to the i-th UBM mixture component, M represents the number of Gaussian mixture components, ω_j represents the weight corresponding to the j-th mixture component, and p_j(x_t) represents the probability score of x_t with respect to the j-th UBM mixture component;

obtaining the mean E_i(x) and variance E_i(x²) parameters of a new universal background model UBM according to the similarity:

n_i = Σ_{t=1}^{T} Pr(i|x_t)
E_i(x) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t
E_i(x²) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t²

wherein n_i represents the number of frames of the target speech belonging to the i-th Gaussian mixture component;

and fusing the new parameters obtained in the previous step with the original parameters of the UBM model to obtain the final GMM Gaussian mixture model of the target speaker:

ω̂_i = [α_i^ω n_i / T + (1 − α_i^ω) ω_i] γ
μ̂_i = α_i^m E_i(x) + (1 − α_i^m) μ_i
σ̂_i² = α_i^v E_i(x²) + (1 − α_i^v) (σ_i² + μ_i²) − μ̂_i²

wherein α_i^ω represents the weight correction factor of the Gaussian component of the universal background model, α_i^m represents the mean correction factor of the Gaussian component of the universal background model, α_i^v represents the variance correction factor of the Gaussian component of the universal background model, μ_i represents the mean of the universal background model before the parameters are updated, μ̂_i represents the mean of the universal background model after the parameters are updated, ω̂_i represents the weight of the universal background model after the parameters are updated, σ̂_i² represents the variance of the universal background model after the parameters are updated, T represents the number of frames of the training speech, and γ is a relation factor that constrains the scale of change of the correction factors so that all mixture weights sum to 1; α_i^ω, α_i^m and α_i^v are used to adjust the influence of the new parameters on the UBM so that each Gaussian distribution in the UBM shifts toward the target speaker characteristics.
In an alternative embodiment, the judgment result of the teaching mode is visualized through a PyQt5 interactive visualization GUI design tool; the visual result comprises a classroom utterance timing diagram and a classroom utterance distribution diagram;
in the class speaking timing sequence diagram, the horizontal axis represents the duration of a class in minutes, and the vertical axis represents the speaking duration of a teacher or a student in each minute of the class in seconds;
in the class utterance profile, the total time and respective proportion of the teacher utterance, the student utterance, and silence of the whole class are shown in the form of a pie chart.
In a second aspect, the present invention provides a teaching mode analysis system, comprising:
the classroom audio detection unit is used for detecting active sounds in classroom audio and marking the starting time and the ending time of each section of active sound; cutting the teaching audio according to the starting time and the ending time of each section of active sound to obtain a plurality of sections of active audio; an active sound refers to non-silent audio;
the speaker characteristic extraction unit is used for extracting different speaker characteristics and the time lengths of different speakers in each section of active audio based on the combined Mel cepstrum characteristic MFCC vector; the combined MFCC vector is obtained by transversely splicing an MFCC, a first-order differential MFCC vector and a second-order differential MFCC vector;
the speaking duration determining unit is used for respectively judging the characteristics of different speakers as teacher speaking and student speaking based on a pre-trained universal background model UBM and determining the corresponding teacher speaking duration and student speaking duration; the pre-trained UBM can fit the characteristics of different speakers, including teachers and students;
the teaching mode judging unit is used for judging whether the teaching mode of the classroom is an exercise classroom, a lecture classroom or a mixed classroom according to the proportion of the speaking time of the teacher to the total classroom time; when the ratio of the speaking time of the teacher to the total classroom time is lower than a first threshold, the teaching mode is an exercise classroom; when the ratio of the speaking time of the teacher to the total classroom time is greater than a second threshold, the teaching mode is a lecture classroom; otherwise, the teaching mode is considered as a mixed classroom; the first threshold is less than a second threshold.
In an optional embodiment, the classroom audio detection unit adopts a Gaussian mixture model GMM to detect an utterance part and a non-utterance part in classroom audio; wherein, the speaking part is active sound, and the non-speaking part is mute.
In an alternative embodiment, the speaker feature extraction unit extracts different speaker features of the collected multiple real classroom audios based on the combined MFCC vectors;
the speaking duration determining unit trains UBMs based on different speaker characteristics of a plurality of real classroom audios; the UBM can fit the characteristics of a large number of speakers, the characteristic data of a target speaker is scattered around the Gaussian distributions of the UBM, and each Gaussian distribution in the UBM is shifted to the characteristic data of the target speaker through a MAP adaptive algorithm; extracting collected teacher voice fragments of a plurality of real classroom audios based on the combined MFCC vector, and training a corresponding teacher GMM model on the basis of the UBM; the teacher GMM model is a Gaussian mixture model trained with features extracted from the teacher's audio and used for simulating the continuous probability distribution of the teacher's voice feature vectors; and scoring the different speaker characteristics against the teacher GMM model through the scoring method provided with the GMM and the UBM; if the score is higher than a preset threshold, the corresponding speaker characteristic is judged to be a teacher utterance, otherwise it is judged to be a student utterance.
In an alternative embodiment, the utterance duration determining unit calculates the similarity Pr(i|x_t) between each feature vector x_t in the target speaker feature data vector set X = (x_1, x_2, …, x_T) and the i-th Gaussian component:

Pr(i|x_t) = ω_i p_i(x_t) / Σ_{j=1}^{M} ω_j p_j(x_t)

wherein x_t represents the feature vector of the target speaker at time t, ω_i represents the weight corresponding to the i-th Gaussian mixture component, p_i(x_t) represents the probability score of the vector x_t with respect to the i-th UBM mixture component, M represents the number of Gaussian mixture components, ω_j represents the weight corresponding to the j-th mixture component, and p_j(x_t) represents the probability score of x_t with respect to the j-th UBM mixture component;

obtains the mean E_i(x) and variance E_i(x²) parameters of a new universal background model UBM according to the similarity:

n_i = Σ_{t=1}^{T} Pr(i|x_t)
E_i(x) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t
E_i(x²) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t²

wherein n_i represents the number of frames of the target speech belonging to the i-th Gaussian mixture component;

and fuses the new parameters obtained in the previous step with the original parameters of the UBM model to obtain the final GMM Gaussian mixture model of the target speaker:

ω̂_i = [α_i^ω n_i / T + (1 − α_i^ω) ω_i] γ
μ̂_i = α_i^m E_i(x) + (1 − α_i^m) μ_i
σ̂_i² = α_i^v E_i(x²) + (1 − α_i^v) (σ_i² + μ_i²) − μ̂_i²

wherein α_i^ω represents the weight correction factor of the Gaussian component of the universal background model, α_i^m represents the mean correction factor of the Gaussian component of the universal background model, α_i^v represents the variance correction factor of the Gaussian component of the universal background model, μ_i represents the mean of the universal background model before the parameters are updated, μ̂_i represents the mean of the universal background model after the parameters are updated, ω̂_i represents the weight of the universal background model after the parameters are updated, σ̂_i² represents the variance of the universal background model after the parameters are updated, T represents the number of frames of the training speech, and γ is a relation factor that constrains the scale of change of the correction factors so that all mixture weights sum to 1; α_i^ω, α_i^m and α_i^v are used to adjust the influence of the new parameters on the UBM so that each Gaussian distribution in the UBM shifts toward the target speaker characteristics.
In an optional embodiment, the system further comprises: the visualization unit is used for visualizing the judgment result of the teaching mode through a PyQt5 interactive visualization GUI design tool; the visual result comprises a classroom utterance timing diagram and a classroom utterance distribution diagram; in the class speaking timing sequence diagram, the horizontal axis represents the duration of a class in minutes, and the vertical axis represents the speaking duration of a teacher or a student in each minute of the class in seconds; in the class utterance profile, the total time and respective proportion of the teacher utterance, the student utterance, and silence of the whole class are shown in the form of a pie chart.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention provides a teaching mode analysis method and a teaching mode analysis system. The voice activity detection algorithm and the speaker recognition algorithm are utilized to detect, cut and recognize teaching audio, the recognition and analysis result is visualized through a PyQt5 interactive visual GUI design tool, the analysis result can help teachers and students to find the interactive frequency of words in the classroom teaching process, the teaching method is improved, and the teaching effect is improved.
The invention provides a teaching mode analysis method and system in which, on the basis of the traditional GMM-UBM speaker recognition model, a VAD voice activity detection algorithm is used during preprocessing of the classroom audio to detect speech and non-speech and obtain the start and end timestamps of each segment, according to which the classroom audio is cut. This yields a method that combines artificial intelligence technology with a visualizable teaching mode: by importing the classroom audio, the distribution of teacher and student utterances in the class and the curve of their utterance changes can be displayed intuitively.
The invention provides a teaching mode analysis method and system that present a classroom utterance timing diagram, with lesson time on the horizontal axis and the teacher's or students' speaking duration per unit time on the vertical axis, as well as a pie chart of the proportions of teacher and student utterances in the class.
Drawings
FIG. 1 is a flow chart of a teaching mode analysis method provided by an embodiment of the present invention;
FIG. 2 is a flow chart of teaching mode analysis provided by an embodiment of the present invention;
FIG. 3 is a flow chart of speaker identification provided by an embodiment of the present invention;
FIG. 4 is a timing diagram of classroom utterances provided by embodiments of the present invention;
FIG. 5 is a classroom utterance profile provided by an embodiment of the present invention;
fig. 6 is an architecture diagram of a teaching mode analysis system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to detect, cut and identify speech and non-speech in classroom teaching audio based on a Voice Activity Detection (VAD) algorithm and a Gaussian mixture-universal background model (GMM-UBM), and to classify the recognition results into three identities, Q (silence), S (student utterance) and T (teacher utterance), thereby automatically analyzing the classroom teaching mode; the final result is displayed as a pie distribution chart and a curve chart of teacher utterances, student utterances and silence.
Fig. 1 is a flowchart of a teaching mode analysis method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
s101, detecting active sounds in classroom audio, and marking the starting time and the ending time of each section of active sound; cutting the teaching audio according to the starting time and the ending time of each section of active sound to obtain a plurality of sections of active audio; an active sound refers to non-silent audio;
s102, extracting different speaker characteristics and durations of different speakers in each section of active audio based on the combined Mel frequency cepstrum characteristics MFCC vector; the combined MFCC vector is obtained by transversely splicing an MFCC, a first-order differential MFCC vector and a second-order differential MFCC vector;
s103, respectively judging the characteristics of different speakers as teacher speaking and student speaking based on a pre-trained universal background model UBM, and determining the corresponding teacher speaking duration and student speaking duration; the pre-trained UBM can fit the characteristics of different speakers, including teachers and students;
s104, judging whether the teaching mode of the classroom is an exercise classroom, a lecture classroom or a mixed classroom according to the proportion of the speaking time of the teacher to the total classroom time; when the ratio of the speaking time of the teacher to the total classroom time is lower than a first threshold, the teaching mode is an exercise classroom; when the ratio of the speaking time of the teacher to the total classroom time is greater than a second threshold, the teaching mode is a lecture classroom; otherwise, the teaching mode is considered as a mixed classroom; the first threshold is less than a second threshold.
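As an illustration of step S104, a minimal Python sketch of the decision rule is given below; the concrete threshold values (0.3 and 0.7) and the function name are assumptions chosen for the example, not values fixed by this embodiment.

```python
def classify_teaching_mode(teacher_seconds: float,
                           total_seconds: float,
                           low_threshold: float = 0.3,    # assumed first threshold
                           high_threshold: float = 0.7) -> str:  # assumed second threshold
    """Return the teaching mode from the teacher-talk ratio, as in step S104."""
    ratio = teacher_seconds / total_seconds
    if ratio < low_threshold:
        return "exercise classroom"
    if ratio > high_threshold:
        return "lecture classroom"
    return "mixed classroom"
```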
The teaching mode analysis method of the invention is divided into three parts: voice activity detection and segmentation, speaker recognition, and classroom pattern analysis visualization. The speaker recognition process comprises three steps: feature extraction, model building and result prediction. The overall process flow is shown in Fig. 2. The classroom teaching audio is first imported into the teaching mode analysis system; the classroom audio is detected and cut using the voice activity detection algorithm; speaker recognition is carried out on each audio segment; the segments are divided into three categories, silence, teacher utterances and student utterances, according to the recognition results; and finally the teaching mode analysis results are visualized.
1. VAD voice activity detection algorithm
The VAD detection algorithm detects the active part and the silent part of the audio frame by frame, marks active sound as 1 and silence as 0, and records the start time and end time. In the algorithmic implementation an active-speech collector is designed using a padded sliding window over the audio frames: the collector triggers and starts producing audio frames when more than 90% of the frames in the window are voiced, and keeps collecting until more than 90% of the frames in the window are unvoiced, at which point the segment ends. The timestamps are then stored in a txt file, and a list of start and end timestamps is generated for use in the subsequent audio cutting and result visualization stages.
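The following Python sketch illustrates such a padded sliding-window collector over per-frame voiced/unvoiced flags. The frame length, window size and helper name are illustrative assumptions; only the 90% trigger ratio follows the description above.

```python
from collections import deque

def collect_active_segments(frame_flags, frame_ms=30, window=10, ratio=0.9):
    """Turn per-frame voiced flags (1/0) into (start_s, end_s) active-sound segments."""
    ring = deque(maxlen=window)            # padded sliding window over recent frames
    triggered = False
    segments, start = [], 0.0
    for i, voiced in enumerate(frame_flags):
        ring.append(voiced)
        t = i * frame_ms / 1000.0          # time of the current frame in seconds
        if not triggered:
            # trigger when more than `ratio` of the frames in the window are voiced
            if sum(ring) > ratio * window:
                start = max(t - (len(ring) - 1) * frame_ms / 1000.0, 0.0)
                triggered = True
        else:
            # close the segment when more than `ratio` of the window is unvoiced
            if (len(ring) - sum(ring)) > ratio * window:
                segments.append((start, t))
                triggered = False
                ring.clear()
    if triggered:                          # audio ended while still inside a segment
        segments.append((start, len(frame_flags) * frame_ms / 1000.0))
    return segments
```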
1.1 Active sound detection
Active sound detection typically employs a Gaussian mixture model (GMM), which is essentially a linear superposition of multiple single Gaussian models. In general, the signal distributions of the spoken and non-spoken segments in speech can each be represented by a weighted superposition of M Gaussian models:

P(O_t | λ) = Σ_{i=1}^{M} ω_i P_i(O_t)

In the formula, the probability distribution function P_i(O_t) of each single Gaussian model is:

P_i(O_t) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp(−(1/2) (O_t − μ_i)^T Σ_i^{−1} (O_t − μ_i))

wherein D is the feature dimension, and μ_i and Σ_i are respectively the mean vector and covariance matrix of the i-th Gaussian model; together with the Gaussian mixture number M and the weight ω_i of each component, they form the Gaussian mixture model:

λ = {M, ω_i, μ_i, Σ_i}
and establishing respective Gaussian mixture models for the spoken and non-spoken parts in the classroom audio according to the algorithm, and then carrying out frame-by-frame detection on the whole audio and judging the similarity of the whole audio and the generated spoken and non-spoken models, thereby achieving the effect of distinguishing the spoken and non-spoken parts. And dividing the utterances (1) and the non-utterances (0) in the whole classroom audio frame by frame through a voice activity detection algorithm.
1.2 Audio cutting
Audio cutting determines the cutting boundaries by detecting speech transition points: the whole lesson audio is cut according to the start/end timestamp file obtained by the VAD voice activity detection algorithm, yielding the cut audio segments and the start and end time nodes corresponding to each segment.
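A possible implementation of the cutting step is sketched below, assuming the soundfile package and a list of (start, end) timestamps in seconds produced by the VAD stage; the file-naming scheme is an assumption.

```python
import soundfile as sf

def cut_segments(audio_path, segments, out_prefix="segment"):
    """Slice the full-lesson audio at the VAD start/end timestamps (in seconds)."""
    audio, sr = sf.read(audio_path)
    paths = []
    for k, (start, end) in enumerate(segments):
        clip = audio[int(start * sr):int(end * sr)]
        path = f"{out_prefix}_{k:03d}_{start:.2f}_{end:.2f}.wav"
        sf.write(path, clip, sr)          # one wav file per active-sound segment
        paths.append(path)
    return paths
```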
2. Speaker recognition
Fig. 3 is a flowchart of speaker recognition according to an embodiment of the present invention, as shown in fig. 3, including the following steps:
2.1 feature extraction
After the classroom audio is cut, features are extracted for each audio segment separately. Feature extraction uses the Mel cepstrum feature MFCC, which best matches the auditory characteristics of the human ear; the MFCC, the first-order difference MFCC and the second-order difference MFCC are spliced transversely into a 39-dimensional feature vector, so that the speaker's characteristics are retained to the greatest extent and better feature extraction is achieved.
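One way to obtain the 39-dimensional combined feature is sketched below with librosa; the choice of library and of 13 base coefficients are assumptions, while the transverse splicing of MFCC, first-order and second-order differences follows the description above.

```python
import numpy as np
import librosa

def combined_mfcc(wav_path, n_mfcc=13):
    """Return the combined MFCC features as a (frames x 39) matrix."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (13, frames)
    d1 = librosa.feature.delta(mfcc, order=1)                # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)                # second-order difference
    # transverse splicing: stack the three 13-dim blocks into one 39-dim vector per frame
    return np.vstack([mfcc, d1, d2]).T
```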
2.2 model training
Ten collected real classroom audio recordings (each lesson lasts 45 minutes and has 30-35 students and one teacher, which meets the data volume requirement of a universal background model) are preprocessed, Mel cepstrum features are extracted, and a universal background model (UBM) is trained. Note that increasing the number of non-target training sets when training the universal background model improves the training effect and the generalization ability of the model.
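A minimal UBM-training sketch, again using scikit-learn's GaussianMixture as a stand-in for the UBM; the mixture count of 64, the diagonal covariances and the iteration limit are illustrative assumptions rather than values fixed by this embodiment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(feature_matrices, n_components=64):
    """Pool 39-dim frames from many lessons and speakers and fit a diagonal-covariance UBM."""
    pooled = np.vstack(feature_matrices)          # all lessons, all speakers
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=200).fit(pooled)
    return ubm
```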
In the Gaussian mixture-general background model, the UBM can fit the characteristics of a large number of speakers, the characteristic data of a target speaker is scattered around some Gaussian distributions of the UBM, and each Gaussian distribution in the UBM is shifted to target user data through a MAP adaptive algorithm. The specific calculation method is as follows:
calculating the similarity between each feature vector x_t in the training vector set X = (x_1, x_2, …, x_T) and the i-th Gaussian component:

Pr(i|x_t) = ω_i p_i(x_t) / Σ_{j=1}^{M} ω_j p_j(x_t)

then updating the weight, mean and variance parameters according to the similarity:

n_i = Σ_{t=1}^{T} Pr(i|x_t)
E_i(x) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t
E_i(x²) = (1/n_i) Σ_{t=1}^{T} Pr(i|x_t) x_t²

and fusing the updated parameters obtained in the previous step with the UBM parameters to obtain the final target speaker model:

ω̂_i = [α_i^ω n_i / T + (1 − α_i^ω) ω_i] γ
μ̂_i = α_i^m E_i(x) + (1 − α_i^m) μ_i
σ̂_i² = α_i^v E_i(x²) + (1 − α_i^v) (σ_i² + μ_i²) − μ̂_i²

wherein the adaptive parameters α_i^ω, α_i^m and α_i^v are used to adjust the influence of the new parameters and the UBM parameters on the final model.
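The MAP update above can be prototyped with NumPy on top of a scikit-learn diagonal-covariance UBM, as in the following sketch; the relevance value used to form the correction factors and the explicit weight renormalisation playing the role of γ are assumptions of the example.

```python
import numpy as np

def map_adapt(ubm, X, relevance=16.0):
    """MAP-adapt a diagonal-covariance UBM to the target frames X (T x D).

    Returns the adapted weights, means and variances following the
    relevance-MAP form of the update sketched above.
    """
    T = X.shape[0]
    post = ubm.predict_proba(X)                    # Pr(i | x_t), shape (T, M)
    n = post.sum(axis=0) + 1e-10                   # n_i
    Ex = (post.T @ X) / n[:, None]                 # E_i(x)
    Ex2 = (post.T @ (X ** 2)) / n[:, None]         # E_i(x^2)

    alpha = n / (n + relevance)                    # correction factors, one per mixture
    w = alpha * n / T + (1 - alpha) * ubm.weights_
    w = w / w.sum()                                # gamma: renormalise weights to sum to 1
    mu = alpha[:, None] * Ex + (1 - alpha[:, None]) * ubm.means_
    var = (alpha[:, None] * Ex2
           + (1 - alpha[:, None]) * (ubm.covariances_ + ubm.means_ ** 2)
           - mu ** 2)
    return w, mu, var
```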
Teacher voice segments of the target classroom are extracted and a corresponding teacher GMM model is trained on the basis of the UBM. Each audio segment is then examined in turn: if the segment is non-speech it is skipped; otherwise its features are extracted and scored against the teacher GMM model using the scoring method provided with the GMM and UBM. If the score is higher than the set threshold, the segment is a teacher utterance; otherwise it is judged to be a student utterance.
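A sketch of the scoring step: the MAP-adapted parameters are wrapped back into a GaussianMixture and each segment's features are scored against the teacher model and the UBM via an average log-likelihood ratio. The helper names and the threshold of 0 are assumptions; the embodiment's actual threshold is chosen experimentally.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_gmm(weights, means, covars):
    """Wrap MAP-adapted parameters into a GaussianMixture so it can be scored."""
    gmm = GaussianMixture(n_components=len(weights), covariance_type="diag")
    gmm.weights_, gmm.means_, gmm.covariances_ = weights, means, covars
    gmm.precisions_cholesky_ = 1.0 / np.sqrt(covars)   # required for scoring
    return gmm

def label_segment(feats, teacher_gmm, ubm, threshold=0.0):
    """Average log-likelihood ratio against the UBM; above threshold => teacher."""
    score = teacher_gmm.score(feats) - ubm.score(feats)
    return "teacher" if score > threshold else "student"
```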
3. Result visualization
A window program is designed with PyQt5, an interactive visual GUI design tool for Python, to visualize the teaching mode analysis results; the results are displayed as a classroom utterance timing diagram and a classroom utterance distribution diagram. The two mode analysis graphs show the analysis results of a high-quality primary school Chinese lesson: the lesson is a teacher-guided classroom, the classroom atmosphere is lively, and teacher-student interaction is strong.
As shown in fig. 4, in the class speaking timing chart, the horizontal axis represents the duration of a class in minutes, and the vertical axis represents the speaking duration of the teacher or the student in the ith minute of the target class in seconds, so that the speaking situation of the teacher or the student in a certain time period can be intuitively and clearly observed in the visualization mode.
As shown in fig. 5, in the class speech distribution diagram, the total time and percentage of the teacher speech, the student speech and the silence of the whole class are displayed in the form of a pie chart, so that the whole degree of participation of the teacher and the students in the whole class can be grasped and recognized at a glance, and a series of subsequent teaching analyses can be facilitated.
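For illustration, the two result views can be drawn with matplotlib as below; the embodiment embeds them in a PyQt5 window, so the matplotlib calls, function name and data layout here are assumptions used only to show the intended charts.

```python
import matplotlib.pyplot as plt

def plot_results(minutes, teacher_sec, student_sec, totals):
    """Draw the utterance timing chart (cf. Fig. 4) and the distribution pie chart (cf. Fig. 5)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
    ax1.plot(minutes, teacher_sec, label="teacher")
    ax1.plot(minutes, student_sec, label="student")
    ax1.set_xlabel("lesson time (min)")
    ax1.set_ylabel("talk time per minute (s)")
    ax1.legend()
    ax2.pie([totals["teacher"], totals["student"], totals["silence"]],
            labels=["teacher talk", "student talk", "silence"],
            autopct="%1.1f%%")
    plt.tight_layout()
    plt.show()
```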
Fig. 6 is an architecture diagram of a teaching mode analysis system according to an embodiment of the present invention, as shown in fig. 6, including:
the classroomaudio detection unit 610 is used for detecting active voices in classroom audio and marking the starting time and the ending time of each active voice; cutting the teaching audio according to the starting time and the ending time of each section of active audio to obtain a plurality of sections of active audio; the active tone refers to non-silent audio;
a speaker feature extraction unit 620, configured to extract different speaker features and durations of different speakers in each segment of active audio based on the combined mel-frequency cepstrum feature MFCC vector; the combined MFCC vector is obtained by transversely splicing an MFCC, a first-order differential MFCC vector and a second-order differential MFCC vector;
the speaking duration determining unit 630 is used for respectively judging the characteristics of different speakers as teacher speaking and student speaking based on the pre-trained universal background model UBM, and determining the corresponding teacher speaking duration and student speaking duration; the pre-trained UBM can fit the characteristics of different speakers, including teachers and students;
a teaching mode judging unit 640, configured to judge that the teaching mode of the classroom is an exercise classroom, a lecture classroom, or a mixed classroom according to a ratio of the teacher speaking time to the total classroom time; when the ratio of the speaking time of the teacher to the total classroom time is lower than a first threshold, the teaching mode is an exercise classroom; when the ratio of the speaking time of the teacher to the total classroom time is greater than a second threshold, the teaching mode is a lecture classroom; otherwise, the teaching mode is considered as a mixed classroom; the first threshold is less than a second threshold.
The visualization unit 650 is used for visualizing the judgment result of the teaching mode through a PyQt5 interactive visualization GUI design tool; the visual result comprises a classroom utterance timing diagram and a classroom utterance distribution diagram; in the class speaking timing sequence diagram, the horizontal axis represents the duration of a class in minutes, and the vertical axis represents the speaking duration of a teacher or a student in each minute of the class in seconds; in the class utterance profile, the total time and respective proportion of the teacher utterance, the student utterance, and silence of the whole class are shown in the form of a pie chart.
Specifically, the functions of each unit in fig. 6 can be referred to the description in the foregoing method embodiment, and are not described herein again.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.