CN115132223B - A method and system for enhancing audio data annotation accuracy based on time-frequency enhancement - Google Patents

A method and system for enhancing audio data annotation accuracy based on time-frequency enhancement

Info

Publication number
CN115132223B
CN115132223B
Authority
CN
China
Prior art keywords
audio data
spectrogram
audio
mel
enhancement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210748518.8A
Other languages
Chinese (zh)
Other versions
CN115132223A (en)
Inventor
刘海
张昭理
何嘉文
刘俊强
王书通
王坤
刘婷婷
杨兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University
Central China Normal University
Original Assignee
Hubei University
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University and Central China Normal University
Priority to CN202210748518.8A
Publication of CN115132223A
Application granted
Publication of CN115132223B
Legal status: Active
Anticipated expiration


Abstract


The present invention discloses a method for enhancing the accuracy of audio data annotation based on time-frequency enhancement. The method includes: a data collection process, in which the teacher's teaching audio is collected and the original signal in the teaching audio is converted into a mel spectrogram via a linear spectrogram as the transition quantity; a data enhancement process, in which the VoiceAugment audio data enhancement algorithm performs data enhancement on the input teaching audio to be annotated, the enhancement technique consisting of a frequency channel masking block and a time frame masking block that strengthen the characteristic attributes of the teaching audio; and an automatic annotation process, in which the ANNA model automatically annotates the teaching audio, the ANNA model consisting of modules for waveform-spectrogram feature acquisition, sound-spectrogram feature acquisition, feature fusion, and emotion annotation. The present invention realizes automatic annotation of the teacher's teaching emotion, improves the annotation speed and accuracy for teaching audio, overcomes the time-consuming and labor-intensive defects of manual annotation, and provides more accurate data labels for services such as emotion prediction in the teacher's classroom.

Description

Audio data annotation precision enhancement method and system based on time-frequency enhancement
Technical Field
The invention relates to the technical field of audio mode recognition, in particular to a method and a system for enhancing audio data annotation precision based on time-frequency enhancement.
Background
With the rapid development of artificial intelligence technology, artificial intelligence appears more and more throughout the education industry, and its application in the field of education and teaching has become a trend. Artificial intelligence plays a vital role in processing the various data produced during education and teaching, such as the audio data of teachers' classroom teaching; such data is in turn vital to the development of artificial intelligence in the education field, since that development cannot be separated from large amounts of data.
In recent years, most education and teaching has taken place online, so the daily data volume is huge and the data itself is quite complicated, which hinders analysis with artificial intelligence technology. In order to later use artificial intelligence to adjust and improve teachers' teaching strategies in a targeted way, an automatic labeling technology is currently needed to automatically label the audio data generated during education and teaching.
Disclosure of Invention
Aiming at at least one defect or improvement requirement of the prior art, the invention provides an automatic teaching emotion labeling method that performs data enhancement on the audio data of teachers' teaching collected during the education process. The method realizes enhancement of this audio data and full utilization of the audio information, and aims to improve the effectiveness of the teaching audio data as well as the speed and accuracy with which it is labeled.
In order to achieve the above object, in a first aspect, the present invention provides a method for enhancing audio data annotation accuracy based on time-frequency enhancement, comprising the steps of:
collecting first audio data, and converting the first audio data into a Mel spectrogram through transition quantity comprising a linear spectrogram;
inputting the Mel spectrogram into an audio attribute labeling model to obtain a spectrogram and a logarithmic Mel spectrogram respectively;
Processing the spectrogram and the logarithmic mel spectrogram by using an audio data enhancement method to obtain enhanced first audio data;
And inputting the enhanced first audio data into the audio attribute labeling model to obtain an attribute tag classification result of the first audio data.
Further, the audio data enhancement method specifically comprises the following steps:
Adopting mixed masking based on small batches to carry out mean mixing on the masking region of the hiding state of the first audio data and the masking region of the hiding state of the second audio data so as to obtain the masking region of the enhanced hiding state of the first audio data; the masking regions include a frequency channel masking region and a time frame masking region.
Further, converting the first audio data into a mel spectrogram through a transition quantity comprising a linear spectrogram specifically includes:
performing framing windowing and fast fourier transform operations on the first audio data to obtain the linear spectrogram;
applying a mel filter bank to the linear spectrogram to obtain a corresponding filter output;
And applying the filter output to an energy spectrum to obtain the Mel spectrogram with the attribute tag.
Further, inputting the mel spectrogram into the audio attribute labeling model to obtain a spectrogram specifically includes:
inputting the Mel spectrogram into a first network structure of the audio attribute labeling model to obtain a corresponding time-frequency-invariant spectrogram feature vector;
The main architecture of the first network structure is CNN14, which applies a one-dimensional convolutional neural network to the time-domain waveform; the first network structure comprises a one-dimensional convolution input layer, a plurality of one-dimensional convolution blocks and a one-dimensional convolution output layer; each one-dimensional convolution block is composed of two convolution layers, and the layer after every two convolution layers is a downsampling layer with a preset stride.
Further, inputting the mel-spectrogram into the audio attribute labeling model to obtain a logarithmic mel-spectrogram specifically includes:
Inputting the Mel spectrogram into a second network structure of the audio attribute labeling model to obtain a corresponding logarithmic Mel spectrogram characteristic vector with unchanged time frequency; the second network structure includes a two-dimensional convolution block.
Further, the method further comprises the steps of:
and inputting the spectrum graph feature vector and the logarithmic Mel spectrogram feature vector into the audio attribute labeling model, and obtaining an attribute tag classification result of the first audio data through feature connection operation and classification operation.
Further, the masking degree is set to a value in [10%, 25%].
In a second aspect, the present invention provides an audio data labeling accuracy enhancement system based on time-frequency enhancement, including:
The acquisition module is used for acquiring first audio data and converting the first audio data into a Mel spectrogram through transition quantity comprising a linear spectrogram;
The conversion module is used for inputting the Mel spectrogram into an audio attribute labeling model to respectively obtain a spectrogram and a logarithmic Mel spectrogram;
The enhancement module is used for processing the spectrogram and the logarithmic mel spectrogram by using an audio data enhancement method to obtain enhanced first audio data;
And the classification module is used for inputting the enhanced first audio data into the audio attribute labeling model to obtain an attribute tag classification result of the first audio data.
In a third aspect, the present invention provides an electronic device comprising at least one processing unit, and at least one storage unit, wherein the storage unit stores a computer program which, when executed by the processing unit, enables the processing unit to perform the steps of any one of the methods described above.
In a fourth aspect, the present invention provides a storage medium storing a computer program executable by an access authentication device, the computer program enabling the access authentication device to carry out the steps of any one of the methods described above when the computer program is run on the access authentication device.
In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:
(1) The audio data labeling accuracy enhancement method based on time-frequency enhancement uses waveform information and mel spectrum information simultaneously for automatic teaching emotion labeling. A novel data enhancement method processes the teaching audio data; the model then accurately extracts the spectrogram features and the logarithmic mel spectrum features and connects them, which reduces erroneous labeling and is of great significance for improving classroom education evaluation.
(2) The invention enhances data in both the input space and the hidden space of the deep neural network, enhancing the input and intermediate features with the frequency channel masking block and time frame masking block techniques. This lets the model focus both on the most discriminative parts of the features and on the features as a whole, improving generalization. Moreover, the spectrogram and the logarithmic mel spectrogram are combined into a new representation inside the model: the time-frequency-domain spectrogram and the frequency-domain logarithmic mel spectrogram are fused into a time-frequency representation, providing more reference information for the final automatic teaching emotion labeling and thereby improving its accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an audio data labeling accuracy enhancement method based on time-frequency enhancement according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a scenario of audio data acquisition in a teacher teaching environment according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network structure of an automatic labeling model for teaching emotion provided by an embodiment of the present invention;
fig. 4 is a block schematic diagram of an electronic device suitable for implementing the method described above according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The terms first, second, third and the like in the description, in the claims and in the above drawings are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" or "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Automatic labeling of teaching emotion requires labeling and classifying the audio of teachers' teaching, and classifying the teaching audio according to emotion benefits the efficient development of follow-up work. For the provided teaching audio data, appropriate data enhancement can improve the balance of the data and thereby the accuracy of automatic teaching emotion labeling. Therefore, after the provided teaching audio data is enhanced, the waveform and the sound spectrum of the enhanced audio are used to label it automatically, which improves the utilization of data from the teaching process. Compared with manually labeling the teaching audio, automatic labeling of teaching emotion with data enhancement greatly increases the speed of organizing data labels, saving time, labor and money.
As shown in fig. 1, in one embodiment, a method for enhancing the accuracy of audio data labeling based on time-frequency enhancement mainly includes the following major flow steps.
S1, a data acquisition part.
The audio data may be any section of voice data to be marked, and in this embodiment, the audio data is audio data of teacher teaching in the education and teaching process, that is, recording data of teaching sites, and fig. 2 is a schematic view of a scene of audio data acquisition in the teacher teaching environment. In an alternative embodiment, the collected and acquired audio data of the teacher teaching are first preprocessed as required, specifically:
Preprocessing the input teaching audio data by pre-emphasis, framing and windowing. In a specific example, a Hamming window is adopted as the window function of the windowing operation with a window size of 1024, the framed teaching audio is 100 frames per second, and the frames are then processed with the short-time Fourier transform, which is equivalent to applying a fast Fourier transform to each frame of the teaching audio, yielding a linear spectrogram. The fast Fourier transform has a parameter N indicating how many points are transformed; if the number of points in a frame is smaller than N, the frame is zero-padded. Each point corresponds to a frequency bin, and the frequency represented by point n is f_n = (n-1) * f_s / N, where f_s is the sampling rate of the audio.
Then a bank of 64 mel filters is applied to the linear spectrogram to obtain the filter outputs H_m(k), where m runs from 1 to M, M is the number of filters in the bank (64 here), k is the frequency-point index, and H_m(k) is the output value of the m-th mel filter at point k (the original filter expression is given only as a figure; the standard triangular mel filter is meant). The filter outputs are then applied to the energy spectrum to obtain the mel spectrogram:

MelSpec(m) = Σ_k |X(k)|² · H_m(k)

where |X(k)|² is the energy of the k-th point of the energy spectrum. The output of each filter over its frequency range is used as a weight and multiplied with the energy at the corresponding frequency of the energy spectrum, and the products are finally accumulated, i.e. a weighted sum is computed; MelSpec(m) is the mel spectrogram.
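For illustration, the following is a minimal Python sketch of this collection flow (pre-emphasis, framing and windowing, STFT, 64-band mel filter bank, weighted sum over the energy spectrum). The use of librosa, the 32 kHz sampling rate, and the pre-emphasis coefficient 0.97 are assumptions rather than details from the text; the hop length sr // 100 matches the 100 frames per second stated above.

```python
# Minimal sketch of the S1 data-collection flow; librosa, the sampling rate
# and the pre-emphasis coefficient are assumptions, not taken from the patent.
import numpy as np
import librosa

def mel_spectrogram(path, sr=32000, n_fft=1024, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    # Pre-emphasis (0.97 is a common default; the patent does not give a value)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    hop = sr // 100                       # 100 frames per second, as in the text
    # Framing + windowing (Hamming window of size 1024) + per-frame FFT = STFT
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming")
    energy = np.abs(S) ** 2               # energy spectrum |X(k)|^2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # H_m(k)
    return mel_fb @ energy                # MelSpec(m) = sum_k |X(k)|^2 * H_m(k)
```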
Teaching audio files with abnormalities (for example, teaching audio clips with empty content) are cleaned out.
S2, a data enhancement part.
In this embodiment, the VoiceAugment audio data enhancement method is used to enhance the input teaching audio data.
The VoiceAugment audio data enhancement method comprises a frequency channel masking block and a time frame masking block, and is processed by adopting mixed masking based on small batches.
Let x ∈ R^(T×F) denote an intermediate hidden state of the input data, where T and F represent the number of time frames and frequency channels, respectively (the symbol x is reconstructed here; the original expression is rendered as a figure). Applying time frame masking to the consecutive time frames [t0, t0+t] means that the covered elements are replaced with zero or another value, where t is a value selected from a uniform distribution on [0, t'] for a time masking parameter t', and t0 is a value selected from [0, T-t]. Similarly, frequency channel masking is employed for the consecutive frequency channels [f0, f0+f], where f is a value first selected from a uniform distribution on [0, f'] for a frequency masking parameter f', and f0 is a value selected from [0, F-f]. To simplify the process, for one teaching audio sample the same time masking and frequency masking, i.e. the same t0, t, f0 and f, are used for every intermediate hidden state of the same layer in the sample.
In one specific example, there are C feature maps at a given layer of the CNN (convolutional neural network) model; these C feature maps then share the same time and frequency regions in one training iteration when masking a single teaching audio sample.
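For illustration, a minimal sketch of the two masking blocks applied to one hidden state is given below, assuming a NumPy array of shape (T, F) and a zero fill value; the function name and parameters are hypothetical.

```python
# Minimal sketch of the time frame masking block and frequency channel masking
# block on a hidden state x of shape (T, F); NumPy and zero-filling are assumptions.
import numpy as np

def mask_time_and_freq(x, t_param, f_param, rng=None):
    rng = rng or np.random.default_rng()
    T, F = x.shape
    x = x.copy()
    t = int(rng.integers(0, t_param + 1))   # t ~ U[0, t']
    t0 = int(rng.integers(0, T - t + 1))    # t0 ~ U[0, T - t]
    x[t0:t0 + t, :] = 0.0                   # mask t consecutive time frames
    f = int(rng.integers(0, f_param + 1))   # f ~ U[0, f']
    f0 = int(rng.integers(0, F - f + 1))    # f0 ~ U[0, F - f]
    x[:, f0:f0 + f] = 0.0                   # mask f consecutive frequency channels
    return x
```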
For example: to enhance the hidden state of a given layer of the selected teaching audio sample (i.e. the first audio data mentioned in the claims), another teaching audio sample in the same mini-batch is selected at random (i.e. the second audio data mentioned in the claims), and masking is performed with the hidden state of the same layer of that randomly selected sample. The mini-batch-based mixed-masking data enhancement method processes the masking regions of the hidden states of the two selected teaching audio samples by mean mixing.
The specific contents of the data enhancement algorithm based on the small-batch mixed masking are as follows:
input: hidden state of audio sample for selected teacher teachingHidden state of audio sample of another teacher teaching in the same small lotThe number of consecutive time frames t, the index of start time t0, the number of consecutive frequency channels f, the index of start frequency f0.
And (3) outputting: hidden state of audio sample enhancement of selected teacher teaching
The more specific flow is as follows:
Initializing the value of x' to x;
When i epsilon [ t0,t0 +t ], j epsilon [0,F ], calculating x' [ i, j ]:
When i epsilon [0, T ], j epsilon [ f0,f0 +f ], x' i, j is calculated:
and returning the value of x' at the moment to obtain a final result.
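A minimal Python sketch of this mean-mixing flow follows, assuming NumPy arrays of shape (T, F) for the two hidden states and half-open slice ranges; function and variable names are hypothetical.

```python
# Minimal sketch of mini-batch mixed masking: inside the masked regions, the
# selected sample's hidden state x is mean-mixed with the hidden state
# x_other of another sample from the same mini-batch.
import numpy as np

def mix_mask(x, x_other, t0, t, f0, f):
    x_out = x.copy()                       # initialize x' to x
    # Time-frame region: i in [t0, t0 + t), all frequency channels j in [0, F)
    x_out[t0:t0 + t, :] = (x[t0:t0 + t, :] + x_other[t0:t0 + t, :]) / 2.0
    # Frequency-channel region: all time frames i in [0, T), j in [f0, f0 + f)
    x_out[:, f0:f0 + f] = (x[:, f0:f0 + f] + x_other[:, f0:f0 + f]) / 2.0
    return x_out                           # enhanced hidden state x'
```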
In a specific example, the audio data of the teacher teaching is substituted as a selection sample into a data enhancement algorithm based on a small batch of mixed masking, so that the input space and the hidden space of the audio data of the teacher teaching can be subjected to data enhancement, and the audio data of the teacher teaching after the data enhancement can be obtained. The specific process is as follows:
downsampling the input teaching audio data to 22.05 kHz;
applying a short-time Fourier transform with a window size w of 2048 and an overlap rate o of 25% to the teaching audio;
then applying a mel-scale filter bank to the teaching audio;
generating 256 filter bins and 43 frames of data per second;
normalizing the input teaching audio data to zero mean;
The Adam optimizer is used for a total of 350 epochs with an initial learning rate of 1 × 10⁻⁴;
The learning rate decays linearly from epoch 50 until it reaches 5 × 10⁻⁶ at epoch 250, after which training continues for another 100 epochs at the minimum learning rate of 5 × 10⁻⁶ (a sketch of this schedule follows this list);
The hyperparameters in the experiment are t0 = 43 and f0 = 26, so that approximately 10% of the time frames and frequency channels are masked.
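For illustration, a minimal sketch of this optimizer schedule using PyTorch's Adam and a LambdaLR scheduler; the placeholder model and the reading of "stage" as epoch are assumptions.

```python
# Minimal sketch of the training schedule: lr = 1e-4 for epochs 0-49, linear
# decay to 5e-6 by epoch 250, then constant 5e-6 until epoch 350.
import torch

model = torch.nn.Linear(64, 4)                       # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_factor(epoch, floor=5e-6 / 1e-4):             # floor as fraction of base lr
    if epoch < 50:
        return 1.0
    if epoch < 250:
        return 1.0 - (1.0 - floor) * (epoch - 50) / 200.0   # linear decay
    return floor

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_factor)
for epoch in range(350):
    # ... one epoch of mini-batch training (zero_grad/backward/opt.step) here ...
    sched.step()                                     # advance the schedule per epoch
```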
Because the time frames and the frequency channels are masked with the corresponding parts of other teachers' teaching audio, interference is introduced that trains the discrimination ability of the network. Mini-batch-based mixed masking is a natural algorithm: it preserves the entire sample information of the selected teaching audio, while the interference from the time frames and frequency channels of other teaching audio samples enhances robustness. The degree of masking has a significant impact on the effectiveness of the data enhancement; too much masking may cause the original teaching audio data to lose a significant amount of information and severely confuse the training of the model. The data enhancement performance is best in the range [10%, 25%]; the masking employed in the example is 10%, which is sufficient to direct the data enhancement algorithm to focus on low-resolution time frames and frequency channels and to increase robustness.
S3, automatically labeling the part.
Fig. 3 is a schematic diagram of the network structure of the teaching emotion automatic labeling model (ANNA model) provided in this embodiment. As shown in fig. 3, the teaching emotion automatic labeling model (i.e. the audio attribute labeling model mentioned in the claims; emotions such as happiness and sadness are in fact attributes of the audio) includes a first network structure and a second network structure; the first network structure obtains a spectrogram from the input mel spectrogram, and the second network structure obtains a logarithmic mel spectrogram from the input mel spectrogram.
And the model connects and predicts the spectrum graph characteristics output by the first network structure and the logarithmic Mel spectrum graph characteristics output by the second network structure, and finally obtains the emotion label classification condition of the audio of the teacher teaching.
In practice, the above audio data enhancement method may also be used to process the spectral features and log mel features to obtain enhanced first audio data; and then, inputting the enhanced first audio data into the automatic teaching emotion labeling model, and obtaining an attribute tag classification result (emotion tag classification condition) of the first audio data.
In this embodiment, the first network structure is a one-dimensional convolutional neural network, which is used to extract a spectrogram of time-frequency information.
In one specific example, CNN14 is used as the primary architecture for the first network structure.
The one-dimensional CNN is applied to the time-domain waveform.
The one-dimensional CNN starts with a convolution layer whose filter size w is 11 and whose stride s is 5, so that the input size is reduced: this operation shortens the input by a factor of 5, reducing memory use and improving overall efficiency.
Then follow 3 convolution blocks, each consisting of two convolution layers; this design aims to enlarge the receptive field of the convolution layers.
The layer after each convolution block is a downsampling layer with a stride s of 4.
Through the stride-5 input layer and the three stride-4 downsampling layers (a total reduction factor of 5 × 4 × 4 × 4 = 320), 32 kHz teaching audio is downsampled to features at 100 frames per second (32000 / 320 = 100).
The output size of the one-dimensional CNN is T × C, where T is the number of frames and C is the number of channels;
by dividing the C channels into C/f groups, each of which carries a frequency dimension f that allows the final spectrogram to learn frequency information, the output can be reshaped into a tensor of size T × (C/f) × f, i.e., the final output spectrogram.
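A minimal PyTorch sketch of this first network structure follows; only the kernel size 11, stride 5, the three two-layer convolution blocks, the stride-4 downsampling and the channel-splitting reshape come from the text, while channel counts, padding, batch normalization and ReLU activations are assumptions.

```python
# Minimal PyTorch sketch of the first network structure (1-D CNN over the
# waveform); channel count and the per-group frequency size f are assumptions.
import torch
import torch.nn as nn

class ConvBlock1d(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.BatchNorm1d(ch), nn.ReLU(),
            nn.MaxPool1d(kernel_size=4),            # downsampling layer, stride 4
        )

    def forward(self, x):
        return self.net(x)

class WaveformBranch(nn.Module):
    def __init__(self, ch=64, f=8):                 # f = per-group frequency size
        super().__init__()
        self.f = f
        self.inp = nn.Conv1d(1, ch, kernel_size=11, stride=5, padding=5)
        self.blocks = nn.Sequential(ConvBlock1d(ch), ConvBlock1d(ch), ConvBlock1d(ch))

    def forward(self, wav):                         # wav: (batch, 1, samples)
        x = self.blocks(self.inp(wav))              # (batch, C, T); 5*4*4*4 = 320x
        b, c, t = x.shape                           # 32 kHz in -> 100 frames/s out
        return x.view(b, c // self.f, self.f, t)    # split C into C/f groups of f
```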
The first network structure obtains a spectrogram by using the input Mel spectrogram, and the flow comprises the following steps:
Obtaining the original teaching audio data, performing framing, windowing and fast Fourier transformation on the audio data to obtain a linear spectrogram, then applying a mel filter to the linear spectrogram and taking the logarithm to obtain a mel spectrogram with emotion labels;
and inputting the Mel spectrogram into a first network structure, and obtaining a corresponding time-frequency-invariant spectrogram feature vector through the network structure.
In this embodiment, a neural network extracts a time-frequency representation of the waveform, and the frequency information the network lacks is learned within the one-dimensional convolutional neural network, so that a new time-frequency transformation, i.e. the spectrogram feature vector, is learned.
In this embodiment, the second network structure is a two-dimensional convolutional neural network used to extract the logarithmic mel spectrogram.
The second network structure obtains a logarithmic mel spectrogram by using the input mel spectrogram, and the flow comprises the following steps:
Obtaining audio data of teacher teaching, carrying out framing windowing and fast Fourier transformation on the audio data to obtain a linear spectrogram, then applying a Mel filter on the linear spectrogram, and then taking logarithm to obtain a Mel spectrogram with emotion labels;
And inputting the Mel spectrogram into a second network structure, and obtaining a corresponding logarithmic Mel spectrogram characteristic vector with unchanged time frequency through the network structure.
Operating on the input data using a two-dimensional convolution block;
generating a corresponding characteristic diagram through the processing of the second network structure;
The automatic labeling model of teaching emotion further comprises a connecting network and a classifying network.
The connecting network is respectively connected with the output sides of the first network structure and the second network structure and is used for connecting the spectrum diagram characteristic output by the first network structure and the logarithmic mel-sound spectrum characteristic output by the second network structure to obtain the connecting characteristic.
In a specific example, the classification network may sequentially employ a ReLU classifier and a Sigmoid classifier, which generate the emotion label classification results of the teaching audio from the connection features.
The final network structure of the model is shown in fig. 3, and includes:
12 convolutional layers with kernel size 3 × 3 and 2 fully-connected layers.
Every 2 convolutional layers are followed by an average pooling layer of size 2 × 2 for downsampling.
After the last convolutional layer, global pooling aggregates the feature map into a fixed-length vector.
An additional fully-connected layer after the global pooling layer extracts embedded features, further improving the representation capability of the features.
Linear classifiers (ReLU and Sigmoid) are applied to the embedded features to implement the classification labeling task.
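A minimal PyTorch sketch of this second network structure and the labeling head follows; the 12 3×3 convolution layers grouped in pairs, the 2×2 average pooling, the global pooling, the extra embedding layer and the ReLU/Sigmoid classifiers come from the text, while channel widths, feature dimensions and the number of emotion labels are assumptions.

```python
# Minimal sketch of the log-mel branch and the labeling head; channel widths
# and label count are assumptions, the layer layout follows the text.
import torch
import torch.nn as nn

def conv_pair(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
        nn.AvgPool2d(2),                     # 2x2 average pooling for downsampling
    )

class LogMelBranch(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512, 512, 512)):  # 6 pairs = 12 convs
        super().__init__()
        chans = (1,) + widths
        self.blocks = nn.Sequential(*[conv_pair(a, b) for a, b in zip(chans, chans[1:])])
        self.embed = nn.Linear(widths[-1], 512)   # extra FC layer for embeddings

    def forward(self, logmel):                    # logmel: (batch, 1, mels, frames)
        x = self.blocks(logmel)
        x = x.mean(dim=(2, 3))                    # global pooling -> fixed-length vector
        return self.embed(x)

class LabelHead(nn.Module):
    def __init__(self, dim_wave=512, dim_mel=512, n_labels=4):
        super().__init__()
        self.fc1 = nn.Linear(dim_wave + dim_mel, 512)   # first of the 2 FC layers
        self.fc2 = nn.Linear(512, n_labels)             # second FC layer

    def forward(self, wave_feat, mel_feat):
        z = torch.cat([wave_feat, mel_feat], dim=-1)    # feature connection
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # ReLU then Sigmoid
```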
After the automatically marked emotion of the teacher teaching audio is obtained, the emotion of the teacher can be further analyzed, so that the teaching capability of the teacher is improved.
In one embodiment, an audio data annotation accuracy enhancement system based on time-frequency enhancement mainly comprises:
The acquisition module is used for acquiring first audio data and converting the first audio data into a Mel spectrogram through transition quantity comprising a linear spectrogram;
The conversion module is used for inputting the Mel spectrogram into an audio attribute labeling model to respectively obtain a spectrogram and a logarithmic Mel spectrogram;
The enhancement module is used for processing the spectrogram and the logarithmic mel spectrogram by using an audio data enhancement method to obtain enhanced first audio data;
And the classification module is used for inputting the enhanced first audio data into the audio attribute labeling model to obtain an attribute tag classification result of the first audio data.
Fig. 4 schematically shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the invention. The electronic device shown in fig. 4 is only an example and should not be construed as limiting the functionality and scope of use of embodiments of the invention.
As shown in fig. 4, the electronic device 1000 described in the present embodiment includes: a processor 1001 which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1001 may also include on-board memory for caching purposes. The processor 1001 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1003, various programs and data required for the operation of the system 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiment of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the program may be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1000 may also include an input/output (I/O) interface 1005, the input/output (I/O) interface 1005 also being connected to the bus 1004. The system 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
The method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1001. The systems, devices, apparatus, modules or units etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
Embodiments of the present invention also provide a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include one or more memories other than the ROM 1002 and/or RAM 1003 described above.
It should be noted that, in each embodiment of the present invention, the functional modules may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or as software functional modules. If implemented as software functional modules and sold or used as a stand-alone product, the integrated modules may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the invention, in essence, or the part that contributes beyond the prior art, may be embodied in the form of a software product.
The flowcharts or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined in various combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments of the disclosure and/or the claims may be made without departing from the spirit and teachings of the disclosure, all of which fall within the scope of the disclosure.
While the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. The scope of the disclosure should, therefore, not be limited to the above-described embodiments, but should be determined not only by the following claims, but also by the equivalents of the following claims.

Claims (9)

CN202210748518.8A | Priority 2022-06-29 | Filed 2022-06-29 | A method and system for enhancing audio data annotation accuracy based on time-frequency enhancement | Active | CN115132223B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210748518.8A | 2022-06-29 | 2022-06-29 | A method and system for enhancing audio data annotation accuracy based on time-frequency enhancement

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210748518.8A | 2022-06-29 | 2022-06-29 | A method and system for enhancing audio data annotation accuracy based on time-frequency enhancement

Publications (2)

Publication Number | Publication Date
CN115132223A (en) | 2022-09-30
CN115132223B (en) | 2024-11-26

Family

ID=83380303

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210748518.8A | A method and system for enhancing audio data annotation accuracy based on time-frequency enhancement (Active) | 2022-06-29 | 2022-06-29

Country Status (1)

Country | Link
CN (1) | CN115132223B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115881112A (en)* | 2022-12-08 | 2023-03-31 | 大连东软信息学院 | Speech recognition data enhancement method based on feature replacement and masking of spectrogram
CN117235668A (en)* | 2023-09-22 | 2023-12-15 | 东南大学 | CNN model fusion-based fault diagnosis method and system for heavy-duty gearbox

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN112199548A (en)* | 2020-09-28 | 2021-01-08 | 华南理工大学 | Music audio classification method based on convolution cyclic neural network
CN114639377A (en)* | 2022-03-23 | 2022-06-17 | 中南大学 | A comprehensive teaching video voice extraction text method

Family Cites Families (1)

Publication number | Priority date | Publication date | Assignee | Title
CN112634928B (en)* | 2020-12-08 | 2023-09-29 | 北京有竹居网络技术有限公司 | Sound signal processing method and device and electronic equipment


Also Published As

Publication number | Publication date
CN115132223A (en) | 2022-09-30


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
