CN114067793B - Audio processing method and device, electronic device and readable storage medium


Info

Publication number
CN114067793B
CN114067793B · CN202111302400.4A
Authority
CN
China
Prior art keywords
audio
separation
vector
processed
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111302400.4A
Other languages
Chinese (zh)
Other versions
CN114067793A (en)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111302400.4A
Publication of CN114067793A
Application granted
Publication of CN114067793B
Legal status: Active
Anticipated expiration

Abstract

The disclosure provides an audio processing method and apparatus, an electronic device, and a readable storage medium, and relates to the technical field of speech processing, in particular to artificial intelligence, speech technology, and deep learning. The method includes: obtaining audio to be processed, where the audio to be processed includes initial audio data collected from a plurality of sound sources and the sound sources correspond to a plurality of objects; performing content recognition on the audio to be processed to obtain content vectors and time information corresponding to the content vectors; and separating the audio to be processed based on the content vectors and the time information to obtain a separation result, where the separation result is used to determine, from the initial audio data, the target audio data corresponding to each of the plurality of objects. This scheme improves the accuracy of the separation result and the discriminability of the overall features, and addresses the poor separation performance of the voice separation methods provided in the related art.

Description

Audio processing method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to the fields of artificial intelligence, speech technology, and deep learning. The disclosure provides an audio processing method and device, electronic equipment and a readable storage medium.
Background
In intelligent customer service, conference discussions, interview dialogues, and similar scenarios, sound from multiple users is often collected on a single channel, so the recorded audio must first be separated by speaker before the speech of different users can be analyzed and processed individually. At present, collected audio can be separated by an offline voice separation method: the audio is first cut into small segments of equal length, and separation is then performed given the number of speakers in the audio or a threshold. However, if the collected speech of multiple users overlaps, the separation effect is poor.
Disclosure of Invention
The disclosure provides an audio processing method and device, electronic equipment and a readable storage medium.
According to a first aspect of the present disclosure, an audio processing method is provided, which includes obtaining audio to be processed, where the audio to be processed includes initial audio data acquired from a plurality of sound sources, where the plurality of sound sources correspond to a plurality of objects, performing content recognition on the audio to be processed to obtain a content vector and time information corresponding to the content vector, and separating the audio to be processed based on the content vector and the time information to obtain a separation result, where the separation result is used to determine target audio data corresponding to each of the plurality of objects from the initial audio data.
According to a second aspect of the present disclosure, an audio processing apparatus is provided, which includes an acquisition module configured to acquire audio to be processed, where the audio to be processed includes initial audio data acquired from a plurality of sound sources, where the plurality of sound sources correspond to a plurality of objects, an identification module configured to identify content of the audio to be processed, to obtain a content vector and time information corresponding to the content vector, and a separation module configured to separate the audio to be processed based on the content vector and the time information, to obtain a separation result, where the separation result is configured to determine target audio data corresponding to each of the plurality of objects from the initial audio data.
According to a third aspect of the present disclosure, there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method described above.
According to the embodiments of the present disclosure, after the audio to be processed is obtained, content recognition is performed on it to obtain a content vector and time information, and the audio is then separated by combining the content vector and the time information, thereby achieving voice separation. Because the content vector and the time information are combined during separation, each cut audio segment retains complete content information, which makes the feature vector corresponding to the segment more discriminative. This improves the accuracy of the separation result and the discriminability of the overall features, and addresses the poor separation performance of the voice separation methods provided in the related art.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of an audio processing method according to the present disclosure;
FIG. 2 is a schematic diagram of an audio separation model and an auxiliary separation model according to the present disclosure;
FIG. 3 is a schematic diagram of an audio processing device according to the present disclosure;
Fig. 4 is a block diagram of an electronic device for implementing an audio processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, a common algorithm for voice separation combines a TDNN (Time Delay Neural Network) x-vector extractor (used to extract user feature vectors) with AHC (Agglomerative Hierarchical Clustering). However, this scheme is complex, is not an end-to-end implementation, its training and testing conditions can be mismatched, and its separation effect is not ideal when the speech of multiple users overlaps.
According to an embodiment of the present disclosure, the present disclosure provides an audio processing method, as shown in fig. 1, which may include the steps of:
step S102, obtaining audio to be processed, wherein the audio to be processed comprises initial audio data acquired from a plurality of sound sources, and the plurality of sound sources correspond to a plurality of objects.
In some embodiments, in a scenario of a multi-person conversation, such as a conference discussion, interview conversation, a variety of shows, etc., the sounds of a plurality of different users may be collected by microphones, resulting in initial audio data from a plurality of different sound sources.
Step S104, carrying out content recognition on the audio to be processed to obtain a content vector and time information corresponding to the content vector.
The time information in the above step may refer to a time stamp of the content vector, including but not limited to a start time and a duration of the content vector.
In a multi-user conversation scenario, the utterances of the same user have a certain contextual correlation. To improve the voice separation effect, content recognition can be performed on the audio to be processed by machine learning: the text information of the audio is recognized, along with the time stamps of pronunciation units at different granularities, i.e., the start time and duration of each pronunciation unit in the audio to be processed. The granularities may include, but are not limited to, phonemes, characters, and words. In addition, the text information can be cut at a specific granularity according to the required separation precision to obtain a plurality of texts, and feature extraction is then performed on each text to obtain its feature vector, i.e., the content vector described above (also called a content embedding).
And S106, separating the audio to be processed based on the content vector and the time information to obtain a separation result, wherein the separation result is used for determining target audio data corresponding to each object in the plurality of objects from the initial audio data.
In some embodiments, the audio to be processed may be cut based on the time information to obtain multiple audio segments of different lengths, unlike the equal-length cuts used in existing methods. The audio segments are then identified by machine learning in combination with the content vector to determine the user corresponding to each segment, and finally the segments of the same user are aggregated to obtain the target audio data of each user.
For example, in a conference discussion scenario, the audio of the entire conference can be collected by a sound collecting device such as a microphone and used as the audio to be processed. Since multiple participants speak throughout the conference and the speaking time of each participant is not fixed, after the audio to be processed is collected, the time stamps of pronunciation units at different granularities can be determined by recognizing its content, and the audio is then segmented according to the recognized time stamps to obtain audio segments. At this point, the granularity of the audio segments is the same as the granularity of the content vectors, and by identifying the audio segments in combination with the content vectors, the speech of each participant can be accurately determined, achieving the goal of voice separation.
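The following is a minimal illustrative sketch (not taken from the patent) of how audio might be cut by such time stamps and aggregated per speaker; the function and variable names are assumptions for illustration only.

    import numpy as np

    def cut_by_timestamps(waveform, sr, units):
        """Cut a mono waveform into variable-length segments.

        `units` is a list of (text, start_sec, duration_sec) tuples, e.g. the
        output of a forced-alignment step; names are illustrative only.
        """
        segments = []
        for text, start, duration in units:
            begin = int(start * sr)
            end = int((start + duration) * sr)
            segments.append((text, waveform[begin:end]))
        return segments

    def group_by_speaker(segments, speaker_ids):
        """Aggregate segments assigned to the same speaker into one track each."""
        tracks = {}
        for (_, chunk), spk in zip(segments, speaker_ids):
            tracks.setdefault(spk, []).append(chunk)
        return {spk: np.concatenate(chunks) for spk, chunks in tracks.items()}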
Through the steps, after the audio to be processed is obtained, the content vector and the time information can be obtained by carrying out content identification on the audio to be processed, and the audio to be processed is separated by combining the content vector and the time information, so that the aim of voice separation is fulfilled. It is easy to notice that, because the content vector and the time information are combined in the process of separating the voice, the complete content information can be reserved in the cut audio segment, so that the feature vector corresponding to the audio segment is more distinguishable, the effect of improving the accuracy of the separation result and the distinguishing property of the whole feature is achieved, and the problem of poor separation effect of the voice separation method provided in the related technology is solved.
Optionally, performing content recognition on the audio to be processed to obtain the content vector and the time information includes: recognizing the audio to be processed with a forced alignment model to obtain text information and time information, and performing feature extraction on the text information with a feature generation model to obtain the content vector.
The forced alignment model may be trained in advance on open-source data such as AISHELL or LibriSpeech, based on common model frameworks such as GMM-HMM (Gaussian Mixture Model - Hidden Markov Model), LSTM-CTC (Long Short-Term Memory - Connectionist Temporal Classification), chain models, or CNN-RNN-T (Convolutional Neural Network - Recurrent Neural Network Transducer). The input of the model may be the Mel spectrum of the audio, and the output is the predicted probability of each pronunciation unit; the training loss is CE (Cross-Entropy Loss). The model converges after several training iterations, yielding a model with stable performance. At inference time, inputting a Mel spectrum yields the corresponding text information and time information, i.e., the time stamps of phonemes, characters, and words.
The feature generation model may be a common feature extraction model, which is not specifically limited in this disclosure; its input is a pronunciation unit such as a phoneme, character, or word, and its output is the corresponding feature vector.
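As a hedged illustration only (the disclosure does not fix a particular architecture), the feature generation model could be as simple as an embedding lookup over pronunciation-unit ids; the vocabulary size and dimension below are assumed values.

    import torch
    import torch.nn as nn

    class ContentEmbedder(nn.Module):
        """Minimal 'feature generation model' sketch: map pronunciation-unit ids
        to content embeddings (content vectors)."""
        def __init__(self, num_units=512, embed_dim=128):
            super().__init__()
            self.table = nn.Embedding(num_units, embed_dim)

        def forward(self, unit_ids):           # unit_ids: (T,) integer ids
            return self.table(unit_ids)        # (T, embed_dim) content vectors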
In some embodiments, after the audio to be processed is obtained, its Mel spectrum may be extracted, for example by passing the audio through a Mel-scale filter bank and transforming the result, but the method is not limited thereto. The Mel spectrum is input to the forced alignment model to obtain the corresponding text information and time information, e.g. phonemes and their timing, and the text information is then mapped to feature vectors by the feature generation model, i.e. the content vectors described above are obtained.
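For example, the Mel spectrum mentioned above can be computed with a standard audio library; the frame and filter-bank settings below are common defaults, not values specified by the disclosure.

    import librosa
    import numpy as np

    # Load the audio to be processed; the file name and sampling rate are illustrative.
    y, sr = librosa.load("to_be_processed.wav", sr=16000, mono=True)

    # 80-band Mel spectrogram with a 25 ms window and 10 ms hop, converted to log scale.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (80, num_frames)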
Through the steps, the content recognition is carried out on the audio to be processed by constructing the forced alignment model and the feature generation model in advance, so that the content recognition efficiency and accuracy are improved, and the effect of accurate separation of human voice is further improved.
Optionally, the content vector includes feature vectors of a plurality of texts at a preset granularity, and the time information includes the time stamps of the plurality of texts. Separating the audio to be processed based on the content vector and its corresponding time information to obtain the separation result includes: cutting the audio to be processed based on the time stamp of each text to obtain a plurality of target audios, and separating the plurality of target audios with an audio separation model based on the feature vectors of the plurality of texts to obtain the separation result.
The preset granularity may be a phoneme, a character, a word, or the like, but is not limited thereto. A text at the preset granularity may be a pronunciation unit in the audio to be processed.
The audio separation model may be a voice separation module based on a PIT (Permutation Invariant Training) deep neural network. As shown in FIG. 2, the model may be composed of multiple BLSTM (Bidirectional Long Short-Term Memory) layers, a linear mapping layer, and a Sigmoid activation layer, trained with a BCE (Binary Cross-Entropy) loss function and the PIT strategy. The training audio contains two users, A and B; the predicted outputs are shown in the output layer in the figure. Since the correspondence between the two output rows and the two users is unknown, both permutations are computed and the one with the smaller loss is selected as the final loss. It should be noted that if the sounds of the two users overlap in time, both output probabilities will be relatively high.
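A minimal sketch of the PIT criterion for two speakers, assuming the network output has already passed through the Sigmoid layer; this is an illustration of the strategy, not the patented implementation.

    import torch
    import torch.nn.functional as F

    def pit_bce_loss(pred, target):
        """Permutation-invariant BCE for two speakers.

        pred, target: tensors of shape (T, 2) holding per-segment activity
        probabilities / labels for speakers A and B (ordering unknown).
        """
        loss_keep = F.binary_cross_entropy(pred, target)             # A->A, B->B
        loss_swap = F.binary_cross_entropy(pred, target[:, [1, 0]])  # A->B, B->A
        return torch.minimum(loss_keep, loss_swap)                   # smaller loss wins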
In some embodiments, the audio to be processed may be cut at the same granularity according to the time information to obtain a plurality of audio segments (i.e. the plurality of target audios described above); that is, the granularity of the input of the audio separation model matches that of the content embedding, and both may be phonemes, characters, or words. The Mel spectrum is extracted from each audio segment, and then the Mel spectra and the feature vector of each pronunciation unit output by the feature generation model are input to the audio separation model for separation to obtain the final voice separation result.
Through the steps, the audio to be processed is cut based on the time stamp of each text, the cut target audio is ensured to keep complete content, and the feature vectors of a plurality of target audios and a plurality of texts are subjected to voice separation by utilizing the pre-trained audio separation model, so that the effect of improving the voice separation efficiency and the accuracy is achieved.
The audio separation model includes at least a first-layer bidirectional long short-term memory (BLSTM) model and a second-layer BLSTM model. Separating the plurality of target audios with the audio separation model based on the feature vectors of the plurality of texts to obtain the separation result includes: inputting the plurality of target audios into the first-layer BLSTM for processing to obtain a first output vector, splicing the first output vector with the feature vectors of the plurality of texts to obtain a spliced vector, and inputting the spliced vector into the second-layer BLSTM for processing to obtain the separation result.
In some embodiments, since the audio separation model needs to process the feature vector of each text together with the plurality of audio segments, the Mel spectrum may be extracted for each audio segment and input to the first stacked BLSTM to obtain a feature vector H at the corresponding granularity. The output vector H of the first stacked BLSTM is then spliced with the feature vector C of the corresponding text to obtain the vector M, and this vector is input to the second stacked BLSTM as the input of the subsequent module of the audio separation model, as shown in FIG. 2.
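A sketch, under assumed layer sizes, of how the two stacked BLSTMs and the splicing of H with C could be wired together; the head outputs two per-segment activity probabilities as in FIG. 2. All dimensions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class SeparationModel(nn.Module):
        """Illustrative separation network: first stacked BLSTM over per-segment
        features, concatenation with content embeddings C, second stacked BLSTM,
        then a linear + sigmoid head for two speakers."""
        def __init__(self, feat_dim=80, content_dim=128, hidden=256):
            super().__init__()
            self.blstm1 = nn.LSTM(feat_dim, hidden, num_layers=2,
                                  bidirectional=True, batch_first=True)
            self.blstm2 = nn.LSTM(2 * hidden + content_dim, hidden, num_layers=2,
                                  bidirectional=True, batch_first=True)
            self.head = nn.Linear(2 * hidden, 2)

        def forward(self, segment_feats, content_emb):
            # segment_feats: (B, T, feat_dim) acoustic features per audio segment
            # content_emb:   (B, T, content_dim) content vectors C
            h, _ = self.blstm1(segment_feats)         # H: (B, T, 2*hidden)
            m = torch.cat([h, content_emb], dim=-1)   # spliced vector M
            z, _ = self.blstm2(m)
            return torch.sigmoid(self.head(z)), m     # activity probabilities and M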
Through the steps, the first output vector and the feature vectors of the texts are spliced, so that the content of the texts can be fully considered in the process of identifying the first output vector by the audio separation model, and the effects of enabling the spliced features to be more distinguishable and improving the accuracy of voice separation are achieved.
The method further includes the steps of: obtaining training samples, where the training samples include training audio and labeling results corresponding to the training audio, the training audio includes audio data collected from a plurality of training sound sources, and the plurality of training sound sources correspond to a plurality of training objects; performing content recognition on the training audio to obtain training vectors corresponding to the training audio and time information corresponding to the training vectors; separating the training audio based on the training vectors and the corresponding time information to obtain a first prediction result, where the first prediction result represents the probabilities of the training objects corresponding to the training vectors; processing the labeling results and the first prediction result to obtain a first loss function; and adjusting the model parameters of the audio separation model based on the first loss function.
The training samples may be a large amount of collected multi-person conversation audio containing a small proportion of overlapped speech (10%-20%). The labeling result may be the training objects corresponding to different audio segments at a specific granularity, obtained through manual labeling.
It should be noted that, to improve the quality of the training samples, they may be preprocessed, including noise removal (environmental noise, busy tones, ringing, etc., but not limited thereto), to obtain high-quality audio. In addition, to ensure that the number of training samples meets the training requirement, data augmentation may be applied to the high-quality audio, including but not limited to time-domain warping and frequency-domain masking.
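As one hedged example of such frequency-domain masking (a SpecAugment-style augmentation; the mask width and fill value below are arbitrary choices, not values prescribed by the disclosure):

    import numpy as np

    def freq_mask(log_mel, max_width=8, rng=None):
        """Zero out a random band of Mel bins in a (n_mels, frames) log-Mel spectrogram."""
        rng = rng or np.random.default_rng()
        augmented = log_mel.copy()
        width = int(rng.integers(1, max_width + 1))
        start = int(rng.integers(0, log_mel.shape[0] - width))
        augmented[start:start + width, :] = log_mel.min()   # fill with the floor value
        return augmented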
In some embodiments, for the training audio, the Mel spectrum may be extracted and input to the forced alignment model to obtain a plurality of texts and their time stamps; the plurality of texts are then passed through the feature generation model to obtain the feature vectors C, and the training audio is cut at the same granularity according to the time information. As shown in FIG. 2, the Mel spectrum is extracted for each audio segment and input to the first stacked BLSTM to obtain a higher-level feature vector H at the corresponding granularity; H and C are spliced to obtain M, and M is input to the second stacked BLSTM to obtain the corresponding first prediction result, i.e. the probability of the corresponding training object. The corresponding first loss function is obtained through the PIT strategy, and the model parameters of the audio separation model are updated based on this loss function, so that a high-performance audio separation model is obtained through training.
Through the steps, the audio separation model is trained through the training sample, and the training is ensured to obtain a high-performance audio separation model, so that the effect of improving the accuracy of human voice separation is achieved.
Optionally, after the labeling result and the first prediction result are processed to obtain the first loss function, the method further includes: obtaining a target vector, where the target vector is the vector input to the second-layer BLSTM of the audio separation model; predicting the target vector with an auxiliary separation model to obtain a second prediction result corresponding to the target vector, where the second prediction result represents the training object corresponding to the target vector; generating a second loss function based on the labeling result and the second prediction result; obtaining a total loss function based on the first loss function and the second loss function; and adjusting the model parameters of the audio separation model based on the total loss function.
The auxiliary separation model may be composed of a linear mapping layer, a Tanh activation layer, and a normalization layer, as shown in the model on the right of FIG. 2. Its input is the spliced vector formed by concatenating the output vector of the first stacked BLSTM with the content embedding, and its output is the feature vector of the corresponding training object. The feature vectors here represent the vocal characteristics of the training object, such as vocal cords, mouth size, nasal cavity, and throat, and can be used for subsequent similarity comparison of the training object's audio. The loss function for training the auxiliary separation model can be a 2-norm loss, which measures the error between the output of the auxiliary separation model and the labeling result. Its calculation formula is as follows:
J_DC = || V·V^T − L′·(L′)^T ||_F^2

where J_DC denotes the second loss function; V = [v_1, …, v_T]^T denotes the output, with T the number of audio segments after cutting the training audio; L′ denotes the labeling result, a matrix of dimension T × 2^C, where C denotes the number of training objects and each row of L′ is one-hot (a single 1, all other entries 0). For example, if C = 2, a cut audio segment corresponds to one of 4 cases: 0: non-speech, 1: speaker 1, 2: speaker 2, 3: overlapping. If the first audio segment is silent, the first row of L′ is [1 0 0 0]; if the second segment is speaker 1, the second row is [0 1 0 0]; if the third segment is speaker 2, the third row is [0 0 1 0]; and if the fourth segment contains both speaker 1 and speaker 2, the fourth row is [0 0 0 1]. F denotes the type of norm, and F = 2 denotes the 2-norm.
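A minimal PyTorch sketch of this 2-norm loss, assuming the standard deep-clustering form implied by the variable descriptions above; the function name and shapes are illustrative.

    import torch

    def deep_clustering_loss(v, labels):
        """2-norm (deep-clustering style) loss between predicted and target affinities.

        v:      (T, D) embeddings output by the auxiliary separation model
        labels: (T, K) one-hot assignment matrix L' (K = 2**C classes)
        """
        vvt = v @ v.t()                           # (T, T) predicted affinity
        llt = labels @ labels.t()                 # (T, T) target affinity
        return torch.norm(vvt - llt, p="fro") ** 2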
In some embodiments, as shown in fig. 2, for training audio, after obtaining the spliced vector M through the foregoing steps, the spliced vector M may be input to the auxiliary separation model while being input to the second Stacked BLSTM, to obtain a corresponding second prediction result, that is, a feature vector of a corresponding training object, and obtain a corresponding second loss function through 2-norm loss, and by weighting and summing the two loss functions, a total loss function may be obtained, where a calculation formula is as follows:
J_MULTI = (1 − α)·J_PIT + α·J_DC

where J_PIT denotes the first loss function, J_MULTI denotes the total loss function, and α is a hyper-parameter used to balance the weights of the two loss functions, with a preferred value of 0.4.
Based on the total loss function, the model parameters of the audio separation model are updated with an existing optimization algorithm (such as stochastic gradient descent, the least squares method, and the like), so that the first stacked BLSTM can learn the knowledge of the auxiliary separation model, i.e. the spliced vector contains the feature vectors of the different training objects, and a high-performance audio separation model is obtained through training.
Through the above steps, the total loss function is computed during training of the audio separation model by combining the output of the auxiliary separation model, so that the trained audio separation model can learn the pronunciation characteristics of different training objects, accurately determine the audio segments belonging to the same user during voice separation, and thereby improve the accuracy of voice separation.
Optionally, adjusting the model parameters of the audio separation model based on the total loss function includes adjusting the model parameters of the audio separation model using a stochastic gradient descent algorithm based on the total loss function.
In some embodiments, after the total loss function is calculated, a stochastic gradient descent (SGD) algorithm may be used to compute the gradient of the loss function and update the model parameters of the audio separation model, iterating over multiple rounds until convergence. The implementation of the stochastic gradient descent algorithm is the same as in the prior art and is not described in detail here.
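A hedged sketch of one SGD training step that combines the two losses with α = 0.4; `model`, `aux_model`, and the loss helpers refer to the illustrative sketches above and are assumptions, not the patented code.

    import torch
    import torch.nn.functional as F

    # Illustrative auxiliary separation model: linear mapping + Tanh, with the
    # normalization applied in train_step; 640 matches the dim of M above.
    aux_model = torch.nn.Sequential(torch.nn.Linear(640, 32), torch.nn.Tanh())

    optimizer = torch.optim.SGD(
        list(model.parameters()) + list(aux_model.parameters()),
        lr=1e-3, momentum=0.9)

    def train_step(segment_feats, content_emb, activity_labels, onehot_labels, alpha=0.4):
        optimizer.zero_grad()
        pred, m = model(segment_feats, content_emb)     # separation output and spliced M (batch size 1 assumed)
        v = F.normalize(aux_model(m[0]), dim=-1)        # normalized auxiliary embeddings
        loss = (1 - alpha) * pit_bce_loss(pred[0], activity_labels) \
               + alpha * deep_clustering_loss(v, onehot_labels)
        loss.backward()
        optimizer.step()                                # stochastic gradient descent update
        return loss.item()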
Through the steps, the model parameters are adjusted through a random gradient descent algorithm, so that the effects of reducing learning time and improving training efficiency of the audio separation model are achieved.
Based on the above analysis, the present disclosure obtains pronunciation units and their time information through a forced alignment model and cuts the audio according to the content information combined with the time information. Unlike traditional equal-length cutting, this preserves complete content information, and adding the content information on top of the features of the user to whom each audio segment belongs makes the overall features more discriminative. An end-to-end speaker separation system is built on a permutation-invariant criterion; it supports a variable number of speakers (within the maximum number supported by the network) at inference time, has a simple overall structure, and achieves a better separation effect when speakers overlap. In addition, a deep-clustering auxiliary separation model is introduced, so that the dual loss function further improves the accuracy of voice separation.
It should be noted that the audio to be processed in this embodiment is not specific to any particular user and cannot reflect the personal information of any particular user. The collection, storage, and use of the audio data in this embodiment comply with relevant laws and regulations and do not violate public order and good customs.
According to an embodiment of the present disclosure, the present disclosure further provides an audio processing apparatus, which is used to implement the foregoing embodiment and the preferred implementation manner, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a schematic diagram of an audio processing apparatus according to the present disclosure, as shown in fig. 3, the apparatus includes an obtaining module 32 configured to obtain audio to be processed, where the audio to be processed includes initial audio data collected from a plurality of sound sources, the plurality of sound sources corresponding to a plurality of objects, an identifying module 34 configured to identify content of the audio to be processed, to obtain a content vector and time information corresponding to the content vector, and a separating module 36 configured to separate the audio to be processed based on the content vector and the time information, to obtain a separation result, where the separation result is used to determine target audio data corresponding to each of the plurality of objects from the initial audio data.
Optionally, the identification module comprises an identification unit for identifying the audio to be processed by using the forced alignment model to obtain text information and time information, and an extraction unit for extracting features of the text information by using the feature generation model to obtain a content vector.
Optionally, the content vector comprises a feature vector of a plurality of texts with preset granularity, the time information comprises time stamps of the plurality of texts, the separation module comprises a cutting unit and a separation unit, the cutting unit is used for cutting audio to be processed based on the time stamps of each text to obtain a plurality of target audios, and the separation unit is used for separating the plurality of target audios based on the feature vector of the plurality of texts by utilizing an audio separation model to obtain a separation result.
The audio separation model at least comprises a first layer of bidirectional long-short time memory model and a second layer of bidirectional long-short time memory model, and the separation unit is further used for inputting a plurality of target audios into the first layer of bidirectional long-short time memory model to be processed to obtain a first output vector, splicing the first output vector and feature vectors of a plurality of texts to obtain a spliced vector, and inputting the spliced vector into the second layer of bidirectional long-short time memory model to be processed to obtain a separation result.
The device comprises an acquisition module, a recognition module, a separation module and a processing module, wherein the acquisition module is further used for acquiring training samples, the training samples comprise training audio and marking results corresponding to the training audio, the training audio comprises audio data acquired from a plurality of training sound sources, the plurality of training sound sources correspond to a plurality of training objects, the recognition module is further used for carrying out content recognition on the training audio to obtain training vectors corresponding to the training audio and time information corresponding to the training vectors, the separation module is further used for separating the training audio based on the training vectors and the time information corresponding to the training vectors to obtain a first prediction result, the first prediction result is used for representing the probability of the training objects corresponding to the training vectors, the processing module is used for processing the marking results and the first prediction result to obtain a first loss function, and the adjustment module is used for adjusting model parameters of an audio separation model based on the first loss function.
The device comprises an acquisition module, a prediction module, a first generation module, a second generation module and an adjustment module, wherein the acquisition module is used for acquiring a target vector, the target vector is a vector of a second-layer bidirectional long-short-time memory model input to an audio separation model, the prediction module is used for predicting the target vector by utilizing an auxiliary separation model to obtain a second prediction result corresponding to the target vector, the second prediction result is used for representing a training object corresponding to the target vector, the first generation module is used for generating a second loss function based on a labeling result and the second prediction result, the second generation module is used for obtaining a total loss function based on the first loss function and the second loss function, and the adjustment module is used for adjusting model parameters of the audio separation model based on the total loss function.
Optionally, the adjusting module comprises an adjusting unit for adjusting model parameters of the audio separation model using a random gradient descent algorithm based on the total loss function.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 4 illustrates a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the apparatus 400 includes a computing unit 401 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of device 400 may also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The various components in the device 400 are connected to an I/O interface 405, including an input unit 406, e.g., keyboard, mouse, etc., an output unit 407, e.g., various types of displays, speakers, etc., a storage unit 408, e.g., magnetic disk, optical disk, etc., and a communication unit 409, e.g., network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the respective methods and processes described above, for example, an audio processing method. For example, in some embodiments, the audio processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When a computer program is loaded into RAM 403 and executed by computing unit 401, one or more steps of the audio processing method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the audio processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

CN202111302400.4A | Priority date: 2021-11-04 | Filing date: 2021-11-04 | Audio processing method and device, electronic device and readable storage medium | Active | CN114067793B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111302400.4A (CN114067793B) | 2021-11-04 | 2021-11-04 | Audio processing method and device, electronic device and readable storage medium

Publications (2)

Publication Number | Publication Date
CN114067793A (en) | 2022-02-18
CN114067793B (en) | 2025-07-04

Family

ID=80274034

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111302400.4A (Active, CN114067793B) | Audio processing method and device, electronic device and readable storage medium | 2021-11-04 | 2021-11-04

Country Status (1)

Country | Link
CN | CN114067793B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115116460B (en)* | 2022-06-17 | 2024-03-12 | Tencent Technology (Shenzhen) Co., Ltd. | Audio signal enhancement method, device, apparatus, storage medium and program product
CN115394297A (en)* | 2022-07-26 | 2022-11-25 | Beijing Sinovoice Technology Co., Ltd. | Speech recognition method, device, electronic equipment and storage medium
CN118072734A (en)* | 2024-01-30 | 2024-05-24 | China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd. | Speech recognition method, device, processor, memory and electronic device
CN118675507B (en)* | 2024-07-31 | 2024-10-22 | Tencent Technology (Shenzhen) Co., Ltd. | Training method of sound source positioning model, sound source object positioning method and related device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN107919133A (en)* | 2016-10-09 | 2018-04-17 | 赛谛听股份有限公司 | Speech enhancement system and speech enhancement method for a target object
CN111986680A (en)* | 2020-08-26 | 2020-11-24 | Tianjin Hongen Perfect Future Education Technology Co., Ltd. | Method and device for evaluating spoken language of object, storage medium and electronic device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
GB201114737D0 (en)* | 2011-08-26 | 2011-10-12 | Univ Belfast | Method and apparatus for acoustic source separation
US9558734B2 (en)* | 2015-06-29 | 2017-01-31 | Vocalid, Inc. | Aging a text-to-speech voice
CN108257592A (en)* | 2018-01-11 | 2018-07-06 | 广州势必可赢网络科技有限公司 | Human voice segmentation method and system based on long-term and short-term memory model
CN110136727B (en)* | 2019-04-16 | 2024-04-16 | Ping An Technology (Shenzhen) Co., Ltd. | Speaker identification method, device and storage medium based on speaking content
CN113066506B (en)* | 2021-03-12 | 2023-01-17 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Audio data separation method, device, electronic device and storage medium


Also Published As

Publication number | Publication date
CN114067793A (en) | 2022-02-18


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
