CN119360869B

Movatterモバイル変換

Info

Publication number: CN119360869B
Application number: CN202411565144.1A
Authority: CN
Inventors: 张结; 徐擎天; 凌震华
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2024-11-05
Filing date: 2024-11-05
Publication date: 2025-09-26
Anticipated expiration: 2044-11-05
Also published as: CN119360869A

Abstract

The application provides a nerve-guided voice enhancement method based on personalized brain electrode distribution, wherein a training method of a voice enhancement model in the voice enhancement method comprises the steps of obtaining a training set, wherein the training set comprises a plurality of training sample combinations corresponding to at least one training user, each training sample combination comprises real biological auxiliary information, mixed distribution auxiliary information, gaussian distribution auxiliary information, mixed training voice information and enhanced voice labels of the training user, the auxiliary information characterizes attention degree of the training user to a certain identified user in the mixed training voice information, the real biological auxiliary information, the mixed distribution auxiliary information, the Gaussian distribution auxiliary information and the mixed training voice information are utilized to conduct countermeasure training on a plurality of initial enhancement models to obtain a plurality of enhanced voice information, initial loss values are generated according to the enhanced voice information and the enhanced voice labels, and network parameters of the initial enhancement models are iteratively adjusted according to the initial loss values to obtain the voice enhancement model.

Description

Nerve guidance type voice enhancement method based on personalized brain electrode distribution

Technical Field

The present application relates to the field of speech signal processing technology, and more particularly, to a training method, a speech enhancement method, a training device, a speech enhancement device, an electronic apparatus, a computer-readable storage medium, and a computer program product for a neural-guided speech enhancement model based on personalized brain electrode distribution.

Background

Speech enhancement (SE, SPEECH ENHANCEMENT) aims at extracting the target speaker's speech signal from various audio signals mixed with noise interference. It is widely used in hearing aid devices (HAs, HEARING AIDS) and is the core algorithm for hearing aids. When the noise is from other speakers, it is also called Target Speaker Extraction (TSE).

The conventional TSE method is implemented by means of auxiliary information containing information of the targeted speaker, and the main auxiliary information includes preregistered audio, visual assistance, spatial information assistance and the like. However, it is difficult to acquire these auxiliary information in a practical application scenario. With the recent development of brain science and human brain auditory system research, researchers have found that auditory attention information, such as a target speech envelope, can be extracted from biological auxiliary information of a listener, such as brain electrical signals (EEG, electroencephalogram). This facilitates route generation using EEG signals to assist the TSE. Initially, researchers have extracted target speech or non-end-to-end solutions based on similarity by first doing envelope extraction and separation. This does not achieve a desired result and it is not necessary to separate first. To overcome these problems, other researchers have proposed better end-to-end models that utilize special feature fusion modules (FiLM, feature-WISE LINEAR modeling).

In brain-assisted speech enhancement practical applications, too many electrodes do not greatly improve performance, but rather may cause excessive cost. This makes electrode channel selection more useful in this field. For example, the related art proposes an end-to-end EEG channel selection method based on a special profile (Gumbel-Softmax Distribution). But it has a problem of difficulty in training and a problem of repeated passage. While another related art adds a weight residual connection to improve training stability, the duplicate channel selection problem is still not addressed. Another related art proposes a new approach based on attention and special constraint functions to implement an approach that does not contain duplicate channels and does not require explicit specification of the number of selected channels. However, the current method does not pay attention to the problem of personalized selection, and the selected result is unchanged for all users, so that the problem of large variance is caused.

Disclosure of Invention

In view of this, the present application provides a training method, a speech enhancement method, a training apparatus, a speech enhancement apparatus, an electronic device, a computer-readable storage medium and a computer program product for a neural guided speech enhancement model based on personalized brain electrode distribution.

One aspect of the present application provides a training method of a neural guided speech enhancement model based on personalized brain electrode distribution, comprising:

Responding to a model training instruction, acquiring a training set, wherein the training set comprises a plurality of training sample combinations corresponding to at least one training user, the training sample combinations comprise real biological auxiliary information, mixed distribution auxiliary information, gaussian distribution auxiliary information, mixed training voice information and enhanced voice labels of the training user, and any auxiliary information comprises electroencephalogram information representing the attention degree of the training user to a certain identification user in the mixed training voice information;

aiming at each training sample combination, performing countermeasure training on a plurality of initial enhancement models by using the real biological auxiliary information, the mixed distribution auxiliary information, the Gaussian distribution auxiliary information and the mixed training voice information to obtain a plurality of enhancement voice information;

generating an initial loss value according to the enhanced voice information and the enhanced language tag for each piece of the enhanced voice information;

And iteratively adjusting network parameters of the initial enhancement model according to a plurality of initial loss values to obtain a trained voice enhancement model.

According to an embodiment of the present application, performing countermeasure training on a plurality of initial enhancement models using the real biological auxiliary information, the mixed distribution auxiliary information, the gaussian distribution auxiliary information, and the mixed training voice information to obtain a plurality of enhanced voice information, including:

Training a first initial enhancement model by utilizing the real biological auxiliary information and the mixed training voice information to obtain first enhancement voice information;

Training a second initial enhancement model by utilizing the mixed distribution auxiliary information and the mixed training voice information to obtain second enhancement voice information;

training a third initial enhancement model by utilizing the Gaussian distribution auxiliary information and the mixed training voice information to obtain third enhancement voice information;

And training a fourth initial enhancement model by using the mixed training voice information to obtain fourth enhancement voice information.

According to an embodiment of the present application, generating the enhanced speech information using any of the initial enhancement models includes:

processing the target auxiliary information by using a target selection module to obtain a biological feature vector, wherein the target auxiliary information comprises the real biological auxiliary information, the mixed distribution auxiliary information or the Gaussian distribution auxiliary information;

and processing the biological characteristic vector, the mixed training voice information and the target auxiliary information by utilizing a voice enhancement module to generate the enhanced voice information.

According to an embodiment of the present application, processing the above-mentioned target auxiliary information by using a target selection module to obtain a biometric vector includes:

Processing the target auxiliary information by using a plurality of depth separation convolution units to obtain a plurality of biological convolution characteristics;

Processing a plurality of biological convolution characteristics by using a linear layer to obtain biological linear characteristics;

And generating the biological characteristic vector according to the biological linear characteristic and the training parameters of the adaptive neurons.

According to an embodiment of the present application, the processing of the above target auxiliary information by using a plurality of depth separation convolution units to obtain a plurality of biological convolution features includes:

Performing depth separation convolution processing on the target auxiliary information aiming at each depth separation convolution unit to obtain a first convolution characteristic;

Filling the first convolution feature to obtain a second convolution feature;

and carrying out pooling treatment on the second convolution characteristic to obtain the biological convolution characteristic.

According to an embodiment of the present application, processing the biometric vector, the mixed training speech information, and the target auxiliary information by a speech enhancement module to generate the enhanced speech information includes:

Generating a biological feature to be processed according to the biological feature vector and the target auxiliary information;

and processing the biological characteristics to be processed and the mixed training voice information by using a convolution time domain separation network to obtain the enhanced voice information.

According to an embodiment of the present application, processing the to-be-processed biological feature and the mixed training speech information by using a convolutional time domain separation network to obtain the enhanced speech information includes:

Respectively carrying out coding treatment on the biological characteristics to be treated and the mixed training voice information to obtain biological coding characteristics and language coding characteristics;

Performing sound source separation processing on the biological coding features and the language coding features to obtain voice mask information;

generating target coding information according to the voice mask information and the language coding characteristics;

And decoding the target coding information to obtain the enhanced voice information.

According to an embodiment of the present application, generating an initial loss value for each of the enhanced speech information according to the enhanced speech information and the enhanced language tag includes:

generating a signal distortion loss value according to the enhanced voice information and the enhanced language tag for the first enhanced voice information;

generating a first confusion loss value according to the enhanced voice information and the enhanced language tag aiming at the second enhanced voice information;

Generating a second confusion loss value according to the enhanced voice information and the enhanced language tag aiming at the third enhanced voice information;

And generating a third confusion loss value according to the enhanced voice information and the enhanced language tag aiming at the fourth enhanced voice information.

According to an embodiment of the present application, iteratively adjusting network parameters of the initial enhancement model according to a plurality of the initial loss values to obtain a trained speech enhancement model, including:

generating a first loss value based on the signal distortion loss value and the first aliasing loss value;

Generating a second loss value according to the first loss value and the selected mean square error loss;

generating a target loss value based on the second loss value, the second aliasing loss value, and the third aliasing loss value;

And iteratively adjusting network parameters of the initial enhancement model according to the target loss value to obtain the voice enhancement model.

Another aspect of the present application provides a neural guided speech enhancement method based on personalized brain electrode distribution, comprising:

Responding to a model training instruction, acquiring biological auxiliary data and mixed voice data of a target user, wherein the biological auxiliary data comprises brain electrical information representing the attention degree of the target user to a certain attention user in the mixed voice data;

And inputting the biological auxiliary data and the mixed voice data into a voice enhancement model, and outputting enhanced voice data, wherein the voice of the concerned user in the enhanced voice data is enhanced.

Another aspect of the present application provides a training device for a neural guided speech enhancement model based on personalized brain electrode distribution, comprising:

The first acquisition module is used for responding to a model training instruction to acquire a training set, wherein the training set comprises a plurality of training sample combinations corresponding to at least one training user, the training sample combinations comprise real biological auxiliary information, mixed distribution auxiliary information, gaussian distribution auxiliary information, mixed training voice information and enhanced voice labels of the training user, and any auxiliary information comprises electroencephalogram information representing the attention degree of the training user to a certain recognition user in the mixed training voice information;

The countermeasure training module is used for performing countermeasure training on a plurality of initial enhancement models by utilizing the real biological auxiliary information, the mixed distribution auxiliary information, the Gaussian distribution auxiliary information and the mixed training voice information aiming at each training sample combination to obtain a plurality of enhancement voice information;

the generating module is used for generating an initial loss value according to the enhanced voice information and the enhanced language tag for each piece of the enhanced voice information;

and the adjusting module is used for iteratively adjusting the network parameters of the initial enhancement model according to a plurality of initial loss values to obtain a trained voice enhancement model.

Another aspect of the present application provides a nerve-guided speech enhancement device based on personalized brain electrode distribution, comprising:

The second acquisition module is used for responding to the model training instruction and acquiring the biological auxiliary data and the mixed voice data of the target user, wherein the biological auxiliary data comprises brain electrical information representing the attention degree of the target user to a certain attention user in the mixed voice data;

and the enhancement module is used for inputting the biological auxiliary data and the mixed voice data into a voice enhancement model and outputting enhanced voice data, wherein the voice of the concerned user in the enhanced voice data is enhanced.

Another aspect of the present application provides an electronic device, comprising:

One or more processors;

a memory for storing one or more programs,

Wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods as described above.

Another aspect of the application provides a computer readable storage medium storing computer executable instructions that when executed are to implement a method as described above.

Another aspect of the application provides a computer program product comprising computer executable instructions which when executed are for implementing a method as described above.

According to the embodiment of the application, the plurality of initial enhancement models are subjected to countermeasure training by using the real biological auxiliary information, the mixed distribution auxiliary information, the Gaussian distribution auxiliary information and the mixed training voice information to obtain a plurality of enhancement voice information, initial loss values are generated according to the enhancement voice information and the enhancement language labels, and network parameters of the initial enhancement models are iteratively adjusted according to the plurality of initial loss values to obtain the trained voice enhancement models. The voice enhancement model only needs single type biological auxiliary information, so that the number of electrodes is reduced, and meanwhile, the voice enhancement model trained by using the biological auxiliary information can respond to personalized selection of users, so that voice of users concerned is enhanced, and the abnormal memory problem of the voice enhancement model in the related technology is effectively avoided.

Drawings

The above and other objects, features and advantages of the present application will become more apparent from the following description of embodiments of the present application with reference to the accompanying drawings, in which:

FIG. 1 illustrates an exemplary system architecture for a training method or a speech enhancement method to which a speech enhancement model may be applied, according to an embodiment of the present application;

FIG. 2 illustrates a flow chart of a method of training a speech enhancement model according to an embodiment of the application;

FIG. 3 shows a flow chart of a training method of a speech enhancement model according to another embodiment of the application;

FIG. 4 illustrates a flow chart of a method of generating enhanced speech information according to an embodiment of the application;

FIG. 5 shows a process flow diagram of a target selection module according to an embodiment of the application;

FIG. 6 is a schematic diagram showing experimental comparative analysis results of a speech enhancement model of the present application with no challenge training model according to an embodiment of the present application;

FIG. 7 shows a flow chart of a training method of a speech enhancement model according to an embodiment of the application;

FIG. 8 shows a block diagram of a training apparatus of a speech enhancement model according to an embodiment of the application;

FIG. 9 shows a block diagram of a speech enhancement apparatus according to an embodiment of the application, and

Fig. 10 shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the application.

Detailed Description

Hereinafter, embodiments of the present application will be described with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the application. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the application. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present application.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a convention should be interpreted in accordance with the meaning of one of skill in the art having generally understood the convention (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).

In embodiments of the present application, the data involved (e.g., including but not limited to user personal information) is collected, updated, analyzed, processed, used, transmitted, provided, disclosed, stored, etc., all in compliance with relevant legal regulations, used for legal purposes, and without violating the public welfare. In particular, necessary measures are taken for personal information of the user, illegal access to personal information data of the user is prevented, and personal information safety and network safety of the user are maintained.

In embodiments of the present application, the user's authorization or consent is obtained before the user's personal information is obtained or collected.

FIG. 1 illustrates an exemplary system architecture 100 in which a training method or speech enhancement method of a speech enhancement model may be applied in accordance with an embodiment of the present application. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present application may be applied to help those skilled in the art understand the technical content of the present application, and does not mean that the embodiments of the present application may not be used in other devices, systems, environments, or scenarios.

As shown in fig. 1, a system architecture 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.

The user may interact with the server 105 via the network 104 using the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages etc. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, and/or social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, the third terminal device 103.

The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely. There may be any number of terminal devices, networks, and servers, as desired for implementation.

FIG. 2 shows a flow chart of a method of training a speech enhancement model according to an embodiment of the application.

As shown in FIG. 2, the training method of the nerve guidance type voice enhancement model based on personalized brain electrode distribution comprises operations S201-S204.

In operation S201, in response to a model training instruction, a training set is obtained, where the training set includes a plurality of training sample combinations corresponding to at least one training user, the training sample combinations include real biological auxiliary information, mixed distribution auxiliary information, gaussian distribution auxiliary information, mixed training voice information, and enhanced voice tags of the training user, and any auxiliary information includes electroencephalogram information characterizing a degree of attention of the training user to a certain recognition user in the mixed training voice information;

in operation S202, for each training sample combination, performing countermeasure training on a plurality of initial enhancement models by using the real biological auxiliary information, the mixed distribution auxiliary information, the gaussian distribution auxiliary information and the mixed training voice information, to obtain a plurality of enhancement voice information;

Generating an initial loss value according to the enhanced voice information and the enhanced language tag for each enhanced voice information in operation S203;

In operation S204, network parameters of the initial enhancement model are iteratively adjusted according to the plurality of initial loss values, resulting in a trained speech enhancement model.

According to an embodiment of the present application, the electroencephalogram information refers to electroencephalogram (EEG) assistance information, in which the electroencephalogram information can also be replaced with Electromyographic (EMG) assistance information, electrocardiographic (ECG) assistance information, electrooculogram (EOG) assistance information, etc., the present application is exemplified with electroencephalogram (EEG) assistance information, for example, when a training user is engaged in a meeting, the training user focuses on what is said by the speaker a, and at this time, the EEG signal of the training user can be collected as real bio-auxiliary information, and the voices doped with other speakers in the meeting place can be collected as mixed training voice information.

According to the embodiment of the application, the Gaussian distribution auxiliary information refers to auxiliary information obtained by carrying out Gaussian distribution processing on the real biological auxiliary information. The mixed distribution auxiliary information refers to auxiliary information obtained by carrying out Gaussian distribution and uniform distribution processing on the real biological auxiliary information. Enhanced speech tagging refers to the use of enhanced speech as a tag.

According to the embodiment of the application, for each training sample combination, the real biological auxiliary information, the mixed distribution auxiliary information, the Gaussian distribution auxiliary information, the mixed training voice information and the enhanced voice label in the combination are utilized to simultaneously conduct countermeasure training on a plurality of initial enhancement models, and at the moment, each initial enhancement model outputs enhanced voice information. According to the enhanced voice information and the enhanced language label, an initial loss value corresponding to the initial enhancement model can be generated, and according to the initial loss value synchronization of the initial enhancement models, the network parameters of the initial enhancement models are adjusted, so that the trained voice enhancement model can be obtained.

FIG. 3 shows a flow chart of a method of training a speech enhancement model according to another embodiment of the application.

As shown in fig. 3, performing countermeasure training on the plurality of initial enhancement models by using the real bio-assist information, the mixed distribution assist information, the gaussian distribution assist information, and the mixed training voice information to obtain a plurality of enhanced voice information, including:

using real bio-assist informationAnd mixing training speech informationTraining a first initial enhancement model to obtain first enhanced voice information;

using hybrid distribution assistance informationAnd mixing training speech informationTraining the second initial enhancement model to obtain second enhanced voice information;

using gaussian distributed side informationAnd mixing training speech informationTraining the third initial enhancement model to obtain third enhanced voice information;

Training speech information using a mixtureTraining the fourth initial enhancement model to obtain fourth enhanced voice information.

According to the embodiment of the application, the problem of excessive memory of the model is solved through multi-process countermeasure training, and the model is prevented from memorizing the target speaker. As shown in fig. 3, the parameters are shared by multiple leg models (i.e., initial enhancement models), and three new legs replace the original EEG signal with various nulls (as in fig. 2、Without auxiliary information) The situation of the target speaker may be remembered by the simulation model and suppressed by the confusion loss. The aliasing penalty is the square of the SI-SDR.

Fig. 4 shows a flow chart of a method of generating enhanced speech information according to an embodiment of the application.

According to an embodiment of the present application, as shown in fig. 4, generating enhanced speech information using any of the initial enhancement models includes:

processing target auxiliary information by using a target selection module to obtain a biological feature vector, wherein the target auxiliary information comprises real biological auxiliary information, mixed distribution auxiliary information or Gaussian distribution auxiliary information;

And processing the biological characteristic vector, the mixed training voice information and the target auxiliary information by utilizing the voice enhancement module to generate enhanced voice information.

According to the embodiment of the application, the voice enhancement model based on biological assistance mainly comprises two parts, namely a target selection module for personalized selection and a subsequent voice enhancement module.

According to an embodiment of the application, after inputting the target auxiliary information into the initial enhancement model, the target selection module in the initial enhancement model first converts the target auxiliary information into a biological feature vector in which attention information of the training user is implicit. The enhanced speech information, i.e., the enhanced speech of fig. 4, is then obtained by processing the biometric vector, the hybrid training speech information, and the target auxiliary information using the speech enhancement module.

FIG. 5 shows a process flow diagram of a target selection module according to an embodiment of the application.

According to an embodiment of the present application, as shown in fig. 5, processing target auxiliary information by using a target selection module to obtain a biometric vector includes:

a biometric vector is generated based on the biometric characteristic and the training parameters of the adaptive neuron.

According to an embodiment of the application, in FIG. 5I.e. the training parameters of the adaptive neurons,As a result of the bio-linear characteristic,I.e. a biometric vector.

According to an embodiment of the present application, in order to capture individual differences, the model incorporates adaptive neurons. These neurons, through training, generate individual difference vectors that are combined with the generic selection vector that the target selection module gets at the time of pre-training to form the final individual adaptive EEG channel selection vector. The output of the adaptive neurons is limited to the range of [ -1, 1] by the activation function, ensuring that the individual differences captured are appropriate and controllable. Wherein the generic selection vector is generated by pre-training a generic selection model that is independent of individual differences (i.e. the initial enhancement model) on the generic data set and freezing its generic selection layer (i.e. the target selection module). In order to keep the output stable when different signals are input into the model, the input of the general selection layer is replaced by a full vector, and the module is ensured to be irrelevant to the input. In this way, the pre-trained model can output a stable generic selection vector, called "average fitness", which represents the most informative EEG signal path among multiple subjects.

According to an embodiment of the present application, the training process is divided into two phases, the first phase is individual difference learning and the second phase is selection fine tuning.

According to an embodiment of the present application, as in fig. 5 (a), in a first stage, the model learns each individual difference tested by adaptive neurons and combines it with the average fitness to generate an individual adaptive EEG channel selection vector (i.e. a biometric vector). In this process, the model optimizes a loss function containing speech enhancement related losses and regularization terms to ensure that the model can accurately capture individual differences while maintaining a certain channel selection stability. The generated EEG channel selection vector is used to construct a subset of EEG signals for each test and is gradually optimized in multiple iterations until the loss function converges.

According to an embodiment of the application, as shown in fig. 5 (b), the second stage is a fine tuning stage, where the model freezes the adaptive neurons and further obtains the final adaptive EEG channel selection scheme by a threshold determiner. The threshold determiner will determine whether to select a corresponding channel based on the value of the individual difference vector, ultimately forming an EEG channel selection scheme for each subject. At this stage, the model's goal is to further optimize the goal selection module by incorporating a convolutional time domain separation network base to enhance the performance of the speech enhancement model.

According to an embodiment of the present application, as shown in fig. 5, processing target auxiliary information by using a plurality of depth separation convolution units to obtain a plurality of biological convolution features includes:

Filling the first convolution feature to obtain a second convolution feature;

According to the embodiment of the application, the number of the depth separation convolution units can be specifically set according to requirements, for example, the number of the depth separation convolution units can be 8. The sample vector in fig. 5 is the target auxiliary information.

According to an embodiment of the present application, processing a biometric vector, mixing training speech information, and target auxiliary information using a speech enhancement module to generate enhanced speech information includes:

And processing the biological characteristics to be processed and the mixed training voice information by using a convolution time domain separation network to obtain enhanced voice information.

According to an embodiment of the present application, for a biometric vectorPerforming product processing with target auxiliary information y (t) to generate a to-be-processed biological feature, and inputting the to-be-processed biological feature and the mixed training voice information x (t) into a convolution time domain separation network BASEN to obtain enhanced voice information, namely, in FIG. 5. Wherein, the convolution time domain separation network BASEN is Conv-TasNet network.

According to an embodiment of the present application, referring to fig. 4, processing biological characteristics to be processed and mixed training voice information by using a convolution time domain separation network to obtain enhanced voice information includes:

respectively encoding the biological characteristics to be processed and the mixed training voice information to obtain biological encoding characteristicsAnd language encoding features;

For biological coding featuresAnd language encoding featuresPerforming sound source separation processing to obtain voice mask information;

Based on speech mask informationAnd language encoding featuresGenerating target coding information;

According to an embodiment of the application, an encoder that mixes training speech information extracts embedded features of audio by multi-layer one-dimensional convolution, converting the input audio signal into a representation that can be better processed.

According to an embodiment of the application, an encoder of the biometric feature to be processed downsamples the biometric feature to be processed, such as an EEG signal, and extracts multi-level features by deep convolution layers for capturing auditory attention information in the brain signal.

According to an embodiment of the application, the purpose of the separator used for sound source separation is to predict the speech mask of the target speaker in combination with the embedded features of the audio and EEG signals. This process relies on deep convolution layers and cross-layer attention mechanisms to achieve deep fusion of audio and EEG features.

According to an embodiment of the present application, a decoder for decoding target encoded information reconstructs a speech signal of a target speaker using a mask generated by a separator in combination with audio features. The overall architecture realizes the task of separating and reconstructing voice signals by effectively fusing multi-mode information.

According to an embodiment of the present application, generating an initial loss value for each enhanced speech information from the enhanced speech information and the enhanced language tag includes:

For the first enhanced speech information, according to the enhanced speech informationAnd enhancing the language tag to generate a signal distortion loss value SI-SDR;

For the second enhanced speech information, based on the enhanced speech informationAnd enhancing the language tag to generate a first confusion loss value;

for the third enhanced speech information, based on the enhanced speech informationAnd enhancing the language tag to generate a second confusion loss value;

for the fourth enhanced speech information, according to the enhanced speech informationAnd enhancing the language tag to generate a third confusion loss value.

According to an embodiment of the present application, iteratively adjusting network parameters of an initial enhancement model according to a plurality of initial loss values to obtain a trained speech enhancement model, comprising:

Generating a first loss value according to the signal distortion loss value and the first aliasing loss value;

Generating a target loss value according to the second loss value, the second confusion loss value and the third confusion loss value;

According to an embodiment of the application, a special constraint function is used to train the target selection layer:

Wherein the method comprises the steps ofThe loss function is the SI-SDR,The loss function is for the selection vectorThe square of the loss, i.e. the mean square error loss is chosen,The loss function is the square of the SI-SDR output by three countermeasure branches, and can effectively restrain the model from memorizing the phenomenon of the target speaker.

In a specific embodiment, the application considers the problem of electrode channel redundancy in the brain-assisted speech enhancement field in the related technology, and the cost and the loss are overlarge because of a plurality of electrodes, but the traditional electrode channel selection method cannot effectively realize personalized selection, so that the final enhancement variance is overlarge. Therefore, the application designs a new voice enhancement method and a personalized voice enhancement model to realize the bio-assisted voice enhancement of personalized electrode distribution, and utilizes a new countermeasure training method to solve the problem of abnormal memory of the model. The method comprises the following steps:

(1) Experimental setup

In performance assessment, three objective indicators are used to measure the overall quality of the enhanced speech signal, including SI-SDR (decibel) for assessing signal to distortion ratio, speech quality perception assessment (PESQ), and short-term objective intelligibility (STOI). These metrics enable a comprehensive assessment of the quality, intelligibility and clarity of the enhanced speech. The application also designs an index for measuring the abnormal memory of the model, namely excessive memory (OM).

In pre-training the generic selection model, a generic data set containing 26 subjects was used. An Adam optimizer is used in training, and the learning rate is set to 0.0001. Meanwhile, in order to optimize the Deep Neural Network (DNN) weight, an exponential moving average method was used, and the decay rate was set to 0.999. To obtain a corresponding EEG channel selection, regularization term weight γ is set to 0.05. The total training period for pre-training was 60 rounds with a batch size set to 8.

In the individual self-adaptive training, 1000 rounds of individual variation learning are performed first, and then 1000 rounds of selection fine adjustment are performed. When fine tuning is selected, the model is ensured to stop training at the proper time point by monitoring the loss function of the validation set and setting an early-stop mechanism of 10 rounds. All models had converged before the maximum training round number was reached. Other settings for individual adaptive training are consistent with the pre-training phase.

The training set used for the evaluation of the application comprises 33 testers (28 men and 5 women), the average age is 27.3+/-3.2 years, all testers are normal hearing people with English in the mother language, and no nervous system disease history exists. Since the 6 th subject has poor data quality, its recorded data is excluded.

Each subject participated in 30 experiments, each lasting 60 seconds. The audio stimulus material is a different story that two men read. In each experiment, one story was played through the left ear and the other story was played through the right ear. Half of the subjects were asked to focus on the left-ear story (left-side attention), and the remaining subjects (including excluded subjects) focused on the right-ear story (right-side attention). After each experiment has ended, the subject needs to answer the choice questions to confirm that his attention to the indicated story is valid. To ensure the consistency of the stories, the stories in each experiment succeed the playing progress of the last experiment. In order to reduce other interference with the electroencephalogram, the subject is required to have vision centered on a crosshair in the center of the screen.

During the experiment, the subject worn a 128-channel (plus two mastoid electrodes) EEG cap and the brain electrical signals were recorded using BioSemi ActiveTwo system at a sampling rate of 512 Hz. To keep pace with previous studies, the EEG data is downsampled to 128 Hz. The audio stimulus was played through SENNHEISER HD headphones with a sampling rate of 44.1 kHz. To avoid the impact of audio intensity on attention, the RMS amplitudes of all audio stimuli are normalized.

Both audio and electroencephalographic data are processed to 14.7KHZ, with mixed noise audio being generated from equal intensity mixing of audio of interest and audio not of interest. The training set is divided into three groups, namely 5 experiments are randomly selected from all testees for testing, 2 experiments are selected for verification, and the rest of experiments are used for training. For the training set and validation set, each experiment was cut into 2 second segments, and for the test set, each 60 second experiment was cut into 20 second segments. Since individual adaptive training is required, we further divide the dataset by subject. 3 subjects were randomly selected from each of the left and right attention groups for individual adaptive training. The remaining 26 subjects were used for general training of the model. In the individual adaptive training of the 6 subjects, the data of 26 subjects are inaccessible, and the individual adaptive training is ensured to be independent of the general training data.

(2) Experimental results

Fig. 6 shows the results of comparative analysis of the speech enhancement model of the present application (the "choice of challenge training" in the figure) and the experiment without challenge training model, where the experimental results are shown in the form of violin diagrams, which can well reflect the distribution and range of the experimental results. From the figure, it can be seen that the speech enhancement model of the present application well achieves brain-assisted speech enhancement of personalized electrode distribution, which is comparable in performance to the full channel brain-assisted speech enhancement model and due to the full channel mainstream model UBESD. The model of the application can still maintain the performance advantage after the countermeasure training is introduced, and the variance is well reduced.

According to the embodiment of the application, in table 1, the application evaluates the branches of the countermeasure training, fully verifies the performance of the model of the application, and effectively reduces the abnormal memory phenomenon by combining the three countermeasure branches, and remembers the situation of the target speaker.

TABLE 1 evaluation against three challenge training branches

FIG. 7 shows a flow chart of a training method of a speech enhancement model according to an embodiment of the application.

As shown in FIG. 7, the nerve guidance type voice enhancement method based on personalized brain electrode distribution comprises S701-S702.

In operation S701, in response to a model training instruction, bio-assist data and mixed voice data of a target user are acquired, wherein the bio-assist data includes brain electrical information characterizing a degree of attention of the target user to a certain user of interest in the mixed voice data;

in operation S702, the bio-assist data and the mixed voice data are input to a voice enhancement model, and enhanced voice data in which voice of a user of interest is enhanced is output.

According to the embodiment of the application, in a scene that a target user speaks in multiple persons, the biological auxiliary data such as EEG can be detected in real time through an electrode and the like, the biological auxiliary data and the mixed voice data are processed in real time by a voice enhancement model, and when a certain speaker is focused in the biological auxiliary data of the target user, the voice of the speaker in the mixed voice data can be enhanced by the voice enhancement model, so that the enhanced voice data of the focused user can be output to the target user.

FIG. 8 shows a block diagram of a training apparatus for a speech enhancement model according to an embodiment of the application.

As shown in fig. 8, the training apparatus 800 of the neural guided speech enhancement model based on personalized brain electrode distribution includes a first acquisition module 810, an countermeasure training module 820, a generation module 830, and an adjustment module 840.

A first obtaining module 810, configured to obtain a training set in response to a model training instruction, where the training set includes a plurality of training sample combinations corresponding to at least one training user, the training sample combinations including real biological auxiliary information, mixed distributed auxiliary information, gaussian distributed auxiliary information, mixed training voice information, and enhanced voice tags of the training user, and any auxiliary information includes electroencephalogram information characterizing a degree of attention of the training user to a certain identified user in the mixed training voice information;

The countermeasure training module 820 is configured to perform countermeasure training on the plurality of initial enhancement models by using the real biological auxiliary information, the mixed distribution auxiliary information, the gaussian distribution auxiliary information and the mixed training voice information for each training sample combination, so as to obtain a plurality of enhancement voice information;

a generating module 830, configured to generate, for each enhanced speech information, an initial loss value according to the enhanced speech information and the enhanced language tag;

the adjusting module 840 is configured to iteratively adjust network parameters of the initial enhancement model according to the plurality of initial loss values, to obtain a trained speech enhancement model.

Fig. 9 shows a block diagram of a speech enhancement apparatus according to an embodiment of the application.

As shown in fig. 9, the nerve guidance type voice enhancement device 900 based on personalized brain electrode distribution includes a second acquisition module 910 and an enhancement module 920.

A second obtaining module 910, configured to obtain, in response to a model training instruction, bio-assist data and mixed speech data of a target user, where the bio-assist data includes electroencephalogram information that characterizes a degree of attention of the target user to a user of interest in the mixed speech data;

The enhancement module 920 is configured to input the bio-assist data and the mixed voice data into a voice enhancement model, and output enhanced voice data, wherein voice of a user in question in the enhanced voice data is enhanced.

Any number of the modules, sub-modules, units, sub-units, or at least part of the functionality of any number of the sub-units according to embodiments of the application may be implemented in one module. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present application may be implemented as a split into multiple modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the application may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), programmable Logic Array (PLA), system-on-chip, system-on-substrate, system-on-package, application Specific Integrated Circuit (ASIC), or in hardware or firmware in any other reasonable manner of integrating or packaging circuitry, or in any one of, or in any suitable combination of, software, hardware, and firmware. Or one or more of the modules, sub-modules, units, sub-units according to embodiments of the application may be at least partly implemented as computer program modules which, when run, may perform the corresponding functions.

For example, any number of the first acquisition module 810, the countermeasure training module 820, the generation module 830, the adjustment module 840, or the second acquisition module 910, the enhancement module 920 may be combined in one module/unit/sub-unit, or any one of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Or at least some of the functionality of one or more of these modules/units/sub-units may be combined with at least some of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the application, at least one of the first acquisition module 810, the countermeasure training module 820, the generation module 830, the adjustment module 840, or the second acquisition module 910, the enhancement module 920 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or as any one of or a suitable combination of any of the three. Or at least one of the first acquisition module 810, the countermeasure training module 820, the generation module 830, the adjustment module 840, or the second acquisition module 910, the enhancement module 920 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.

Fig. 10 shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the application. The electronic device shown in fig. 10 is merely an example, and should not impose any limitation on the functionality and scope of use of embodiments of the present application.

As shown in fig. 10, an electronic device 1000 according to an embodiment of the present application includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1001 may also include on-board memory for caching purposes. The processor 1001 may include a single processing unit or a plurality of processing units for performing different actions of the method flow according to an embodiment of the application.

In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiment of the present application by executing programs in the ROM 1002 and/or the RAM 1003. Note that the program may be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of the method flow according to an embodiment of the present application by executing programs stored in the one or more memories.

According to an embodiment of the application, the electronic device 1000 may further comprise an input/output (I/O) interface 1005, the input/output (I/O) interface 1005 also being connected to the bus 1004. The electronic device 1000 may also include one or more of an input section 1006 including a keyboard, a mouse, etc., an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc., a storage section 1008 including a hard disk, etc., and a communication section 1009 including a network interface card, such as a LAN card, a modem, etc., connected to an input/output (I/O) interface 1005. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to an input/output (I/O) interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.

According to an embodiment of the present application, the method flow according to an embodiment of the present application may be implemented as a computer software program. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. The above-described functions defined in the system of the embodiment of the present application are performed when the computer program is executed by the processor 1001. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the application.

The present application also provides a computer-readable storage medium that may be included in the apparatus/device/system described in the above embodiments, or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present application.

According to an embodiment of the present application, the computer-readable storage medium may be a nonvolatile computer-readable storage medium. Such as, but not limited to, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

For example, according to embodiments of the application, the computer-readable storage medium may include ROM 1002 and/or RAM 1003 described above and/or one or more memories other than ROM 1002 and RAM 1003.

Embodiments of the present application also include a computer program product comprising a computer program comprising program code for performing the methods provided by the embodiments of the present application, when the computer program product is run on an electronic device, for causing the electronic device to carry out the methods provided by the embodiments of the present application.

The above-described functions defined in the system/apparatus of the embodiment of the present application are performed when the computer program is executed by the processor 1001. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the application.

In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of signals on a network medium, distributed, and downloaded and installed via the communication section 1009, and/or installed from the removable medium 1011. The computer program may comprise program code that is transmitted using any appropriate network medium, including but not limited to wireless, wireline, etc., or any suitable combination of the preceding.

According to embodiments of the present application, program code for carrying out computer programs provided by embodiments of the present application may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, such as Java, c++, python, "C" or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. Those skilled in the art will appreciate that the features recited in the various embodiments of the application can be combined and/or combined in a variety of ways, even if such combinations or combinations are not explicitly recited in the present application. In particular, the features recited in the various embodiments of the application can be combined and/or combined in various ways without departing from the spirit and teachings of the application. All such combinations and/or combinations fall within the scope of the application.

The embodiments of the present application are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present application. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The present application may be modified in various ways by those skilled in the art without departing from the scope of the application, and such modifications and alterations should fall within the scope of the application.

Claims

1. A training method of a neural guided speech enhancement model based on personalized brain electrode distribution, comprising:

Responding to a model training instruction, acquiring a training set, wherein the training set comprises a plurality of training sample combinations corresponding to at least one training user, the training sample combinations comprise real biological auxiliary information, mixed distribution auxiliary information, gaussian distribution auxiliary information, mixed training voice information and enhanced voice labels of the training user, and any auxiliary information comprises electroencephalogram information representing attention degree of the training user to a certain identification user in the mixed training voice information;

Aiming at each training sample combination, performing countermeasure training on a plurality of initial enhancement models by utilizing the real biological auxiliary information, the mixed distribution auxiliary information, the Gaussian distribution auxiliary information and the mixed training voice information to obtain a plurality of enhancement voice information;

Generating an initial loss value according to the enhanced voice information and the enhanced voice tag for each piece of enhanced voice information;

2. The method of claim 1, wherein using the real biological side information, the mixed distribution side information, the gaussian distribution side information, and the mixed training speech information to counter-train a plurality of initial enhancement models to obtain a plurality of enhanced speech information, comprising:

And training a fourth initial enhancement model by utilizing the mixed training voice information to obtain fourth enhancement voice information.

3. The method of claim 2, wherein generating the enhanced speech information using any of the initial enhancement models comprises:

Processing target auxiliary information by using a target selection module to obtain a biological feature vector, wherein the target auxiliary information comprises the real biological auxiliary information, the mixed distribution auxiliary information or the Gaussian distribution auxiliary information;

4. A method according to claim 3, wherein processing the target side information with a target selection module to obtain a biometric vector comprises:

processing a plurality of biological convolution features by using a linear layer to obtain biological linear features;

5. The method of claim 4, wherein processing the target side information with a plurality of depth separation convolution units results in a plurality of biological convolution features, comprising:

filling the first convolution feature to obtain a second convolution feature;

6. The method of claim 3, wherein processing the biometric vector, the mixed training speech information, and the target auxiliary information with a speech enhancement module to generate the enhanced speech information comprises:

7. The method of claim 6, wherein processing the to-be-processed biometric and the mixed training speech information using a convolutional time-domain separation network to obtain the enhanced speech information comprises:

respectively carrying out coding processing on the biological characteristics to be processed and the mixed training voice information to obtain biological coding characteristics and language coding characteristics;

8. The method of claim 2, wherein generating an initial loss value from the enhanced speech information and the enhanced speech tag for each of the enhanced speech information comprises:

Generating a signal distortion loss value according to the enhanced voice information and the enhanced voice tag aiming at the first enhanced voice information;

Generating a first confusion loss value according to the enhanced voice information and the enhanced voice tag aiming at the second enhanced voice information;

Generating a second confusion loss value according to the enhanced voice information and the enhanced voice tag aiming at the third enhanced voice information;

And generating a third confusion loss value according to the enhanced voice information and the enhanced voice tag aiming at the fourth enhanced voice information.

9. The method of claim 8, wherein iteratively adjusting network parameters of the initial enhancement model based on a plurality of the initial loss values results in a trained speech enhancement model, comprising:

10. A nerve-guided speech enhancement method based on personalized brain electrode distribution, comprising:

Inputting the biological auxiliary data and the mixed voice data into a voice enhancement model, and outputting enhanced voice data, wherein the voice of the concerned user in the enhanced voice data is enhanced;

wherein the speech enhancement model is trained by the method of any one of claims 1-9.