CN114974245B - Speech separation method and device, electronic device and storage medium - Google Patents


Info

Publication number
CN114974245B
Authority
CN
China
Prior art keywords
image
voice signal
quality
person
image sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210609847.4A
Other languages
Chinese (zh)
Other versions
CN114974245A (en)
Inventor
胡玉祥
朱长宝
余凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Horizon Robotics Technology Co Ltd
Original Assignee
Nanjing Horizon Robotics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Horizon Robotics Technology Co Ltd
Priority to CN202210609847.4A
Publication of CN114974245A
Application granted
Publication of CN114974245B
Legal status: Active (current)
Anticipated expiration

Abstract

Embodiments of the present disclosure disclose a voice separation method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a first mixed voice signal and a first image sequence in a spatial region; performing image quality detection on the first image sequence to determine the image quality of the first image sequence; in response to the image quality of the first image sequence meeting a preset standard, processing the input first mixed voice signal and first image sequence by using a first voice separation model to obtain a first voice signal; and in response to the image quality of the first image sequence not meeting the preset standard, processing the first mixed voice signal by using a second voice separation model to obtain a second voice signal. According to the embodiments of the present disclosure, the first mixed voice signal can be separated, the person to whom each separated voice signal belongs can be determined, and whether to respond to a control instruction in a separated voice signal for the vehicle-mounted device can be determined according to that person's permission information, giving a good user experience.

Description

Speech separation method and device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of vehicle technology and the field of speech processing technology, and in particular, to a speech separation method and apparatus, an electronic device, and a storage medium.
Background
With the development of voice-based control technology and vehicle technology, a manner of controlling the in-vehicle apparatus by voice has emerged.
In order to facilitate control of the vehicle by multiple users in the vehicle, it is necessary to separately acquire the voice signals of the different passengers. In the related art, voices are separated by blind source separation (BSS). Blind source separation refers to the process of recovering each sound source signal from an aliased (observed) signal when the theoretical model of the signal and the sound source signals are not precisely known. Existing blind source separation methods cannot determine the correspondence between the separated voice signals and the passengers in the vehicle, so effective management of which passengers may control which vehicle-mounted devices cannot be achieved.
Disclosure of Invention
At present, mixed voice signals are separated by blind source separation, and the separated voice signals are not associated with any particular person. Because the person to whom each voice signal belongs is unknown, different permission controls cannot be applied to the separated voice signals, and it is unclear whether the vehicle-mounted device should respond to the control content of a given voice signal, resulting in a poor user experience.
The present disclosure has been made in order to solve the above technical problems. The embodiment of the disclosure provides a voice separation method and device, electronic equipment and a storage medium.
According to a first aspect of an embodiment of the present disclosure, there is provided a voice separation method, including:
Acquiring a first mixed voice signal and a first image sequence in a spatial region, wherein the first mixed voice signal includes a voice signal of a first person and a voice signal of a second person, and the first image sequence is an image sequence acquired in the spatial region that includes the persons in the space;
performing image quality detection on the first image sequence, and determining the image quality of the first image sequence;
Responding to the image quality of the first image sequence meeting a preset standard, processing the input first mixed voice signal and first image sequence by using a first voice separation model to obtain a first voice signal, wherein the first voice signal includes at least one voice signal separated from the first mixed voice signal;
And responding to the image quality of the first image sequence not meeting the preset standard, processing the first mixed voice signal by using a second voice separation model to obtain a second voice signal, wherein the second voice signal includes at least one voice signal separated from the first mixed voice signal.
According to a second aspect of embodiments of the present disclosure, there is provided a voice separation apparatus, including:
an acquisition module, configured to acquire a first mixed voice signal and a first image sequence in a spatial region, wherein the first mixed voice signal includes a voice signal of a first person and a voice signal of a second person, and the first image sequence is an image sequence acquired in the spatial region that includes the persons in the space;
an image quality determining module, configured to perform image quality detection on the first image sequence and determine the image quality of the first image sequence;
a first processing module, configured to process, in response to the image quality of the first image sequence meeting a preset standard, the input first mixed voice signal and first image sequence by using a first voice separation model to obtain a first voice signal, wherein the first voice signal includes at least one voice signal separated from the first mixed voice signal;
and a second processing module, configured to process, in response to the image quality of the first image sequence not meeting the preset standard, the first mixed voice signal by using a second voice separation model to obtain a second voice signal, wherein the second voice signal includes at least one voice signal separated from the first mixed voice signal.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for performing the speech separation method according to the first aspect described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic device comprising:
A processor;
a memory for storing the processor-executable instructions;
The processor is configured to read the executable instructions from the memory and execute the instructions to implement the voice separation method according to the first aspect.
According to the voice separation method and apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure, after the first mixed voice signal and the first image sequence in the spatial region (such as a cockpit) are obtained, image quality detection is performed on the first image sequence to determine its image quality. According to whether the image quality meets the preset standard, the first voice separation model or the second voice separation model is used to perform targeted voice separation on the first mixed voice signal. The separated multi-path voice signals can thus be obtained, the person to whom each path belongs can be determined, and at least one path of voice signals can be output from the multi-path voice signals. Whether to control the vehicle-mounted device to respond to the voice instruction in the output path can then be determined according to the permission information of the person to whom that path belongs.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof in more detail with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a flow diagram of a method of speech separation in one embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of step S2 in an embodiment of the disclosure;
FIG. 3 is a schematic flow chart of step S4 in one embodiment of the present disclosure;
FIG. 4 is a block diagram of a voice separation apparatus in one embodiment of the present disclosure;
FIG. 5 is a block diagram of the image quality determination module 200 in one embodiment of the present disclosure;
FIG. 6 is a block diagram of a second processing module 400 in one embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure is merely an association relation describing the association object, and indicates that three kinds of relations may exist, for example, a and/or B may indicate that a exists alone, and a and B exist together, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the front and rear association objects are an or relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, and servers, which may operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Exemplary overview
An image acquisition device and an audio acquisition device are arranged in a designated spatial region, and the image sequence and the mixed audio signal in the region are acquired through the image acquisition device and the audio acquisition device respectively. For example, a vehicle-mounted camera and a vehicle-mounted microphone array are arranged in a cockpit, and the in-vehicle image sequence and the in-vehicle mixed audio signal are acquired through the vehicle-mounted camera and the vehicle-mounted microphone array respectively.
After the mixed audio signal in the designated spatial region is obtained, background noise (for example, wind noise or mechanical noise) in the mixed audio signal may be reduced, and a first mixed voice signal including the voice signal of a first person and the voice signal of a second person may then be separated from the mixed audio signal based on its audio characteristics.
Image quality detection is performed on the first image sequence using a preset standard to determine its image quality. When the image quality meets the preset standard, the first image sequence can assist voice separation, and the first mixed voice signal and the first image sequence are processed by a first voice separation model to obtain a first voice signal. When the image quality does not meet the preset standard, it is difficult for the first image sequence to assist the voice separation of the first mixed voice signal, so the first mixed voice signal is processed by a second voice separation model, which includes a blind source separation model and person sound source models, to obtain a second voice signal. The persons to whom the first voice signal and the second voice signal belong can be determined, and whether to respond to their control instructions for the vehicle-mounted devices can be decided according to the permission information of those persons, giving a good user experience.
Exemplary method
Fig. 1 is a flow diagram of a method of speech separation in one embodiment of the present disclosure. As shown in fig. 1, the method comprises the following steps:
S1, acquiring a first mixed voice signal and a first image sequence in a spatial region. The first mixed voice signal includes the voice signal of a first person and the voice signal of a second person, and may further include the voice signals of other persons in the spatial region. The first image sequence is an image sequence acquired in the spatial region that includes the persons in the space.
An image acquisition device and an audio acquisition device are arranged in the spatial region, and the first mixed audio signal and the first image sequence in the designated spatial region are acquired through the audio acquisition device and the image acquisition device respectively. The spatial region may be a cockpit, the image acquisition device may include a vehicle-mounted camera, and the audio acquisition device may include a vehicle-mounted microphone array; the in-vehicle image sequence and the in-vehicle mixed audio signal may be acquired through the vehicle-mounted camera and the vehicle-mounted microphone array respectively.
After the mixed audio signal in the designated spatial region is obtained, background noise (e.g., wind noise or mechanical noise) in the mixed audio signal may be reduced, and the first mixed voice signal may then be separated from the mixed audio signal based on its audio characteristics.
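A minimal sketch of this preprocessing step, assuming spectral subtraction with a noise estimate taken from an assumed noise-only prefix of the recording (the patent does not specify the noise-reduction method, so this stands in for it):

```python
import numpy as np
from scipy.signal import stft, istft

def reduce_background_noise(mixed, fs=16000, noise_seconds=0.5, floor=0.05):
    # STFT with 512-sample windows, 50% overlap (scipy defaults to hop = nperseg // 2)
    f, t, Z = stft(mixed, fs=fs, nperseg=512)
    hop = 256
    noise_frames = max(1, int(noise_seconds * fs / hop))
    # Average magnitude of the assumed noise-only prefix, per frequency bin
    noise_mag = np.abs(Z[:, :noise_frames]).mean(axis=1, keepdims=True)
    # Subtract the noise estimate, keeping a small spectral floor
    mag = np.maximum(np.abs(Z) - noise_mag, floor * np.abs(Z))
    _, cleaned = istft(mag * np.exp(1j * np.angle(Z)), fs=fs, nperseg=512)
    return cleaned
```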
And S2, performing image quality detection on the first image sequence, and determining the image quality of the first image sequence.
The preset standard may include a criterion for the image signal dimension and a criterion for the image content dimension. Image quality detection is performed on the first image sequence using the preset standard to determine whether the first image sequence meets it, and the image quality of the first image sequence is then determined according to how well the preset standard is met.
And S3, responding to the image quality of the first image sequence meeting the preset standard, and processing the input first mixed voice signal and the first image sequence by using a first voice separation model to obtain a first voice signal. The first voice signal comprises at least one voice signal separated from the first mixed voice signal.
When the image quality of the first image sequence meets the preset standard, the first mixed voice signal and the first image sequence are used as the input of a pre-trained first voice separation model, and the first voice separation model is utilized to process the first mixed voice signal and the first image sequence, so that the first voice signal is obtained.
Before step S3, a mixed voice signal including multiple persons may be collected in the spatial region over a predetermined period as a sample mixed voice signal, and an image sequence including those persons acquired in the spatial region over the same period may be used as a sample image sequence. The first voice separation model is trained based on the sample mixed voice signal and the sample image sequence. The first voice separation model may perform image recognition on the sample image sequence to determine the speaking times and speaking content of the persons within the predetermined period, and may separate the sample mixed voice signal into multiple paths of voice signals based on the voice characteristics of its different components. Further, the first voice separation model may determine the sound source object of each path based on the speaking times and speaking content of the persons within the predetermined period, thereby determining the person to whom each path of voice signals belongs.
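One way the sound source objects could be determined from speaking times, sketched under the assumption that per-channel voice activity and per-person lip activity are available as frame-aligned boolean timelines (the patent does not fix this mechanism):

```python
import numpy as np

def assign_channels_by_lip_activity(channel_vad, person_lip_activity):
    # channel_vad: (n_channels, n_frames) boolean voice activity per separated channel
    # person_lip_activity: (n_persons, n_frames) boolean lip movement per person
    overlap = (np.asarray(channel_vad, dtype=float)
               @ np.asarray(person_lip_activity, dtype=float).T)
    # Index of the best-matching person for each separated channel
    return overlap.argmax(axis=1)
```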
And S4, processing the first mixed voice signal by using a second voice separation model to obtain a second voice signal, wherein the second voice signal comprises at least one voice signal separated from the first mixed voice signal, in response to the image quality of the first image sequence not meeting a preset standard.
When the image quality of the first image sequence does not meet the preset standard, the image recognition result of the first image sequence can hardly assist in determining the persons to whom the paths separated from the first mixed voice signal belong, so the first mixed voice signal is processed with the second voice separation model to obtain the second voice signal. The second voice separation model may include a blind source separation model and sound source models for determining the sound source objects of the multi-path voice signals. The first mixed voice signal may be separated into multiple paths with the blind source separation model; the persons to whom the paths belong may then be determined with the sound source models, and the second voice signal and the person to whom it belongs may be output.
In this embodiment, after the first mixed voice signal and the first image sequence in the spatial region are acquired, image quality detection is performed on the first image sequence to determine its image quality. According to whether the image quality meets the preset standard, the first voice separation model or the second voice separation model is used to perform targeted voice separation on the first mixed voice signal. Multiple separated paths of voice signals can be obtained, at least one path is output, and whether to control the vehicle-mounted device to respond to the voice instruction in the output path can be determined according to the permission information of the person to whom that path belongs.
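A minimal dispatch over steps S2 to S4 might look as follows; `av_model`, `bss_model`, and `quality_ok` are hypothetical callables standing in for the first voice separation model, the second voice separation model, and the image quality check:

```python
def separate_speech(mixed_signal, image_sequence, av_model, bss_model, quality_ok):
    """Steps S2-S4: pick the separation model from the image quality check."""
    if quality_ok(image_sequence):                      # S2: preset standard met?
        return av_model(mixed_signal, image_sequence)   # S3: audio-visual separation
    return bss_model(mixed_signal)                      # S4: audio-only separation
```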
Fig. 2 is a schematic flow chart of step S2 in an embodiment of the disclosure. As shown in fig. 2, step S2 includes:
s2-1, acquiring an image signal corresponding to the first image sequence, and determining the image signal quality of the image signal.
The signal intensity of the image signal corresponding to the first image sequence can be detected, and the image signal quality determined by comparing the detected signal intensity with a preset signal intensity threshold. For example, when the detected signal intensity is greater than the threshold, the image signal quality is determined to meet the image signal quality standard; when it is less than or equal to the threshold, the image signal quality is determined not to meet the standard.
S2-2, determining the image content quality of the first image sequence based on each image frame of the first image sequence.
Image recognition may be performed on each image frame of the first image sequence, and the image content quality determined based on the recognition results. For example, when the recognition results can assist in determining the persons to whom the paths separated from the first mixed voice signal belong, the image content quality is determined to meet the preset image content quality standard; otherwise, it is determined not to meet the standard.
S2-3, determining the image quality of the first image sequence based on the image signal quality and the image content quality.
It may be specified that when either of the image signal quality and the image content quality does not meet its corresponding quality standard, the image quality of the first image sequence is determined not to meet the preset standard. For example, when the image signal quality does not meet the image signal quality standard, steps S2-2 to S2-3 are skipped, and it is directly determined that the image quality of the first image sequence does not meet the preset standard.
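A sketch of the S2-1 to S2-3 decision with this fail-fast behavior; the signal strength measure, the threshold, and the `content_ok` predicate are assumptions:

```python
def image_quality_meets_standard(frames, signal_strength, strength_threshold, content_ok):
    # S2-1: signal-dimension check; on failure, skip S2-2/S2-3 entirely
    if signal_strength <= strength_threshold:
        return False
    # S2-2: content-dimension check (e.g., lips visible and recognizable)
    if not content_ok(frames):
        return False
    # S2-3: both dimensions pass, so the preset standard is met
    return True
```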
In this embodiment, the image signal quality and the image content quality of the first image sequence are determined by detecting the first image sequence, and the image quality of the first image sequence can be effectively characterized through these two detection dimensions.
In one embodiment of the present disclosure, step S2-2 specifically includes determining a lip occlusion state of the first person and/or the second person in each image frame based on each image frame of the first image sequence, and determining an image content quality based on the lip occlusion state.
The lip occlusion state may be either unoccluded or occluded. When the lips are occluded, whether the first person and/or the second person is speaking cannot be determined from the lip images, their speaking times cannot be determined, and the persons to whom the paths separated from the first mixed voice signal belong cannot be determined based on speaking times.
When the lip occlusion state of the first person and/or the second person in each image frame is that the lip is occluded, determining that the image content quality of the first image sequence does not meet the image content quality standard.
When the lip occlusion state of the first person and/or the second person in each image frame is unoccluded, lip images of the first person and/or the second person can be obtained from each image frame and recognized. If the recognition results of the lip images can determine the speaking times and speaking content of the first person and/or the second person, the image content quality of the first image sequence can be determined to meet the image content quality standard.
In this embodiment, based on each image frame of the first image sequence, the lip occlusion state of the first person and/or the second person can be determined. Based on the lip occlusion state, whether the image content quality meets the image content quality standard can be quickly determined, and in turn whether the image quality of the first image sequence meets the preset standard, so that the choice between the first voice separation model and the second voice separation model can be made quickly.
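One hedged way to turn per-frame lip visibility into a content quality signal; `detect_lip_landmarks` is a hypothetical detector returning landmark points and a confidence per frame, and the thresholds are assumptions:

```python
def lips_unoccluded_ratio(frames, detect_lip_landmarks, conf_threshold=0.5):
    # Confidence below the threshold is treated as the lips being occluded
    # or not found in that frame.
    visible = sum(1 for frame in frames
                  if detect_lip_landmarks(frame)[1] >= conf_threshold)
    return visible / max(len(frames), 1)

# Content quality could then be gated on the ratio, e.g.:
# content_ok = lips_unoccluded_ratio(frames, detector) > 0.8  # assumed threshold
```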
In one embodiment of the present disclosure, step S2-3 includes: in response to the image signal quality not meeting the image signal quality standard, determining that the image quality of the first image sequence does not meet the preset standard; in response to the image content quality not meeting the image content quality standard, determining that the image quality of the first image sequence does not meet the preset standard; and in response to the image signal quality meeting the image signal quality standard and the image content quality meeting the image content quality standard, determining that the image quality of the first image sequence meets the preset standard.
In this embodiment, when the image signal quality does not meet the image signal quality standard, the first image sequence generated from the image signal is generally not clear enough, and it is difficult to determine, based on it, the persons to whom the paths separated from the first mixed voice signal belong. When the image content quality does not meet the image content quality standard, the speaking times and speaking content of the first person and/or the second person cannot be obtained, and those persons cannot be determined either. Therefore, only when both the image signal quality and the image content quality meet their corresponding standards can the image quality of the first image sequence be determined to meet the preset standard, so that the persons to whom the separated paths belong can be effectively determined based on the recognition results of the first image sequence. When either quality fails its standard, it can be quickly determined that the image quality of the first image sequence does not meet the preset standard.
In one embodiment of the present disclosure, step S2-2 further includes: in response to the lip occlusion state indicating that the lips of the first person and/or the second person are unoccluded, determining a lip motion of the first person and/or the second person based on each image frame of the first image sequence; and in response to the lip motion not meeting a preset lip motion standard, determining that the image content quality of the first image sequence does not meet the image content quality standard.
If the lips of the first person and/or the second person are unoccluded in each image frame of the first image sequence, a sequence of lip image patches of the first person and/or the second person can be obtained from the first image sequence, and the lip motions of the first person and/or the second person obtained by recognition based on that sequence.
A preset lip motion standard usable for lip reading is acquired. The lip motion standard may include, for example, that the upper and lower rows of teeth do not touch while a person speaks, so that lip motions made while eating can be filtered out. Lip motions not produced by speaking can thus be filtered out by the lip motion standard.
When the lip motion does not meet the preset lip motion standard, effective lip reading is difficult to perform based on the first image sequence, so the persons to whom the paths separated from the first mixed voice signal belong cannot be accurately determined.
In this embodiment, when the lips of the first person and/or the second person are unoccluded in each image frame of the first image sequence, their lip motions can be determined based on those frames. When the lip motions do not meet the preset lip motion standard, effective lip reading is difficult to perform based on the first image sequence, and the first image sequence cannot characterize the persons to whom the paths separated from the first mixed voice signal belong.
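A crude stand-in for such a lip motion standard, based on a mouth aspect ratio computed from hypothetical lip landmarks; the landmark layout and both thresholds are assumptions, not values from the patent:

```python
import numpy as np

def mouth_aspect_ratio(lip_points):
    # Opening height over mouth width; landmark indices (corners at 0/4,
    # inner mid-points at 2/6) follow an assumed layout.
    vertical = np.linalg.norm(lip_points[2] - lip_points[6])
    horizontal = np.linalg.norm(lip_points[0] - lip_points[4])
    return vertical / (horizontal + 1e-8)

def meets_lip_motion_standard(ratios, open_threshold=0.3, min_cycles=3):
    # Speaking tends to produce repeated open/close cycles at speech rate,
    # unlike a mouth held open; count threshold crossings as cycles.
    opened = (np.asarray(ratios) > open_threshold).astype(int)
    cycles = int(np.abs(np.diff(opened)).sum()) // 2
    return cycles >= min_cycles
```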
In one embodiment of the present disclosure, the second voice separation model includes a first person sound source model, a second person sound source model, and a blind source separation model. The blind source separation model is used to perform blind source separation on the first mixed voice signal; the first person sound source model is used to determine the voice signal of the first person based on the blind source separation result; and the second person sound source model is used to determine the voice signal of the second person based on the blind source separation result.
When the second voice separation model is used to process the first mixed voice signal in step S4, the blind source separation model performs voice separation on the first mixed voice signal to obtain multiple paths of voice signals. Each person's voice corresponds to one path, and the multiple paths include at least the voice signal of the first person and the voice signal of the second person. When the first mixed voice signal also includes the voice signals of other persons, the multiple paths include those as well.
After the multiple paths of voice signals are obtained, they are matched against the sound source characteristics of the first person sound source model and the second person sound source model, so as to determine which path has the first person as its sound source object and which path has the second person as its sound source object.
After the path corresponding to the first person and the path corresponding to the second person are determined, the second voice signal is determined based on the path corresponding to the first person and/or the path corresponding to the second person, and is then output.
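Sound source feature matching could be sketched as cosine similarity between a speaker embedding of each separated path and each enrolled person model; `embed` is a hypothetical embedding extractor, and the reference embeddings come from the online modeling step described below:

```python
import numpy as np

def match_channels_to_persons(channels, person_models, embed):
    # person_models: person id -> reference embedding (see step B below);
    # embed: hypothetical speaker-embedding extractor for a waveform.
    assignment = {}
    for idx, channel in enumerate(channels):
        e = embed(channel)
        scores = {pid: float(np.dot(e, ref) /
                             (np.linalg.norm(e) * np.linalg.norm(ref) + 1e-8))
                  for pid, ref in person_models.items()}
        assignment[idx] = max(scores, key=scores.get)
    return assignment
```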
In this embodiment, when the image quality of the first image sequence does not meet the preset standard, the image recognition result of the first image sequence can hardly assist in determining the persons to whom the separated paths belong. The blind source separation model is therefore used to separate the first mixed voice signal into multiple paths, and the first person sound source model and the second person sound source model are used to match the sound source characteristics of those paths, so that the path whose sound source object is the first person and the path whose sound source object is the second person can be determined. Accurate voice signal separation and sound source object matching of the first mixed voice signal can thus be achieved.
In one embodiment of the present disclosure, before processing the first mixed speech signal with the second speech separation model to obtain the second speech signal, further comprising:
Step A: processing a second mixed voice signal and a second image sequence based on the first voice separation model to obtain a sound source signal of the first person and a sound source signal of the second person. The acquisition time of the second mixed voice signal is earlier than that of the first mixed voice signal, and the acquisition time of the second image sequence is earlier than that of the first image sequence. The second mixed voice signal includes the sound source signal of the first person and the sound source signal of the second person, and the second image sequence is an image sequence acquired in the spatial region that includes the persons in the space.
Because the first voice separation model performs voice separation by combining voice channel separation with lip reading, it can be used to accurately separate the second mixed voice signal and match its paths to sound source objects, so that the sound source signal of the first person and the sound source signal of the second person can be obtained.
Step B: performing online modeling based on the sound source signal of the first person and the sound source signal of the second person to obtain the first person sound source model and the second person sound source model.
The sound source characteristics of the first person's sound source signal can be extracted, and the first person sound source model trained in an online modeling and scoring manner. Likewise, the sound source characteristics of the second person's sound source signal can be extracted, and the second person sound source model trained in the same manner.
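A minimal online-modeling sketch, assuming a person sound source model is simply the normalized mean of frame-level speaker embeddings of that person's separated source signal (one possibility; the patent does not fix the modeling method):

```python
import numpy as np

def build_person_sound_source_model(source_signal, embed, frame_len=16000):
    # Slice the separated source signal into one-second frames (at 16 kHz),
    # embed each frame, and average into a single normalized reference.
    frames = [source_signal[i:i + frame_len]
              for i in range(0, len(source_signal) - frame_len + 1, frame_len)]
    if not frames:                       # signal shorter than one frame
        frames = [source_signal]
    embeddings = np.stack([embed(f) for f in frames])
    reference = embeddings.mean(axis=0)
    return reference / (np.linalg.norm(reference) + 1e-8)
```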
In this embodiment, the first voice separation model is used to accurately separate the second mixed voice signal and match its paths to sound source objects, so that the sound source signal of the first person and the sound source signal of the second person can be obtained; online modeling based on these sound source signals then quickly yields the first person sound source model and the second person sound source model.
Fig. 3 is a flow chart of step S4 in one embodiment of the present disclosure. As shown in fig. 3, step S4 includes:
S4-1, based on the first image sequence, identity information of the first person and/or the second person is determined.
The face features of the first person and the second person can be stored in advance, and the identity information of the first person and/or the second person determined by face feature matching.
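Face feature matching could be sketched as nearest-neighbor search over the pre-stored face features; the embedding extractor and the acceptance threshold are assumptions:

```python
import numpy as np

def identify_person(face_embedding, enrolled_faces, threshold=0.6):
    # enrolled_faces: person id -> pre-stored face feature vector
    best_id, best_score = None, -1.0
    for pid, ref in enrolled_faces.items():
        score = float(np.dot(face_embedding, ref) /
                      (np.linalg.norm(face_embedding) * np.linalg.norm(ref) + 1e-8))
        if score > best_score:
            best_id, best_score = pid, score
    return best_id if best_score >= threshold else None
```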
S4-2, based on the identity information, acquiring a first character sound source model and/or a second character sound source model matched with the identity information, and acquiring a blind source separation model.
Because the first person sound source model and the second person sound source model are trained in advance, after the identity information of the first person and/or the second person is determined, the first person sound source model and/or the second person sound source model in the second voice separation model can be retrieved, and the blind source separation model can be retrieved.
S4-3, processing the first mixed voice signal based on the first character sound source model and/or the second character sound source model and the blind source separation model to obtain a second voice signal.
The blind source separation model performs voice separation on the first mixed voice signal to obtain multiple paths of voice signals, and the first person sound source model and/or the second person sound source model are used to match the sound source characteristics of those paths, so that the path whose sound source object is the first person and the path whose sound source object is the second person can be determined, and the second voice signal obtained.
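A sketch of this step with FastICA standing in for the unspecified blind source separation model, reusing the `match_channels_to_persons` sketch above for sound source matching:

```python
import numpy as np
from sklearn.decomposition import FastICA

def second_model_separate(mixed, person_models, embed, n_sources=2):
    # mixed: (n_samples, n_mics) multichannel recording from the microphone array
    ica = FastICA(n_components=n_sources, random_state=0)
    channels = ica.fit_transform(mixed).T   # (n_sources, n_samples)
    # Attach each separated channel to an enrolled person sound source model
    assignment = match_channels_to_persons(channels, person_models, embed)
    return channels, assignment
```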
In this embodiment, the identity information of the first person and/or the second person can be recognized from the first image sequence, so that the first person sound source model and/or the second person sound source model, as well as the blind source separation model, can be retrieved. When the image quality of the first image sequence does not meet the preset standard, the first mixed voice signal can thus still be accurately separated by combining the blind source separation model with the person sound source models.
Any of the voice separation methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including, but not limited to, a terminal device, a server, and the like. Alternatively, any of these methods may be executed by a processor that invokes corresponding instructions stored in a memory. Details are not repeated below.
Exemplary apparatus
Fig. 4 is a block diagram of a voice separation apparatus in one embodiment of the present disclosure. As shown in fig. 4, the voice separation apparatus includes:
An acquisition module 100, configured to acquire a first mixed voice signal and a first image sequence in a spatial region, where the first mixed voice signal includes a voice signal of a first person and a voice signal of a second person, and the first image sequence is an image sequence acquired in the spatial region that includes the persons in the space;
an image quality determining module 200, configured to perform image quality detection on the first image sequence, and determine image quality of the first image sequence;
The first processing module 300 is configured to process, by using a first speech separation model, the input first mixed speech signal and the first image sequence to obtain a first speech signal in response to the image quality of the first image sequence meeting a preset criterion, where the first speech signal includes at least one speech signal separated from the first mixed speech signal;
and the second processing module 400 is configured to process the first mixed speech signal by using a second speech separation model to obtain a second speech signal in response to the image quality of the first image sequence not meeting the preset standard, where the second speech signal includes at least one speech signal separated from the first mixed speech signal.
Fig. 5 is a block diagram of the image quality determination module 200 in one embodiment of the present disclosure. As shown in fig. 5, the image quality determining module 200 includes:
an image signal quality determining unit 210, configured to obtain an image signal corresponding to the first image sequence, and determine an image signal quality of the image signal;
An image content quality determining unit 220 for determining an image content quality of the first image sequence based on each image frame of the first image sequence;
an image quality determining unit 230 for determining an image quality of the first image sequence based on the image signal quality and the image content quality.
In one embodiment of the present disclosure, the image content quality determining unit 220 is configured to determine a lip occlusion state of the first person and/or the second person in each image frame based on each image frame of the first image sequence, and the image content quality determining unit 220 is further configured to determine the image content quality based on the lip occlusion state.
In one embodiment of the present disclosure, the image quality determining unit 230 is configured to determine that the image quality of the first image sequence does not meet the preset standard in response to the image signal quality not meeting the image signal quality standard; the image quality determining unit 230 is further configured to determine that the image quality of the first image sequence does not meet the preset standard in response to the image content quality not meeting the image content quality standard; and the image quality determining unit 230 is further configured to determine that the image quality of the first image sequence meets the preset standard in response to the image signal quality meeting the image signal quality standard and the image content quality meeting the image content quality standard.
In one embodiment of the present disclosure, the image content quality determining unit 220 is configured to determine, based on each image frame of the first image sequence, a lip motion of the first person and/or the second person in response to the lip occlusion state indicating that the lips of the first person and/or the second person are unoccluded; the image content quality determining unit 220 is further configured to determine, in response to the lip motion not meeting the preset lip motion standard, that the image content quality of the first image sequence does not meet the image content quality standard.
In one embodiment of the present disclosure, the second voice separation model includes a first person sound source model, a second person sound source model, and a blind source separation model; the blind source separation model is used to perform blind source separation on the first mixed voice signal, the first person sound source model is used to determine the voice signal of the first person based on the blind source separation result, and the second person sound source model is used to determine the voice signal of the second person based on the blind source separation result.
In one embodiment of the disclosure, the second processing module 400 is specifically configured to process a second mixed voice signal and a second image sequence based on the first voice separation model to obtain a sound source signal of the first person and a sound source signal of the second person, where the acquisition time of the second mixed voice signal is earlier than that of the first mixed voice signal, the acquisition time of the second image sequence is earlier than that of the first image sequence, the second mixed voice signal includes the sound source signal of the first person and the sound source signal of the second person, and the second image sequence is an image sequence acquired in the spatial region that includes the persons in the space. The second processing module 400 is further configured to perform online modeling based on the sound source signal of the first person and the sound source signal of the second person to obtain the first person sound source model and the second person sound source model.
Fig. 6 is a block diagram of a second processing module 400 in one embodiment of the present disclosure. As shown in fig. 6, the second processing module 400 includes:
an identity information determining unit 410, configured to determine identity information of the first person and/or the second person based on the first image sequence;
A model obtaining unit 420, configured to obtain the first character sound source model and/or the second character sound source model that are matched with the identity information, and obtain the blind source separation model based on the identity information;
The speech signal determining unit 430 is configured to process the first mixed speech signal based on the first human sound source model and/or the second human sound source model and the blind source separation model to obtain the second speech signal.
It should be noted that the specific implementation of the voice separation apparatus in the embodiments of the present disclosure is similar to that of the voice separation method in the embodiments of the present disclosure, so reference may be made to the description of the voice separation method; it is not repeated here to avoid redundancy.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present disclosure is described with reference to fig. 7. As shown in fig. 7, the electronic device includes one or more processors 10 and a memory 20.
The processor 10 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
Memory 20 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 10 to implement the voice separation methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.
In one example, the electronic device may further include an input device 30 and an output device 40, which are interconnected by a bus system and/or other form of connection mechanism (not shown). The input device 30 may be, for example, a keyboard, a mouse, etc. Output device 40 may include, for example, a display, speakers, a printer, and a communication network and remote output apparatus connected thereto, etc.
Of course, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 7 for simplicity, components such as buses, input/output interfaces, and the like being omitted. In addition, the electronic device may include any other suitable components depending on the particular application.
Exemplary computer-readable storage Medium
A computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are mutually referred to. For system embodiments, the description is relatively simple as it essentially corresponds to method embodiments, and reference should be made to the description of method embodiments for relevant points.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", "having", and the like are open-ended words that mean "including but not limited to" and may be used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (11)

CN202210609847.4A | 2022-05-31 (priority) | 2022-05-31 (filed) | Speech separation method and device, electronic device and storage medium | Active | CN114974245B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210609847.4A (CN114974245B, en) | 2022-05-31 | 2022-05-31 | Speech separation method and device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210609847.4A (CN114974245B, en) | 2022-05-31 | 2022-05-31 | Speech separation method and device, electronic device and storage medium

Publications (2)

Publication Number | Publication Date
CN114974245A (en) | 2022-08-30
CN114974245B | 2025-07-01

Family

ID=82957529

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210609847.4A (Active, CN114974245B, en) | Speech separation method and device, electronic device and storage medium | 2022-05-31 | 2022-05-31

Country Status (1)

Country | Link
CN (1) | CN114974245B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116013262A (en) * | 2023-01-10 | 2023-04-25 | Beijing Horizon Robotics Technology R&D Co Ltd | Voice signal processing method and device, readable storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10178301B1 (en) * | 2015-06-25 | 2019-01-08 | Amazon Technologies, Inc. | User identification based on voice and face
WO2020232867A1 (en) * | 2019-05-21 | 2020-11-26 | Ping An Technology (Shenzhen) Co Ltd | Lip-reading recognition method and apparatus, computer device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US9462406B2 (en) * | 2014-07-17 | 2016-10-04 | Nokia Technologies Oy | Method and apparatus for facilitating spatial audio capture with multiple devices
US9521365B2 (en) * | 2015-04-02 | 2016-12-13 | AT&T Intellectual Property I, L.P. | Image-based techniques for audio content
CN111276155B (en) * | 2019-12-20 | 2023-05-30 | Shanghai Mininglamp Artificial Intelligence (Group) Co Ltd | Voice separation method, device and storage medium
CN113593587B (en) * | 2021-08-06 | 2022-07-29 | Suqian Silicon-Based Intelligent Technology Co Ltd | Voice separation method and device, storage medium, and electronic device
CN113782048B (en) * | 2021-09-24 | 2024-07-09 | iFlytek Co Ltd | Multi-mode voice separation method, training method and related device


Also Published As

Publication number | Publication date
CN114974245A (en) | 2022-08-30

Similar Documents

Publication | Title
KR102535338B1 (en) | Speaker diarization using speaker embedding(s) and trained generative model
EP4123646B1 (en) | Voice detection method based on multiple sound regions, related device, and storage medium
CN111508474B (en) | Voice interruption method, electronic equipment and storage device
Sargin et al. | Audiovisual synchronization and fusion using canonical correlation analysis
US9293133B2 (en) | Improving voice communication over a network
CN111916061B (en) | Voice endpoint detection method and device, readable storage medium and electronic equipment
CN111524527A (en) | Speaker separation method, device, electronic equipment and storage medium
US20240304201A1 (en) | Audio-based processing method and apparatus
EP4310838B1 (en) | Speech wakeup method and apparatus, and storage medium and system
CN112949708A (en) | Emotion recognition method and device, computer equipment and storage medium
CN112397065A (en) | Voice interaction method and device, computer readable storage medium and electronic equipment
CN111901627B (en) | Video processing method and device, storage medium and electronic equipment
Ishii et al. | Multimodal fusion using respiration and gaze for predicting next speaker in multi-party meetings
US20230238002A1 (en) | Signal processing device, signal processing method and program
CN114974245B (en) | Speech separation method and device, electronic device and storage medium
CN104135638A (en) | Optimized video snapshot
Ghaemmaghami et al. | Complete-linkage clustering for voice activity detection in audio and visual speech
CN115691551A (en) | Dangerous event detection method and device and storage medium
CN118918926B (en) | Bullying event detection method and system based on acoustic event recognition and emotion recognition
CN113259620B (en) | Video conference data synchronization method and device
CN117939238A (en) | Character recognition method, system, computing device and computer-readable storage medium
CN114550721B (en) | Method, device, electronic device and storage medium for detecting user conversation status
Abel et al. | Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system
JP7426919B2 | Program, device and method for estimating causal terms from images
CN114038461A (en) | Voice interaction auxiliary operation method and device and computer readable storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
