KR20240125562A


Info

Publication number: KR20240125562A
Application number: KR1020247017586A
Authority: KR
Inventors: Emile de la Rey; Paris Smaragdis
Original assignee: Wingnut Films Productions Limited
Priority date: 2021-10-27
Filing date: 2022-10-27
Publication date: 2024-08-19
Also published as: JP2024540243A; KR20240125563A; AU2022374886A1; EP4423745A1; EP4423747A1; AU2022377385A1; KR20240125564A; EP4423746A1; KR20240125561A; JP2024540244A; AU2022379024A1; WO2023073598A1; JP2024540239A; EP4423744A1; AU2022374615A1; WO2023073597A1; WO2023073596A1; WO2023073595A1; JP2024540242A

Abstract

Translated from Korean

Systems and methods for audio source separation include receiving an audio input stream comprising a mixture of audio signals generated from a plurality of audio sources; processing the audio input stream via a trained audio source separation model to generate a plurality of audio stems corresponding to one or more of the plurality of audio sources; updating the audio source separation model based at least in part on the plurality of audio stems using a self-iterative processing and training system; and reprocessing the audio input stream using the updated audio source separation model to generate a plurality of enhanced audio stems.

Description

Audio source separation systems and methods

Cross-reference to related applications

This disclosure claims priority to and the benefit of U.S. Provisional Patent Application No. 63/272,650, filed October 27, 2021, and U.S. Patent Application No. 17/848,341, filed June 23, 2022, both of which are incorporated herein by reference for all purposes as if fully set forth herein.

Field

The present disclosure relates generally to systems and methods for audio source separation, and more particularly to systems and methods for separating and enhancing audio source signals from an audio mixture.

Audio mixing is the process of combining multiple audio recordings to produce an optimized mix for playback in one or more desired sound formats, such as mono, stereo, or surround sound. In applications that require high-quality sound production, such as music and film sound production, audio mixes are typically produced by mixing separate high-quality recordings. These separate recordings are often produced in a controlled environment, such as a recording studio with optimized acoustics and high-quality recording equipment.

Often, some of the source audio is of low quality and/or contains a mixture of desired audio sources and unwanted noise. In modern audio post-production, it is common to re-record audio when the original recordings do not have the desired quality. For example, in music recording, a vocal or instrument track may be re-recorded and mixed with the earlier recordings. In film sound post-production, it is common to bring actors into the studio to re-record their dialogue and to add other audio (e.g., sound effects, music) to the mix.

In some applications, however, it is desirable to faithfully convert the original audio source into a high-quality audio mix. For example, films, music, television broadcasts, and other audio recordings may date back more than 100 years. The source audio may have been recorded on old, low-quality equipment and may contain a low-quality mixture of the desired audio and noise. For many recordings, a single-track/mono audio mix is the only audio source available from which to create an optimized mix for playback on modern sound systems.

One approach to processing an audio mixture is to separate it into a set of audio source components, producing a separate audio stem for each component of the mixture. For example, a music recording might be separated into a vocal component, a guitar component, a bass component, and a drum component. Each of the separate components can then be enhanced and remixed to optimize playback.

However, existing audio source separation techniques are not optimized to produce the high-quality audio stems needed for high-fidelity output in the music and film industries. Audio source separation is particularly difficult when the audio source consists of low-quality, single-track, noisy sound mixtures from old sound recordings.

In view of the foregoing, there is a continuing need for improved audio source separation systems and methods, particularly for generating high-fidelity audio from low-quality audio sources.

It is an object of at least preferred embodiments to address at least some of the aforementioned disadvantages. An additional or alternative object is to at least provide the public with a useful alternative to prior techniques.

Improved audio source separation systems and methods are disclosed herein. In various implementations, a single-track audio recording is provided to an audio source separation system configured to separate various audio components, such as speech and individual instruments, into high-fidelity stems (e.g., separate or grouped collections of audio sources mixed together).

In some implementations, the audio source separation system includes a first machine learning model trained to separate a single-track audio recording into stems containing speech, complementary sounds, and artifacts such as "clicks." Additional machine learning models can then be used to remove processing artifacts from the speech and/or to fine-tune the first machine learning model to improve the speech stems.

The present disclosure relates to an audio processing method comprising separating one or more audio source signals from a single-track audio mixture using a trained deep neural network (DNN).

The term "comprising" as used in this specification means "consisting at least in part of." When interpreting each statement in this specification that includes the term "comprising," features other than those prefaced by the term may also be present. Related terms such as "comprise" and "comprises" are to be interpreted in the same manner.

The method may further comprise configuring the DNN to receive a signal input and generate a signal output without at least some time domain encoding and/or time domain decoding layers.

The method may further comprise configuring the DNN to apply a windowing function.

The method may further comprise configuring the DNN to apply a windowing function to an overlap-add process to smooth banding artifacts.

The method may further comprise configuring the DNN to separate one or more audio signals without applying a mask.

The DNN model may be trained using a 48 kHz sample rate.

Training of the DNN may use a signal processing pipeline operating at 48 kHz.

The method may further comprise applying a separation strength parameter to the DNN, wherein applying the separation strength parameter controls the strength of the separation process applied to an input audio signal.
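The text does not say how the separation strength parameter acts on the signal. As a minimal sketch, assuming one plausible reading in which the parameter crossfades between the unprocessed mixture and a fully separated stem, the control could look like the following. The function name `apply_separation_strength` and the linear blend are illustrative assumptions, not the disclosed method.

```python
def apply_separation_strength(mixture, stem, strength):
    """Blend the unprocessed mixture with a separated stem.

    ASSUMPTION: a linear crossfade is only one possible interpretation of
    "separation strength"; the source does not define the mapping.
    strength = 0.0 returns the mixture unchanged, 1.0 returns the stem.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    if len(mixture) != len(stem):
        raise ValueError("mixture and stem must have the same length")
    return [(1.0 - strength) * m + strength * s for m, s in zip(mixture, stem)]
```

In this reading, intermediate values leave some of the original background in the output, which may be preferable when full separation sounds unnatural.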

The DNN may be trained using a speech training dataset comprising a plurality of labeled speech samples.

The DNN may be trained using a non-speech training dataset comprising a plurality of labeled music and/or noise data samples.

The method may further comprise generating labeled audio samples for use in training the DNN.

The method may further comprise self-iterating the step of generating labeled audio samples for use in training the DNN.

The labeled audio samples may be generated from the input audio mixture and/or from audio source stems output by the DNN.

The method may further comprise applying pre-mixture and/or post-mixture augmentation.

The DNN may be trained at frequencies above the audible range in order to recognize separate audio stems in the lower, audible frequency range.

Applying the augmentations may involve reverberation and/or filter probability parameters.
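A probability parameter of this kind can be read as the chance that a given augmentation is applied to a training example. The sketch below illustrates that reading only. The helper names (`simple_echo`, `one_pole_lowpass`, `augment`) are hypothetical, and the single-tap echo and one-pole filter are toy stand-ins for whatever reverberation and filtering a real training pipeline would use.

```python
import random


def simple_echo(samples, delay=3, gain=0.5):
    """Toy stand-in for reverberation: add one delayed, attenuated copy."""
    return [x + (gain * samples[i - delay] if i >= delay else 0.0)
            for i, x in enumerate(samples)]


def one_pole_lowpass(samples, alpha=0.5):
    """Toy stand-in for a filter augmentation: first-order low-pass."""
    out, prev = [], 0.0
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def augment(samples, reverb_prob, filter_prob, rng):
    """Apply each augmentation independently with its own probability."""
    out = list(samples)
    if rng.random() < reverb_prob:
        out = simple_echo(out)
    if rng.random() < filter_prob:
        out = one_pole_lowpass(out)
    return out
```

Passing an explicit `random.Random` instance as `rng` keeps the augmentation reproducible for a given seed.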

The method may further comprise matching the acoustic qualities of unseen target mixtures during training of the DNN, based on relative signal levels.

Some embodiments relate to a carrier medium. The carrier medium (a computer-readable medium) can store instructions that, when executed by one or more processors of a machine, cause the machine to perform any of the methods described above. The carrier medium can include a storage medium or a transitory medium such as a signal.

Some embodiments relate to a system comprising: an audio input configured to receive an audio input stream comprising a mixture of audio signals generated from a plurality of audio sources; a trained audio source separation model module configured to receive the audio input stream and generate a plurality of audio stems, wherein the generated audio stems correspond to one or more of the plurality of audio sources; and a self-iterative training system configured to update the trained audio source separation model into an updated audio source separation model based at least in part on the generated audio stems, wherein the updated audio source separation model module is configured to reprocess the audio input stream to generate a plurality of enhanced audio stems.
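The control flow of this system (separate, update the model from its own stems, then reprocess the same input) can be sketched as follows. This is a structural illustration only: the "model" here is a single amplitude threshold and the "training" step is a toy rule, both assumptions chosen to keep the sketch runnable, and neither resembles the neural network described in this disclosure.

```python
def separate(threshold, mixture):
    """Toy 'model': samples at or above the threshold go to stem 0."""
    loud = [x if abs(x) >= threshold else 0.0 for x in mixture]
    quiet = [x - l for x, l in zip(mixture, loud)]
    return [loud, quiet]


def update(threshold, stems):
    """Toy 'training': move the threshold midway between the stems' mean levels."""
    levels = []
    for stem in stems:
        active = [abs(x) for x in stem if x != 0.0]
        if not active:
            return threshold  # nothing to learn from; keep the current model
        levels.append(sum(active) / len(active))
    return 0.5 * (levels[0] + levels[1])


def self_iterative_separation(mixture, threshold, rounds=2):
    """Separate, update the model from its own output, and reprocess."""
    stems = separate(threshold, mixture)       # first pass with the trained model
    for _ in range(rounds):
        threshold = update(threshold, stems)   # update model from generated stems
        stems = separate(threshold, mixture)   # reprocess the same input stream
    return threshold, stems
```

The point of the sketch is the loop shape: the same input is processed more than once, and each pass uses a model that was adjusted using the previous pass's stems.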

The system can perform any of the methods described herein.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described in the Detailed Description below. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter. A more extensive presentation of the features, details, utilities, and advantages of the methods as defined in the claims is provided in the following written description of various implementations of the present disclosure and is illustrated in the accompanying drawings.

Where this specification refers to patent specifications, other external documents, or other sources of information, this is generally to provide context for discussing features of the exemplary embodiments and implementations. Unless specifically stated otherwise, reference to such external documents or sources of information is not to be construed as an admission, in any jurisdiction, that they are prior art or form part of the common general knowledge in the art.

Aspects of the present disclosure and their advantages may be better understood by reference to the following drawings and the detailed description that follows. Like reference numerals are used to identify like elements illustrated in one or more of the drawings, and it should be understood that what is shown therein is intended to illustrate implementations of the present disclosure and not to limit it. The components of the drawings are not necessarily to scale, with emphasis instead being placed upon clearly illustrating the principles of the present disclosure.
FIG. 1 illustrates an audio source separation system and process according to one or more implementations.
FIG. 2 illustrates elements related to the system and process of FIG. 1 according to one or more implementations.
FIG. 3 is a diagram illustrating machine learning datasets and a training dataloader according to one or more implementations.
FIG. 4 illustrates an exemplary machine learning training system including a self-iterative dataset generation loop according to one or more implementations.
FIG. 5 illustrates exemplary operation of a dataloader for use in training a machine learning system according to one or more implementations.
FIG. 6 illustrates exemplary machine learning training methods according to one or more implementations.
FIG. 7 illustrates exemplary machine learning training methods including training mixture examples according to one or more implementations.
FIG. 8 illustrates exemplary machine learning processing according to one or more implementations.
FIG. 9 illustrates an exemplary post-processing model configured to clean up artifacts introduced by machine learning processing according to one or more implementations.
FIG. 10 illustrates an exemplary user-guided, self-iterative processing and training loop according to one or more implementations.
FIG. 11 includes FIGS. 11a and 11b illustrating exemplary machine learning applications according to one or more implementations.
FIG. 12 illustrates an exemplary machine learning processing application according to one or more implementations.
FIG. 13 includes FIGS. 13a, 13b, 13c, 13d, and 13e illustrating one or more examples of a multi-model solution processing system according to one or more implementations.
FIG. 14 illustrates an exemplary audio processing system according to one or more implementations.
FIG. 15 illustrates an exemplary neural network that may be used in one or more of the implementations of FIGS. 1 to 14, according to one or more implementations.

In the following description, various implementations will be described. For purposes of explanation, specific configurations and details are set forth to provide a thorough understanding of the implementations. However, it will be apparent to one skilled in the art that the implementations can be practiced without the specific details. In addition, well-known features may be omitted or simplified to avoid obscuring the implementation being described.

Improved audio source separation systems and methods are disclosed herein. In various implementations, a single-track (e.g., undifferentiated) audio recording is provided to an audio source separation system configured to separate various audio components, such as speech and instruments, into high-fidelity stems (separate or grouped collections of audio sources mixed together). In various implementations, the single-track audio recording comprises an unseen audio mixture (e.g., the audio sources, the recording environment, and/or other aspects of the audio mixture are unknown to the audio source separation system), and the audio source separation systems and methods are adapted to identify and/or separate the audio sources from the unseen audio mixture in a self-iterative training and fine-tuning process.

The systems and methods disclosed herein can be implemented in at least one computer-readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to perform any of the method steps disclosed herein. Some embodiments relate to a computer system comprising at least one processor and a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform any of the method steps disclosed herein.

In some examples, training a model includes processing one or more data structures to form a new data structure that components of the computer system can access and use as the model. For example, an artificial intelligence system may include a computer having one or more processors, program code memory, writable data memory, and some inputs/outputs. The writable data memory may hold data structures corresponding to a trained or an untrained model. Such a data structure may represent one or more layers of neural network nodes and the links between nodes in different layers, as well as weights for at least some of those links. In other examples, different types of data structures may represent the model.

In some cases, references to training a model, feeding a model, and/or having a model take inputs and provide outputs may refer to actions of a computer that can read the writable data memory containing the model and that executes program code for working with the model. For example, the model may be trained with a collection of training data, which may be the training examples themselves and/or training examples and corresponding ground truths. Once trained, the model may be used to make decisions about examples provided to it. This may be done by the computer receiving, at an input, input data representing an example, performing processes using the example and the model, and outputting, at an output, output data representing and/or indicating a decision made by or based on the model.

In a very specific example, an artificial intelligence system may have a processor that reads a large number of pictures of cars together with ground truth data indicating "these are cars." The processor may also read a large number of pictures of lamp posts and the like, together with ground truth data indicating "these are not cars." In some examples, the model is trained on the input data itself without being provided with ground truth data. The result of this processing can be a trained model. The artificial intelligence system can then be given an image with no indication of whether it is an image of a car and, using the trained model, output an indication of its determination as to whether the image is of a car.

For processing audio signals, data, recordings, and the like, the inputs may be the audio data itself, with or without some ground truth data about the audio data. Once trained, the artificial intelligence system can receive unknown audio data and output decision data about it. For example, the output decision data may relate to extracted sounds, stems, frequencies, and so on, or to observations or other determinations the AI makes about the input audio data.

The resulting data structure corresponding to the trained model can then be ported or distributed to other computer systems, which can use the trained model. Once trained, the computer code in the program memory, when executed by the processor, can receive an image at an input; because the data structure represents the trained AI model, the program code can process the input and output a decision about the nature of the input. In some implementations, the AI model may comprise data structures and program code that are intertwined and not easily separated.

The model may be represented as a set of weights assigned to the edges of a neural network graph, as program code and data comprising the neural network graph and instructions for interacting with it, as a mathematical representation such as a regression model or a classification model, and/or as other data structures known in the art. A neural model (or neural network, or neural network model) may be implemented as a data structure that describes or represents a set of connected nodes, often called neurons, many of which may be data structures that mimic or simulate the signal processing performed by a biological neuron. Training may involve updating parameters associated with each neuron, such as which other neurons it is connected to and the weights and/or function of its inputs and outputs. As a result, the neural model can pass and process input data variables in a particular way to produce output variables that serve a particular purpose, such as a binary classification of whether an input image or input dataset fits a category. The training process may involve complex computations (e.g., computing gradient updates and then using the gradients to update the parameters layer by layer). Training may be done with some parallel processing.

Any of the models for audio source separation described herein may also be referred to, at least in part, by a number of terms, including, for example, "audio source separation model," "recurrent neural network," "RNN," "deep neural network," "DNN," "inference model," or "neural network."

FIGS. 1 and 2 illustrate an audio source separation system and process according to one or more implementations of the present disclosure. The audio processing system (100) includes a core machine learning system (110), core model manipulations (130), and a modified recurrent neural network (RNN) class model (160). In the illustrated implementations, the core machine learning system (110) implements an RNN class audio source separation model (RNN-CASSM) (112), shown in simplified form in FIG. 1. As illustrated, the RNN-CASSM (112) receives a signal input (114) that is fed to a time domain encoder (116). The signal input (114) comprises a single-channel audio mixture, which may be received from a stored audio file accessed by the RNN-CASSM network, from an audio input stream received from a separate system component, or from another audio data source. The time domain encoder (116) models the input audio signal in the time domain and estimates audio mixture weights. In some implementations, the time domain encoder (116) splits the audio signal input into separate waveform segments, which are normalized for input to a 1-D convolutional coder.
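The segmentation step performed by the time domain encoder can be illustrated with a short sketch. The text does not specify the segment length, the hop, or the normalization. The version below assumes unit L2-norm scaling per segment and zero-padding of a short input, and the function name `split_and_normalize` is hypothetical.

```python
import math


def split_and_normalize(waveform, segment_len, hop, eps=1e-8):
    """Split a mono waveform into fixed-length segments and scale each one.

    ASSUMPTIONS: unit L2-norm scaling and zero-padding of a short input are
    illustrative choices; trailing samples that do not fill a whole segment
    are dropped. The source does not state the normalization used.
    """
    segments = []
    last_start = max(len(waveform) - segment_len, 0)
    for start in range(0, last_start + 1, hop):
        seg = list(waveform[start:start + segment_len])
        seg += [0.0] * (segment_len - len(seg))        # pad a short input
        norm = math.sqrt(sum(x * x for x in seg))
        segments.append([x / (norm + eps) for x in seg])
    return segments
```

With `hop` smaller than `segment_len` the segments overlap, which is what later makes an overlap-add reconstruction necessary.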

The RNN class mask network (118) is configured to estimate source masks for separating audio sources from the audio input mix. The masks are applied to the audio segments by the source separation component (120). The time domain decoder (122) is configured to reconstruct the audio sources, which can then be output via the signal outputs (124). In the illustrated implementation, the RNN-CASSM (112) is modified to operate in accordance with the core model manipulations (130), which will now be described. Referring to block 1.1, the audio source data is sampled at a 48 kHz sample rate in various implementations. The RNN-CASSM (112) can therefore be trained at frequencies above the audible range so that it recognizes individual stems of audio in the lower frequency range. Implementations of sound separation models trained at sample rates lower than 48 kHz have been observed to produce poor-quality separations on audio samples recorded with older equipment (e.g., Nagra(tm) equipment from the 1960s). The various steps disclosed herein, for example training the source separation model, setting hyperparameters appropriate for that sample rate, and operating the signal processing pipeline, are performed at the 48 kHz sample rate. Other sampling rates, including oversampling, may be used in other implementations consistent with the teachings of this disclosure. In block 1.2, the encoder/decoder framework (e.g., the time domain encoder (116) and the time domain decoder (122)) is set to a step size of one sample (e.g., at the input signal sample rate).

Referring to block 1.3, a modified RNN-CASSM (160) is generated that extends beyond the source separation and noise reduction of traditional implementations. It has been observed that processing audio mixtures with a trained RNN-CASSM can sometimes produce undesirable click artifacts, harmonic artifacts, and wideband noise artifacts. To address this, the audio processing system (100) applies modifications to the RNN-CASSM network (112), as referenced in blocks 1.3a, 1.3b, and/or 1.3c, to reduce and/or avoid artifacts and provide other benefits. The same modifications may also be used for more transformative types of learned processing.

The model manipulations associated with the modified RNN-CASSM (160) will now be described in further detail with reference to blocks 1.3a-c. In block 1.3a, at least some of the encoder and decoder layers are removed and are not used in the modified RNN-CASSM. It is observed that, when the step size is one sample, the learned filters in these removed layers may be redundant. In block 1.3b, the mask application step (e.g., component (120)) is also removed for at least some portion of the audio processing. The modified RNN-CASSM (160) can therefore perform audio source separation without applying a mask. The masking step potentially contributes to the clicking artifacts that are often present in the generated audio stems. To address this issue, the audio processing system can omit some or all of the masking and instead use the output of the RNN class mask network (118) in a more direct manner (e.g., by training the RNN to output one or more separated audio sources). In block 1.3c, a windowing function is applied to the overlap-add steps of the RNN class mask network (e.g., RNN network (162)). If the model output has linear harmonic series "banding" artifacts at frequencies related to the overlap-add segment length, the audio processing system can apply a windowing function to each overlapping segment to smooth out the hard edges when reconstructing the audio signal.
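The windowed overlap-add step of block 1.3c can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the disclosure does not name a window, so a periodic Hann window at 50% overlap is assumed here, and the function names `hann_window` and `overlap_add` are hypothetical.

```python
import math

def hann_window(length):
    """Periodic Hann window (assumed choice); copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def overlap_add(segments, hop, window=None):
    """Rebuild a signal from equal-length segments spaced `hop` samples apart.

    With window=None the segments are summed with hard edges; with a window,
    each segment is tapered before summation.
    """
    seg_len = len(segments[0])
    out = [0.0] * (hop * (len(segments) - 1) + seg_len)
    for i, segment in enumerate(segments):
        for n, value in enumerate(segment):
            weight = 1.0 if window is None else window[n]
            out[i * hop + n] += weight * value
    return out

# Four segments of a constant signal, 8 samples long, 50% overlap.
seg_len, hop = 8, 4
segments = [[1.0] * seg_len for _ in range(4)]
hard_edged = overlap_add(segments, hop)
smoothed = overlap_add(segments, hop, hann_window(seg_len))
```

In the fully overlapped region the windowed sum returns the original amplitude, while the unwindowed sum doubles it and leaves an abrupt step at every segment boundary, which is the kind of hard edge the windowing is meant to soften.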

Referring to block 1.4, another core model manipulation (130) is the use of a separation strength parameter, which allows control of the strength of the separation mask applied to the input signal(s) to generate the separated source(s). To control this strength directly, a parameter is introduced that determines how strongly the separation mask is applied during the forward pass of the model. In one example, the separation strength parameter can be expressed as a function applied to the separation mask of a mask-based source separation model, such as f(M) = M^s, where the mask M takes values in [0, 1] and s is the separation strength parameter. In this example, values of s > 1.0 produce lower mask values and less of the target source when the mask is applied to the input mixture, and values of s < 1.0 produce higher mask values and a combination of the target source with the complement and noise components of the signal.
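The effect of the separation strength parameter s in f(M) = M^s can be checked with a few numbers. The sketch below is illustrative only; the function names are hypothetical, and clamping the mask to [0, 1] and rejecting negative strengths are assumptions of this example.

```python
def apply_separation_strength(mask, strength):
    """Return f(M) = M**s element-wise for mask values in [0, 1]."""
    if strength < 0:
        raise ValueError("strength must be non-negative (assumption of this sketch)")
    # Clamp defensively so that the power stays within [0, 1].
    return [min(max(m, 0.0), 1.0) ** strength for m in mask]

def apply_mask(mixture, mask):
    """Element-wise product of the input mixture and the (reshaped) mask."""
    return [x * m for x, m in zip(mixture, mask)]

mask = [0.0, 0.25, 0.5, 1.0]
stronger = apply_separation_strength(mask, 2.0)  # s > 1: lower mask values
weaker = apply_separation_strength(mask, 0.5)    # s < 1: higher mask values
```

Mask values of exactly 0 or 1 are unchanged for any positive s, so only the intermediate values, where the model is less certain, are pushed toward rejection (s > 1) or toward retention (s < 1).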

The separation strength parameter may be implemented as a helper function of an automated version of the self-iterative processing training (SIPT) algorithm, which is described in more detail below. It will be appreciated that the RNN-CASSM (112) may implement one or more of the core model manipulations (130) disclosed herein and may include additional manipulations consistent with the teachings of the present disclosure.

Referring to FIGS. 3-5, an exemplary process for training an RNN-CASSM network for audio source separation will be described. The training process uses a plurality of labeled machine learning training datasets (310) configured to train a network for audio source separation as described herein. For example, in some implementations, a training dataset may include audio mixes and ground truth labels identifying the source class to be separated from each audio mix. In other implementations, the training dataset may include separated audio stems having audio artifacts generated by the source separation process (e.g., clicks) and/or one or more audio augmentations (e.g., reverb, filters), together with ground truth labels identifying enhanced audio stems from which the identified audio artifact(s) and/or augmentations have been removed.

In operation, the network is trained by feeding labeled audio samples to the network. In various implementations, the network comprises a plurality of neural network models that may be trained separately for specific source separation tasks (e.g., speech separation, foreground speech separation, drum separation, artifact removal, etc.). Training comprises a forward pass through the network to generate audio source separation data. Each audio sample is labeled with a "ground truth" that defines the expected output, which is compared to the generated audio source separation data. If the network's output for an input audio sample deviates from the ground truth, a backward pass through the network can be used to adjust the parameters of the network and correct the error. In various implementations, the output estimate, denoted Ŷ, is compared to the ground truth, denoted Y, using regression losses such as an L1 loss function (e.g., least absolute deviations), an L2 loss function (e.g., least squares), scale-invariant signal-to-distortion ratio (SISDR), scale-dependent signal-to-distortion ratio (SDSDR), and/or other loss functions known in the art. After the network is trained, its accuracy can be measured using a validation dataset (e.g., a set of labeled audio samples that were not used in the training process). The trained RNN-CASSM network can be implemented in a runtime environment to generate distinct audio source signals from an audio input stream. The generated distinct audio source signals may also be referred to as a plurality of generated audio stems. The plurality of generated audio stems may correspond to one or more of the plurality of audio sources of the audio input stream.
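The regression losses named above can be written out directly. The following is a minimal sketch using the commonly used definitions of L1, L2, and scale-invariant SDR on plain Python lists; the function names and the `eps` stabilizer are assumptions, and the disclosure does not specify which formulation it uses.

```python
import math

def l1_loss(estimate, target):
    """Mean absolute deviation between the estimate and the ground truth."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)

def l2_loss(estimate, target):
    """Mean squared error between the estimate and the ground truth."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def si_sdr_db(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB: project the estimate onto the target and
    compare the energy of the projection with the energy of the residual."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target) + eps
    scale = dot / target_energy
    projection = [scale * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual) + eps
    return 10.0 * math.log10(signal_energy / residual_energy + eps)

target = [1.0, -1.0, 1.0, -1.0]
noisy = [1.0, -1.0, 1.0, 0.0]
louder = [2.0 * x for x in noisy]
```

Because SI-SDR is a quality score (higher is better), a training loop would typically minimize its negative. Rescaling the estimate, as with `louder`, leaves the score unchanged, whereas the L1 and L2 losses are sensitive to gain.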

In various implementations, the modified RNN-CASSM network (160) is trained on a plurality (e.g., thousands) of audio samples, including audio samples representing multiple speakers, instruments, and other audio source information under various conditions (e.g., including various noise conditions and audio mixes). From the error between the separated audio source signals and the labeled ground truth associated with the audio samples from the training dataset, the deep learning model learns parameters that allow it to separate audio source signals. The RNN-CASSM (120) and the modified RNN-CASSM (160) may also be referred to as the inference model (120) and the modified inference model (160), and/or the trained audio separation model (120) and the updated audio separation model (160).

FIG. 3 is a diagram of machine learning datasets (310) and a machine learning training dataloader (350), showing the datasets and dataset manipulations used during training by the audio processor, which may provide improvements in output signal quality and source separation and/or useful functionality. In the illustrated implementations, the RNN-CASSM network is trained to separate audio sources from a single-track recording of a music recording session that includes a mixture of people singing and speaking, instruments being played, and various environmental noises.

Referring to block 2.1, the training dataset used in the illustrated implementation comprises a 48 kHz speech dataset. For example, the 48 kHz speech dataset may comprise the same speech recorded simultaneously at various microphone distances (e.g., a close microphone and a more distant microphone). In one test implementation, the 48 kHz speech dataset included 85 different speakers with more than 20 minutes of speech for each speaker. In various implementations, an exemplary speech dataset may be generated by recording a large number of adult male and female speakers (e.g., 10, 50, 85, or more) for extended durations (e.g., 10 minutes, 20 minutes, or more each) at sample rates of 48 kHz or higher. It will be appreciated that other dataset parameters may be used to generate the training dataset in accordance with the teachings of the present disclosure.

Referring to block 2.2, the training dataset further comprises a non-speech music and noise dataset comprising sections of input audio, such as digitized mono audio recordings originally recorded on analog media. In some implementations, this dataset may include sections of recorded music, non-vocal sounds, background noise, audio media artifacts from the digitized audio recordings, and other audio data. Using this dataset, the audio processing system can more easily separate the voices of speakers of interest from other voices, music, and background noise in the digitized recordings. In some implementations, this may include using manually collected segments of a recording that have been manually noted as lacking speech or another audio source class, and labeling those segments accordingly.

Referring to block 2.3, a dataset is generated and modified using a progressive self-iterative dataset generation process that uses unseen target mixtures (e.g., mixtures unknown to the audio processing system). The generated dataset may include a labeled dataset produced by processing an unlabeled dataset (e.g., the unseen target mixtures from which sources are to be separated) through one or more neural network models to generate an initial classification. This "roughly separated data" is then processed through a cleaning step configured to select among the "roughly separated data" and retain the most useful portions according to a utility metric. For example, the performance of a training dataset may be measured by applying a validation dataset to models trained on various training datasets that include the "roughly separated data", and using the calculated validation error to determine which data samples contribute to better performance and which contribute to worse performance. The utility metric may be implemented as a function that estimates a quality metric on the roughly separated data to identify "low quality" fine-tuning data to be discarded before the next fine-tuning iteration. For example, a moving root mean square (RMS) window function can be computed on the outputs of the network to identify sections of the output where the RMS metric is above a calibrated threshold for a certain (or minimum) duration in samples. This metric can be used, for example, to identify low-amplitude sections of the roughly separated source output, where artifacts are more likely to occur. The threshold and the minimum duration can be user-adjustable parameters that allow tuning of which data is discarded.
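The moving-RMS utility metric can be sketched as below. This is an illustrative reading of the description, assuming equal-length overlapping windows, a fixed threshold, and a minimum run length counted in windows; the function names and all parameter values are hypothetical.

```python
import math

def moving_rms(samples, window, hop):
    """RMS of equal-length overlapping windows spaced `hop` samples apart."""
    values = []
    for start in range(0, len(samples) - window + 1, hop):
        frame = samples[start:start + window]
        values.append(math.sqrt(sum(x * x for x in frame) / window))
    return values

def sections_to_keep(rms_values, threshold, min_windows):
    """Return (first, last) window-index ranges where the RMS stays above
    `threshold` for at least `min_windows` consecutive windows."""
    kept, run_start = [], None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, value in enumerate(rms_values + [float("-inf")]):
        if value > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_windows:
                kept.append((run_start, i - 1))
            run_start = None
    return kept

# Silence, a 16-sample burst at amplitude 0.5, then silence again.
stem = [0.0] * 8 + [0.5] * 16 + [0.0] * 8
rms = moving_rms(stem, window=4, hop=2)
kept = sections_to_keep(rms, threshold=0.4, min_windows=3)
```

Everything outside the returned ranges would be discarded as low-amplitude material in which separation artifacts are more likely. Raising `threshold` or `min_windows` makes the culling stricter.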

The progressive self-iterative dataset built from unseen target mixtures can be generated using a self-iterative dataset generation loop (420), as illustrated in FIG. 4. In various implementations, previous recordings containing sources that the trained source separation model has not yet seen are less likely to be separated successfully by the model. Existing datasets may not be sufficient to train a robust separation model for the targeted sources, and there may be no opportunity to capture new recordings of those sources. Instead of capturing new recordings, additional training data can be manually labeled from instances of isolated sources in the previous recordings. For example, isolated speech from an identified speaker, isolated audio of an identified instrument recorded on similar equipment in a similar environment, and/or other available audio segments similar to the sources to be separated can be added to the training data manually and/or automatically (e.g., based on metadata labeling of the audio sources, such as identification of the source, source class, and/or environment). This additional training data can be used to fine-tune the model and improve its performance on the previous recordings. However, this labeling process can require considerable time and manual effort, and the previous recordings may not contain enough isolated instances of the sources to produce a sufficient amount of additional training data.

The illustrated generation refinement tool can overcome these difficulties. In one approach, a coarse general model (410) is trained on a general training dataset. The general training dataset may include labeled source audio data and labeled noise audio data. The general model (410) may be referred to as a general source separation model (410) or a trained audio source separation model. The training dataset may include a plurality of datasets, each of which includes labeled audio samples configured to train the system to solve a source separation problem. The plurality of datasets may include a speech training dataset comprising a plurality of labeled speech samples, and/or a non-speech training dataset comprising a plurality of labeled music and/or noise data samples. Available previous recordings containing the unseen audio mixtures to be separated are processed with the general model in process (422) to produce two labeled datasets of isolated audio (e.g., audio stems): a roughly separated previous-recordings source dataset (424) and a roughly separated previous-recordings noise dataset (426).

In various implementations, other training datasets may be used that provide a collection of labeled audio samples selected to train the system to solve a particular problem (e.g., speech vs. non-speech). In some implementations, for example, the training datasets may include (i) music vs. sound effects vs. foley, (ii) datasets for the various instruments in a band, (iii) multiple human speakers separated from each other, (iv) sources of room reverberation, and/or (v) other training datasets. Next, the results are culled using a threshold metric (process 428) to remove audio windows below a user-selectable root mean square (RMS) level. In some implementations, a moving RMS may be computed by dividing the audio data into overlapping windows of equal duration and computing the RMS for each window. The RMS level may be referred to as a utility metric or a quality metric, and it is one option among alternative utility or quality metrics. The quality metric may be computed based on the associated plurality of audio stems.

Next, in process 430, a new model is trained using the culled self-iterative dataset (e.g., the culled results are added to the audio training dataset; the culled self-iterative dataset is also referred to as the culled dynamically evolving dataset) to generate an improved model (432) with better performance when processing the recordings. The improved model (432) can be configured to reprocess the audio input stream to generate a plurality of enhanced audio stems. The improved model is an update of the trained audio source separation model and may be referred to as an updated audio source separation model (432).

In some implementations, the audio training dataset may be curated during the iterative fine-tuning process to remove data that is not relevant to the target input mixture. For example, the various sources in the input mixture may be identified/classified, leaving certain other source categories that were not identified/classified or are otherwise not relevant to the source separation task. Training data associated with these "irrelevant" source categories (e.g., categories not found in the target mixture, categories identified by the user as irrelevant to the source separation task, and/or other irrelevant source categories defined by other criteria) is culled from the audio training dataset, allowing the training dataset to become increasingly specific to the content of the target input mixture.

This process (420) may be repeated iteratively, with each iteration improving the separation quality of the model (e.g., fine-tuning to improve the accuracy and/or quality of the source separation). In a subsequent iteration, the process (420) may use an additional RMS level that is greater than the previous RMS level. Through this process, the initial general source isolation or separation model can be improved in a more automated manner. Looping at various stages has been observed to show improvements over larger related mixtures. The general model (410) and the improved model (432) may also be referred to as the inference model (410) and the modified inference model (432), and/or the trained audio separation model (410) and the updated audio separation model (432).

Improvements in separation quality (e.g., audio fidelity) may be measured by the system and/or evaluated by a user who provides feedback to the system via a user interface and/or supervises one or more steps of the process (420). In some implementations, the process (420) may use a combination of algorithms and/or user evaluation to compute a Mean Opinion Score (MOS) that estimates the separation quality. For example, an algorithm may estimate the amount of artifacting produced during the source separation operation, which in turn relates to the overall quality of the network output. In some implementations, estimating the amount of artifacting includes feeding the separated sources through a neural network model trained to separate audio artifacts from the signal, which enables measurement of the presence and/or intensity of such artifacts. The intensity of the audio artifacts may be determined at each iteration and tracked across iterations to fine-tune the model. In some implementations, the iterative process continues until the estimated separation quality stops improving across iterations and/or until the estimated separation quality satisfies one or more predetermined quality thresholds.
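The control flow of the self-iterative loop (420), with its stopping rule, can be sketched as follows. This is a schematic under stated assumptions: `separate`, `cull`, `fine_tune`, and `estimate_quality` are hypothetical callables standing in for processes 422, 428, 430, and the quality estimate, and the default limits are illustrative.

```python
def self_iterative_training(model, mixtures, separate, cull, fine_tune,
                            estimate_quality, max_iterations=10,
                            min_improvement=1e-3):
    """Repeat separate -> cull -> fine-tune until quality stops improving.

    Returns the best model found and the quality recorded after each
    accepted iteration (the first entry is the starting quality).
    """
    best_quality = estimate_quality(model, mixtures)
    history = [best_quality]
    for iteration in range(max_iterations):
        # Roughly separate the unseen target mixtures with the current model.
        stems = [separate(model, mixture) for mixture in mixtures]
        # Keep only the useful stems; the iteration index lets the caller
        # raise the culling level on later passes.
        kept = cull(stems, iteration)
        candidate = fine_tune(model, kept)
        quality = estimate_quality(candidate, mixtures)
        if quality - best_quality < min_improvement:
            break  # estimated separation quality has stopped improving
        model, best_quality = candidate, quality
        history.append(quality)
    return model, history
```

Passing the iteration index to `cull` is one way to realize the rising RMS level described for subsequent iterations. A fixed quality threshold could be added as a second exit condition alongside the plateau test.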

Referring to blocks 2.4 and 2.4a, the machine learning training dataloader (350) is configured to match, during training, the acoustic qualities of the unseen target mixtures (e.g., perceived distance from the microphone, filtering, reverberation, echoes, nonlinear distortion, spectral distribution, and/or other measurable audio qualities). A challenge in training an effective supervised source separation model is that the dataset examples should ideally be curated to match the qualities of the sources in the target mixtures as closely as possible. For example, if a speaker is to be isolated from a mixture in which the speaker talks into a microphone in a reverberant auditorium, and the recording was captured at some distance by a recording device in the audience, the result can be compared with how the same speaker would sound when recorded speaking directly into a microphone in a neutral, non-reverberant space, as a high-quality speech dataset might be recorded. The goal is then to add augmentations to the high-quality speech dataset samples during training that generally reduce the difference from the target input mixtures. In this example, a "reverberation" augmentation may be added to simulate the reverberation of the auditorium, a "nonlinear distortion" augmentation may be added to simulate the speaker's voice being amplified by the sound system, and a "filter" augmentation may be added to simulate the distance of the sound system from the recording device.
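The three augmentations in the auditorium example can be approximated with elementary signal operations. The sketch below is illustrative only: modeling reverberation as convolution with a room impulse response, amplification as a tanh soft clipper, and distance as a one-pole low-pass filter are assumptions of this example, as are the function names and the order of the chain.

```python
import math

def convolve(signal, impulse_response):
    """Direct-form convolution, used here to apply a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out

def soft_clip(signal, drive):
    """Nonlinear distortion: tanh saturation with an adjustable drive."""
    return [math.tanh(drive * x) for x in signal]

def one_pole_lowpass(signal, alpha):
    """Simple filter: smaller alpha (0 < alpha <= 1) removes more high end."""
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out

def auditorium_chain(clean, impulse_response, drive, alpha):
    """Degrade a clean sample to resemble the auditorium recording."""
    amplified = soft_clip(clean, drive)                   # sound system
    reverberant = convolve(amplified, impulse_response)   # room reverberation
    return one_pole_lowpass(reverberant, alpha)           # distance to recorder
```

During training, the parameters (`impulse_response`, `drive`, `alpha`) would typically be drawn at random for each example so that the model sees a spread of conditions around the target mixture.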

In various implementations, the solution includes a hierarchical mix bus schema that includes pre- and post-mixture augmentation modules. Producing ideal target mixtures by hand can be tedious or impractical whenever a new style of mixture is needed for training. The hierarchical mix bus schema allows easy definition of arbitrarily complex, randomized "sources" and "noise" mixtures during supervised source separation training. The dataloader roughly matches qualities such as filtering, reverberation, relative signal levels, and other audio qualities during training for improved source separation results. The machine learning dataloader uses a hierarchical schema that allows easy definition of the "sources" and "noise" mixtures that are dynamically generated from the source data while the model is being trained. The mix buses allow optional augmentations, such as reverberation or filters, with randomized parameters. Given properly classified dataset media as raw material, this allows easy setup of training datasets that mimic the desired source separation target mixtures.

An exemplary simplified schema representation (550) is illustrated in FIG. 5. The training mixtures schema includes separate options for sources and noise, and includes criteria such as dB range, the probability associated with selecting a source, room impulse response, filter(s), and other criteria.
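A hierarchical mix bus schema of this kind could be expressed as nested data with randomized sampling. The sketch below is a hypothetical illustration and does not reproduce FIG. 5: the field names (`gain_db`, `probability`, `augmentations`, `children`), the dataset names, and all numeric ranges are assumptions.

```python
import random

# Two top-level buses, each with a dB range, optional augmentation labels,
# and child entries that are included with a given probability.
SCHEMA = {
    "sources": {
        "gain_db": (-6.0, 0.0),
        "augmentations": ["room_impulse_response"],
        "children": [
            {"name": "close_speech", "probability": 1.0, "gain_db": (-3.0, 0.0)},
            {"name": "distant_speech", "probability": 0.5, "gain_db": (-12.0, -6.0)},
        ],
    },
    "noise": {
        "gain_db": (-24.0, -12.0),
        "augmentations": ["filter"],
        "children": [
            {"name": "music", "probability": 0.7, "gain_db": (-6.0, 0.0)},
            {"name": "room_tone", "probability": 0.9, "gain_db": (-6.0, 0.0)},
        ],
    },
}

def sample_bus(bus, rng):
    """Draw one randomized realization of a bus: the active children and
    their total gain in dB (bus gain plus child gain)."""
    low, high = bus["gain_db"]
    bus_gain = rng.uniform(low, high)
    active = []
    for child in bus["children"]:
        if rng.random() < child["probability"]:
            child_low, child_high = child["gain_db"]
            active.append((child["name"], bus_gain + rng.uniform(child_low, child_high)))
    return active

def db_to_linear(db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

rng = random.Random(0)
example_sources = sample_bus(SCHEMA["sources"], rng)
example_noise = sample_bus(SCHEMA["noise"], rng)
```

Each call to `sample_bus` yields a different mixture recipe, so the dataloader can generate a fresh "sources" and "noise" combination for every training example without storing pre-rendered mixtures.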

Referring to block 2.4b, the dataloader further provides pre/post mixture augmentations, including filters, nonlinear functions, and convolutions, that are applied to the target mixtures during training. In various implementations, relevant augmentations are identified and added to the training dataset, and the separated audio stems are post-processed (e.g., using the pipelines described herein). The source separation model can be trained with the additional goal of transforming the separated source using the augmentations. In some implementations, the system can be trained to separate the source exactly as it sounds in the input mixture. In some implementations, the system can be further trained to improve some qualities of the separated source by applying appropriate augmentations, for example augmentations that yield the smallest difference from the target mixture when using the available training datasets. The difference and the appropriate augmentations can be estimated algorithmically and/or through evaluation by a user. For example, a voice in the input mixture may be filtered and hard to understand because it was recorded from behind a closed door. In this example, the separated source can be augmented (e.g., by degrading the input voice dataset to generally match the sound of a voice behind a closed door). However, because the separated target source output is not augmented during training in this example (e.g., an augmented voice input paired with the corresponding high-quality voice target output dataset), the network is trained to approximate the inverse of this transformation, much as post-processing would.

In some implementations, the target source may be a transformed version of its current representation in the mixture. For example, a bandwidth-limited recording may need to be restored to a fuller frequency spectrum, or an occluded background speaker may need to be separated with increased proximity fidelity. These requirements may be decided by the user, or may be left to the transformation model itself to resolve automatically based on how the input differs from the target output training set. For example, if the model has been trained, using randomly augmented speech datasets, to output high-quality close-proximity speech without much reverberation, then inputting mixtures that already contain such speech will tend to produce minimal changes to those inputs. Inputting mixtures that contain speech deviating from these qualities, however, will tend to transform the speech so that it resembles high-quality close-proximity speech without much reverberation.

Augmentation modules can be used by the dataloader to generate training examples consisting of an augmented source as the input together with an alternatively augmented version of the same source as the target output. This allows for transformation examples in which the target source is represented in an alternatively augmented context when it is part of the training mixture. When used while training the modified RNN-CASSM, this allows the audio processing system to learn operations such as "defiltering", "dereverberation", and deep restoration of heavily occluded target sources.
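The pairing of a degraded input with a clean target can be sketched as follows for the defiltering case. This is a minimal illustration only: the one-pole low-pass filter stands in for whatever filter augmentation an implementation uses, and the function names and the `alpha` range are assumptions.

```python
import math
import random


def one_pole_lowpass(signal, alpha):
    """Simple smoothing filter; a smaller alpha removes more high-frequency content."""
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out


def make_defiltering_example(source, noise, rng):
    """Return (input_mixture, target); only the input side is degraded."""
    alpha = rng.uniform(0.05, 0.3)            # randomised filter strength (assumed range)
    degraded = one_pole_lowpass(source, alpha)
    mixture = [d + n for d, n in zip(degraded, noise)]
    target = list(source)                     # the target output stays unfiltered
    return mixture, target


# Toy usage: a high-frequency tone with silent noise.
tone = [math.sin(math.pi / 2 * i) for i in range(64)]
mixture, target = make_defiltering_example(tone, [0.0] * 64, random.Random(1))
```

Because the network only ever sees the filtered version on the input side and the unfiltered version as the target, minimising the training loss pushes it toward undoing the filter. Swapping the filter for a reverberation or distortion augmentation would give the dereverberation and distortion-recovery variants.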

Application examples (500) in the illustrated implementations include: (i) defiltering, which includes augmenting the mixtures with a filter; (ii) dereverberation, which includes augmenting the mixture with reverberation; (iii) background speaker recovery, which includes augmenting the sources with a filter and reverberation; (iv) distortion recovery, which includes augmenting the mixture with distortion; and (v) gap recovery, which includes augmenting the mixture with gaps.

With reference to FIGS. 6 and 7, exemplary implementations of machine learning training methods (600) will now be described. In these examples, the machine learning training methods are described in relation to training methods that contribute to improvements in output signal quality and source separation and/or to useful functionality. Referring to block 3.1, a first machine learning training method includes upscaling the sample rate of a trained network (e.g., from 24 kHz to 48 kHz). Due to limitations in time, computing resources, and the like, a model may be trained at 24 kHz using a 24 kHz dataset, but its output quality is then limited. An upscaling process performed on the model trained at 24 kHz allows it to operate at 48 kHz. One exemplary process includes retaining the inner blocks of learned parameters of the masking network while discarding the encoder/decoder layers and the layers directly connected to them. That is, only the inner separation layers are transplanted into a newly initialized model that uses a 48 kHz encoder/decoder. Next, the untrained 48 kHz encoder/decoder and the layers directly connected to it are fine-tuned using a 48 kHz dataset while the inherited network remains frozen. This continues until acceptable validation/loss values (e.g., L1, L2, SISDR, SDSDR, or another loss calculation) reappear during training/validation, indicating that the new layers have adapted to the inherited layers. For example, acceptable validation/loss values can be determined by observing how values seen in previous training sessions trended relative to model performance, by comparison with a predetermined threshold loss value, or by another approach. During training, loss values ideally tend toward a minimum, but in practice they can also be useful for signaling significant problems, such as when the loss begins to move away from the minimum. It has also been observed that even when loss values do not improve during training, model performance, as measured by the quality of source separation, can still improve with continued training.

Finally, fine-tuning continues across all layers so that the model can improve further at 48 kHz, with the end result being a well-performing 48 kHz model. In some implementations, the system is trained to operate at a high signal processing sample rate by training at a lower sample rate, inheriting the appropriate layers into a higher sample rate model architecture, and performing a two-stage training process in which, first, the untrained layers are trained while the parameters of the inherited layers are frozen and, second, the entire model is fine-tuned until its performance matches or exceeds that of the lower sample rate model. This process can be performed over many iterations, including but not limited to:

a) Train at 6 kHz

b) Upscale to 12 kHz

c) Upscale to 24 kHz

d) Upscale to 48 kHz
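The transplant and two-stage schedule described above can be sketched in a framework-neutral way, with model parameters held in plain dictionaries. The parameter names, the `separator.` prefix used to mark inner separation layers, and the helper functions are all hypothetical; a real implementation would operate on the parameter containers of whichever deep learning framework is in use.

```python
def transplant_inner_layers(low_rate_params, high_rate_params, inner_prefix="separator."):
    """Copy inner separation-layer parameters into a freshly initialised higher-rate model.

    Returns the merged parameters and the set of inherited names, which start out frozen.
    """
    merged = dict(high_rate_params)
    frozen = set()
    for name, value in low_rate_params.items():
        # Encoder/decoder parameters are skipped, so they keep their fresh initialisation.
        if name.startswith(inner_prefix) and name in merged:
            merged[name] = value
            frozen.add(name)
    return merged, frozen


def trainable_names(params, frozen, stage):
    """Stage 1 trains only the new layers; stage 2 fine-tunes the whole model."""
    if stage == 1:
        return sorted(n for n in params if n not in frozen)
    return sorted(params)


# Toy parameter sets standing in for a 24 kHz model and a new 48 kHz model.
low = {"encoder.w": [0.1], "separator.block0.w": [0.5, 0.6], "decoder.w": [0.2]}
high = {"encoder.w": [0.0, 0.0], "separator.block0.w": [0.0, 0.0], "decoder.w": [0.0, 0.0]}
merged, frozen = transplant_inner_layers(low, high)
stage_one = trainable_names(merged, frozen, stage=1)
stage_two = trainable_names(merged, frozen, stage=2)
```

Repeating the same two functions at each step of a chain such as 6 kHz, 12 kHz, 24 kHz, 48 kHz would give the iterative upscaling listed above, with each stage ending once validation loss returns to an acceptable level.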

Referring to block 3.2, multi-voice source mixtures are used to improve the performance of a speech separation model (e.g., source = foreground, background, and distant voices versus noise and music mixtures). A speech model initially trained on a single voice versus noise/music mixtures may not perform well: the processed results may fail to extract speech consistently from the original source media and may suffer from significant artifacting. Instead of training on a single voice versus mixtures, the audio processing system can provide substantially improved results by using multiple voice sources layered to simulate variations in proximity, such as foreground and background voices. This approach can also be applied to instruments, for example to multiple overlaid guitar samples in a mixture rather than only one sample at a time. In various implementations, training samples covering a variety of layering scenarios can be selected to match and/or approximate an unseen audio mixture (e.g., based on user input, identified source classes, and/or analysis of the unseen audio mixture during iterative training).

Examples of training mixtures (700) include speech separation training mixtures (702) that comprise mixtures of source and noise (704). The source mixture (706) may include a mixture of foreground speech (708), background speech (710), and distant speech (712) in augmented random combinations. The noise mixture (714) includes a mixture of instruments (716), room tones (718), and hiss (720) in augmented random combinations.
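A minimal sketch of composing one such training example is shown below: several speech layers are summed at random levels to form the target source, and several noise layers are summed to form the interference. The layer names, the dB ranges, and the helper functions are assumptions made for illustration; augmentation of the individual layers is omitted.

```python
import random


def db_to_gain(db):
    """Convert a level in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def mix_layers(layers, rng, db_ranges):
    """Sum equal-length layers, each scaled by a random gain drawn from its dB range."""
    length = len(next(iter(layers.values())))
    mixed = [0.0] * length
    for name, samples in layers.items():
        low, high = db_ranges[name]
        gain = db_to_gain(rng.uniform(low, high))
        for i, x in enumerate(samples):
            mixed[i] += gain * x
    return mixed


def make_speech_training_example(speech_layers, noise_layers, rng, db_ranges):
    """Return (input_mixture, target_source); the target is the layered speech only."""
    source = mix_layers(speech_layers, rng, db_ranges)
    noise = mix_layers(noise_layers, rng, db_ranges)
    mixture = [s + n for s, n in zip(source, noise)]
    return mixture, source


# Hypothetical level ranges: closer voices are louder than distant ones.
DB_RANGES = {
    "foreground": (-6.0, 0.0), "background": (-24.0, -12.0), "distant": (-40.0, -30.0),
    "instruments": (-20.0, -6.0), "room_tone": (-40.0, -25.0), "hiss": (-50.0, -35.0),
}
```

Here the whole layered speech mixture is the separation target, so the model learns to keep background and distant voices rather than treating them as noise.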

With reference to FIGS. 8-10, implementations of machine learning processing (800) will now be described in relation to processing methods that use machine learning models and contribute to improvements in output signal quality and source separation and/or to useful functionality (e.g., as previously discussed herein). Referring to block 4.1, the machine learning processing may include, as an additional output, the complement of the sum of the separated source(s). In other implementations, the model outputs the speech and the discarded music/noise. The complement output may subsequently be used in various processes to further process/separate the sources that remain in it.
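One simple way to obtain such a complement, assuming time-aligned signals of equal length and a straightforward time-domain subtraction (the disclosure does not state how the complement is computed), is sketched below. The function name is hypothetical.

```python
def complement_output(mixture, separated_sources):
    """Residual left after subtracting the sum of the separated sources from the mixture."""
    residual = list(mixture)
    for source in separated_sources:
        for i, x in enumerate(source):
            residual[i] -= x
    return residual


# Toy usage: two separated stems removed from a three-sample mixture.
remaining = complement_output([1.0, 2.0, 3.0], [[0.5, 1.0, 1.0], [0.25, 0.5, 1.0]])
```

The residual can then be passed to a further separation model, which is how the sequential pipelines described later reuse the complement.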

Referring to block 4.2, a machine learning post-processing model cleans up artifacts introduced by the machine learning processing, such as clicks, harmonic distortion, ghosting, and wideband noise (the artifacts may be determined, for example, as previously discussed with reference to FIG. 4). A trained source separation model may exhibit artifacting such as clicking, harmonic distortion, wideband noise, and "ghosting", in which a sound is partially split between the target and complement outputs. To make such outputs usable in the context of a high-quality soundtrack, laborious cleanup with conventional audio restoration software typically has to be attempted, and this can introduce undesirable qualities into the restored audio. The problem can be addressed by post-processing the processed audio with a model trained on a dataset of processing artifacts. The processing artifacts dataset may be generated by the problematic model itself.

Once trained, the post-processing model (910) can be reused for all similar models. In the illustrated implementation, the input mixture (950) is processed by a generic model (952) that produces source separation outputs containing machine learning artifacts (step 954). A post-processing step (956) removes the artifacts to produce enhanced outputs (960). The post-processing model (910) includes generating (step 912) a dataset (914) of isolated machine learning artifacts. The machine learning artifacts can include clicks, ghosting, broadband noise, harmonic distortion, and other artifacts. The isolated machine learning artifacts (912) are used in step 916 to train a model that removes the artifacts.

Referring to block 4.3, in some implementations a user-guided, self-iterative processing/training approach may be used. The user can guide and contribute to the fine-tuning of a pre-trained model over the course of a processing/editing/training loop, and the fine-tuned model can then be used to render better source separation results than were possible with the generic model. The ability to fine-tune the model can be provided to the user in order to solve source separation problems that the pre-trained model cannot solve. Processing an unseen mixture with a source separation model is not always successful, typically because sufficient training data is lacking. In one solution, the audio processing system uses a method that allows the user to guide and fine-tune a pre-trained model, which can then be used to render better results.

In an exemplary method: i) the user processes the input media; ii) the user is given the opportunity to evaluate the output, or has the option of having this evaluation performed by an algorithm that measures threshold parameters for some metrics (e.g., measured using the moving RMS window previously discussed and/or other measures); iii) if the output is deemed acceptable, processing ends here; otherwise iv) the user is given the opportunity to manipulate the imperfect outputs using temporal and/or spectral editing and/or culling/augmentation algorithms, essentially selecting the sections of the outputs that will be most helpful during the upcoming steps; v) the media is now considered for inclusion in the training dataset; vi) the user is given the opportunity to add their own auxiliary dataset; vii) the trained model is fine-tuned according to the user's hyperparameter preferences; viii) model performance is validated to confirm that results have improved over the previous iteration; and ix) the process is repeated. In various implementations, hyperparameters associated with fine-tuning training may include parameters such as training segment length, epoch duration, training scheduler type and parameters, optimizer type and parameters, and/or other hyperparameters. The hyperparameters may initially be based on a set of predetermined values and may then be modified by the user for fine-tuning training.

An exemplary user-guided, self-iterative process (1000) is illustrated in FIG. 10. The process (1000) starts with a generic model (1002), which may be implemented as a pre-trained model as previously discussed. An input mixture (1004) containing an unseen source mixture, such as a single-track audio signal or a plurality of single-track audio signals, is processed by the generic model (1002) in step 1006 to produce audio signals separated from the mixture, including machine-learning-separated source signals (1008) and machine-learning-separated noise signals (1010). In step 1012, the results are evaluated to determine whether the separated audio sources are of sufficient quality (e.g., by comparing an estimated MOS and/or another quality measure to a threshold as described above). If the results are determined to be good, the separated sources are output in step 1014.

If it is determined that the separated audio source signals need further improvement, one or more of the outputs are prepared in step 1016 for inclusion in a training database for fine-tuning. In the automated fine-tuning system (1018), the machine-learning-separated source (1008) and noise (1010) are used directly as the fine-tuning dataset (1034) (step 1022). In step 1036, the fine-tuning dataset is optionally culled based on a user-selected threshold of a source separation metric (e.g., comparing a moving RMS window and/or another separation metric to a threshold as described above). Training is then performed in step 1038 to fine-tune the model. The fine-tuned model (1032) is then applied to the input mixture in step 1006.
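The automated branch of this loop (separate, evaluate, cull, fine-tune, reprocess) can be outlined as below. This is a sketch under several assumptions: the model, the separation call, the quality check, and the fine-tuning step are passed in as placeholder callables; culling keeps only windows whose source RMS reaches a threshold, which is one plausible reading of the moving RMS window criterion; and `max_rounds` is an added safeguard that the source does not mention.

```python
def moving_rms(signal, window):
    """RMS level over consecutive non-overlapping windows."""
    values = []
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        values.append((sum(x * x for x in chunk) / window) ** 0.5)
    return values


def cull_segments(source, noise, window, rms_threshold):
    """Keep (source, noise) windows whose source RMS reaches the threshold."""
    kept = []
    for idx, level in enumerate(moving_rms(source, window)):
        if level >= rms_threshold:
            start = idx * window
            kept.append((source[start:start + window], noise[start:start + window]))
    return kept


def self_iterate(mixture, model, separate, is_good_enough, fine_tune,
                 window, rms_threshold, max_rounds=5):
    """Reprocess the same mixture, fine-tuning on the model's own culled outputs."""
    source, noise = separate(model, mixture)
    for _ in range(max_rounds):
        if is_good_enough(source, noise):      # evaluation, cf. step 1012
            break
        dataset = cull_segments(source, noise, window, rms_threshold)  # cf. step 1036
        model = fine_tune(model, dataset)      # cf. step 1038
        source, noise = separate(model, mixture)
    return model, source, noise
```

In the user-guided branch, the `dataset` passed to `fine_tune` would additionally reflect the user's temporal and spectral edits, added clips, and augmentation settings.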

In the user-guided fine-tuning system (1020), the user can select portions of an audio clip to include in or omit from the fine-tuning dataset (step 1024 - Temporal Edit). The user can also select portions of a clip together with frequency/time window selections to include in or omit from the fine-tuning dataset (step 1026 - Spectral Edit). In some implementations, the user can provide additional audio clips to augment the fine-tuning dataset (step 1028 - Add to Dataset). In some implementations, the user provides equalization, reverberation, distortion, and/or other augmentation settings for the fine-tuning dataset (step 1030 - Augmentations). After the fine-tuning dataset (1034) is updated, the training process continues through steps 1036 and 1038 to produce the fine-tuned model (1032).

Referring to block 4.4, an animated visual representation of model fine-tuning progress may be implemented. While user fine-tuning trains the model to resolve the source separation of a particular media clip, the model's progressive outputs are displayed to indicate the model's performance and aid the user's decision-making. In some implementations, for example, the interface may be displayed in a window with an associated tool icon and may show a periodically updated, animated spectrogram of the estimated outputs computed by the fine-tuned model, which is tested periodically while fine-tuning is in progress. This interface allows the user to visually assess how well the model is currently performing in various regions of the input mixture. The interface may also facilitate user interactions, such as allowing the user to experiment with time/frequency selections of these estimated outputs through interactions with the spectrogram window.

Referring to block 4.5, user-guided augmentations for fine-tuning training may be implemented. To improve results when enhancing/separating a targeted recording that has particular characteristics, such as reverberation, filtering, or nonlinear distortion, the audio processing system may provide the user with tools to control the underlying algorithm that guides and contributes to the selection of augmentations, such as noise and/or reverberation, filtering, and nonlinear distortion, during step 1020 of the loop described in block 4.3. The user can guide and contribute to the selection of the augmentations and/or augmentation parameters used while fine-tuning a pre-trained model in the processing/editing/training loop. The augmentations include user-controllable randomization settings for each parameter (e.g., values affecting various aspects of an augmentation, such as the strength, density, modulation, and/or decay of a reverberation augmentation) to help generalize or narrow the targeted behavior after fine-tuning. This extends the control the user has over fine-tuning training, allowing them, for example, to specifically match the filtered sound/reverberation in the targeted recording. Where random processes or randomization are mentioned, a pseudorandom process or an arbitrary selection process may suffice. In some implementations, the augmentation parameters are automatically matched to the input source mixture using one or more algorithms. In some implementations, automatic matching is used to achieve an approximate match as a starting point. For example, spectral analysis of the target input mixture can be combined with analysis of random dataset samples to produce a set of frequency band difference scores, which can then be used to minimize the difference between the randomly augmented dataset samples and the target input mixture by adjusting various parameters of the dataset augmentation filter based on the values of the scores.
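The automatic matching idea can be illustrated with a small sketch that assumes per-band energies (in dB) have already been measured for the target mixture and for a handful of dataset samples. The score definition (target minus dataset mean), the proportional update rule, and the `step` parameter are all assumptions for illustration rather than the method of the disclosure.

```python
def band_difference_scores(target_band_db, sample_band_db_list):
    """Per-band difference (dB) between the target and the mean of the dataset samples."""
    n = len(sample_band_db_list)
    scores = []
    for band, target_db in enumerate(target_band_db):
        mean_db = sum(sample[band] for sample in sample_band_db_list) / n
        scores.append(target_db - mean_db)
    return scores


def update_filter_gains(filter_gains_db, scores, step=0.5):
    """Move each band gain of the augmentation filter part-way toward closing the gap."""
    return [g + step * s for g, s in zip(filter_gains_db, scores)]


# Toy usage: the target has far less high-band energy than the dataset samples.
scores = band_difference_scores([-10.0, -30.0], [[-10.0, -10.0], [-14.0, -14.0]])
gains = update_filter_gains([0.0, 0.0], scores)
```

Re-measuring the augmented samples and repeating the update until the scores are small would give a rough starting match, which the user could then refine by hand.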

Referring to FIGS. 11A and 11B, exemplary implementations of a machine learning application (1100) according to one or more implementations will now be described. The audio processing systems and methods disclosed herein may be used together with other audio processing applications that contribute to improvements in, and/or useful functionality for, sound post-production editing workflows.

Referring to block 5.1, a plug-in, such as an Avid Audio Extension (AAX) plug-in, is provided that is hosted in a digital audio workstation (DAW) (e.g., the digital audio workstation sold under the brand name PRO TOOLS may be used in one or more implementations) and that allows a user to send audio clips to a standalone application in which the clips can subsequently be processed with machine learning models. The plug-in can return any number of stem splits back to the DAW environment.

Referring to block 5.2, implementations of the present disclosure have also been used in an application that loads/receives media to be processed by a user-selected machine learning scheme (e.g., the application sold under the brand name JAM LAB with JAM CONNECT, or a similar application). In some implementations, a plurality of client machines (1102A-C) accessing the client software (e.g., JAM LAB with JAM CONNECT) and processing nodes (1106A-D) are configured to access a job manager/database (1104) in order to access both the client software and the machine learning (ML) application (1100) disclosed herein.

With reference to FIG. 12, an exemplary processing flow will now be described. A system running a digital audio workstation (1202) (e.g., PRO TOOLS with JAM CONNECT) is configured to send audio clips to a client application (1208) (e.g., including JAM LAB). Through the client application, a single model can be selected from a list of categories (1210) for processing/separating the audio sources in the audio clips. The types of stems are also selected in step 1212 in order to construct a multi-model scheme. The audio clips and the scheme are sent (in step 1214) to a job manager/database (1216), which in step 1218 manages and distributes the scheme processing across the user's available processing nodes. In step 1206, the client application receives and returns the processed audio clips, labeled with stem and/or model names. In some implementations, the client application (1208) may also facilitate selecting stems to construct the multi-model scheme (1212) disclosed herein.

Implementations of a machine learning multi-model scheme processing system will now be described with reference to FIGS. 13a-13e. In some implementations, a user may wish to separate targeted media into a set of source classes/stems in a single step, using a selection of source separation models in a particular order and hierarchical combination. A sequential/branching source separation scheme schema is implemented to process the targeted media with one or more source separation models in a sequential/branching structured order, so as to separate the targeted media into a set of user-selected source classes or stems.

In the implementation (1300) of FIG. 13a, each step of the scheme represents a source separation model, or a combination of models, targeting a particular source class. The scheme is defined according to the scheme schema and includes appropriately trained models for performing the steps that yield the stem outputs the user wants. As shown in the exemplary implementation, the user may choose to separate hiss first, then voice (later split into vocals and other speech), drums (later split into kick, snare, and other percussion), organ, piano, and bass, followed by other processing. Thus, by processing the targeted media through the pipeline defined by the scheme, the outputs of the various steps are collected and the targeted media is ultimately separated into the set of user-selected source classes or stems.
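A sequential/branching scheme of this kind can be sketched as a nested list walked recursively. The data layout below (tuples of a stem name and an optional branch), the stem names, and `run_scheme` itself are hypothetical, and each model is assumed to be a callable that returns its stem from the signal it is given, with the complement formed by subtraction.

```python
# Each step is (stem_name, branch). A branch is None, or (sub_steps, leftover_name)
# when the extracted stem is itself split further. Names are illustrative only.
SCHEME = [
    ("hiss", None),
    ("voice", ([("vocals", None)], "other_speech")),
    ("drums", ([("kick", None), ("snare", None)], "other_percussion")),
]


def run_scheme(signal, steps, models, rest_name):
    """Walk the scheme in order and collect the named stem outputs."""
    outputs = {}
    remaining = list(signal)
    for stem_name, branch in steps:
        stem = models[stem_name](remaining)
        # The complement of this step becomes the input of the next step.
        remaining = [r - s for r, s in zip(remaining, stem)]
        if branch is None:
            outputs[stem_name] = stem
        else:
            sub_steps, sub_rest = branch
            outputs.update(run_scheme(stem, sub_steps, models, sub_rest))
    outputs[rest_name] = remaining   # whatever no step claimed
    return outputs
```

Calling `run_scheme(mixture, SCHEME, models, "other")` with one trained model per stem name would return a dictionary of stems whose sum reconstructs the input, since every step passes its complement on.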

Referring to FIG. 13b, an example voice/drums/other sequential processing pipeline (1320) is illustrated. In this pipeline, the input mixture is first processed to extract voice, and the complement contains the drums and other sounds in the mixture. The drum model then extracts the drums, and its complement contains the other sounds. In this implementation, the output includes voice, drums, and other stems. When separating source classes with a sequential/branching separation system, the order in which the models are applied can be optimized for higher quality using an algorithm that evaluates the optimal processing order. An exemplary implementation of an optimized processing method (1340) is illustrated in FIG. 13c. For example, for an input mixture containing components A and B, the pipeline can be configured to separate class A first and then separate B from the remainder. Reversing the processing order (e.g., separating B and then A) may produce different results. The optimized processing method (1340) can operate on an input mixture of A+B by separating A and B in both orders, comparing the results, and selecting the order with the best results (e.g., the order that yields less error in the separated stems). In various implementations, the optimized processing method (1340) can operate manually, automatically, and/or in a hybrid approach. For example, the estimated optimal order can be pre-established using a set of ground truth test samples: test mixtures are generated from the samples and separated using various permutations of the stem order, and the estimated best-performing model processing order is then established using an error function that compares the outputs of these tests to the ground truth test samples.
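The order-selection idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names (`separate_sequentially`, `order_error`, `best_order`), the mean-squared-error metric, and the two toy "models" are all hypothetical stand-ins for trained separation models and for whatever error function an implementation actually uses.

```python
from itertools import permutations

import numpy as np


def separate_sequentially(mixture, models, order):
    """Apply one model per source class in `order`; each model sees the
    residual (complement) left by the models applied before it."""
    residual = mixture.copy()
    stems = {}
    for name in order:
        stem = models[name](residual)
        stems[name] = stem
        residual = residual - stem
    stems["other"] = residual  # whatever no model claimed
    return stems


def order_error(stems, truth):
    """Mean squared error over the ground-truth stems (assumed metric)."""
    return float(np.mean([np.mean((stems[k] - truth[k]) ** 2) for k in truth]))


def best_order(test_cases, models):
    """Try every permutation of the models on ground-truth test mixtures
    and return the order with the lowest average error, plus all scores."""
    scores = {}
    for order in permutations(models):
        errs = [order_error(separate_sequentially(mix, models, order), truth)
                for mix, truth in test_cases]
        scores[order] = float(np.mean(errs))
    return min(scores, key=scores.get), scores


# Hypothetical toy models: model_a is exact on the first two samples,
# model_b wrongly keeps half of class A when it runs first.
def model_a(x):
    out = np.zeros_like(x)
    out[:2] = x[:2]
    return out


def model_b(x):
    out = x.copy()
    out[:2] *= 0.5
    return out


a = np.array([2.0, 2.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 3.0, 3.0])
test_cases = [(a + b, {"a": a, "b": b})]
models = {"a": model_a, "b": model_b}
best, scores = best_order(test_cases, models)
```

In this toy case, separating A first leaves B a clean residual, so that order scores lower error than the reverse. An exhaustive search over permutations grows factorially with the number of source classes, so it suits a pre-established ordering on a small test set rather than a per-file search.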

An exemplary pipeline (1360) for improving the output fidelity of a sequential/branching source separation system (e.g., as previously described herein) will now be described with reference to FIG. 13d. For example, the output fidelity may be measured using a MOS algorithm that can measure how far a stem output differs from a particular labeled dataset, such as a collection of speech samples of an individual. In some implementations, such an algorithm may be implemented as a neural network pre-trained to classify a source or to measure how far a source differs from a given dataset.

The processing scheme (see 5.2.1) may include one or more post-processing steps following one or more outputs in the pipeline. A post-processing step may include any type of digital signal processing filter or algorithm that cleans up signal artifacts or noise that may have been introduced in previous steps. The pipeline (1360) uses models specifically trained to clean up artifacts (e.g., as described in step 4.2 herein), which significantly improves the overall results, particularly because of the sequential nature of the processing scheme pipeline.

Referring to FIG. 13e, an exemplary implementation (1380) combines models to separate a particular source class. The sequential/branching separation processing scheme (see 5.2.1) may include steps in which one or more models are used in combination to extract a source class that a single model trained to target that source class cannot fully extract from the mixture. In the illustrated example, the drums are separated twice and then summed into a single "Drums" output stem, presented alongside a complementary "Other" stem.
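A minimal sketch of this model combination is given below. It assumes that each model in the combination is applied to the residual left by the previous one, which the text above does not specify; `combine_models` and `take_index` are hypothetical names, and the toy models simply pick out one sample each in place of trained drum separators.

```python
import numpy as np


def combine_models(mixture, models):
    """Run several models that target the same source class, sum their
    outputs into one stem, and return it with the complementary residual.
    Assumption: each model runs on what the previous models left behind."""
    residual = np.asarray(mixture, dtype=float).copy()
    target = np.zeros_like(residual)
    for model in models:
        part = model(residual)
        target += part      # accumulate the single output stem
        residual -= part    # the complement shrinks as more is extracted
    return target, residual


def take_index(i):
    """Hypothetical toy 'model' that extracts only sample i."""
    def model(x):
        out = np.zeros_like(x)
        out[i] = x[i]
        return out
    return model


mixture = np.array([1.0, 2.0, 3.0, 4.0])
drums, other = combine_models(mixture, [take_index(0), take_index(1)])
```

Because every extracted part is subtracted from the residual, the "Drums" and "Other" stems always sum back to the input mixture in this sketch.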

An exemplary audio processing system (1400) for implementing the systems and methods disclosed herein will now be described with reference to FIG. 14. The audio processing system (1400) includes a logic device (1402), memory (1404), communication components (1422), a display (1418), a user interface (1420), and data storage (1430).

The logic device (1402) may include, for example, a microprocessor, a single-core processor, a multi-core processor, a microcontroller, a programmable logic device configured to perform processing operations, a DSP device, one or more memories for storing executable instructions (e.g., software, firmware, or other instructions), a graphics processing unit, and/or any other suitable combination of processing devices and/or memories configured to execute instructions for performing any of the various operations described herein. The logic device (1402) is configured to interface and communicate with the various components of the audio processing system (1400), including the memory (1404), communication components (1422), display (1418), user interface (1420), and data storage (1430).

The communication components (1422) may include wired and wireless communication interfaces that facilitate communication with a network or remote system. The wired communication interfaces may be implemented as one or more physical networks or as device connection interfaces, such as cables or other wired communication interfaces. The wireless communication interfaces may be implemented as one or more Wi-Fi, Bluetooth, cellular, infrared, radio, and/or other types of network interfaces for wireless communications. The communication components (1422) may include an antenna for wireless communications during operation.

The display (1418) may include an image display device (e.g., a liquid crystal display (LCD)) or various other types of commonly known video displays or monitors. The user interface (1420) may, in various implementations, include a user input and/or interface device, such as a keyboard, a control panel unit, a graphical user interface, or other user input/output. The display (1418) may operate as both a user input device and a display device, for example as a touch screen device adapted to receive input signals from a user touching different portions of the display screen.

The memory (1404) stores program instructions for execution by the logic device (1402), including program logic for implementing the systems and methods disclosed herein, including but not limited to audio source separation tools (1406), core model manipulations (1408), machine learning training (1410), trained audio separation models (1412), audio processing applications (1414), and self-iterative processing/training logic (1416). Data used by the audio processing system (1400) may be stored in the memory (1404) and/or the data storage (1430), and may include machine learning speech datasets (1432), machine learning music/noise datasets (1434), audio stems (1436), audio mixtures (1438), and/or other data.

In some implementations, one or more of the processes may be implemented via a remote processing system, such as a cloud platform, which may itself be implemented as an audio processing system (1400) as described herein.

FIG. 15 illustrates an example neural network that may be used in one or more of the implementations of FIGS. 1-14, including the various RNNs and models described herein. The neural network (1500) is implemented as a recurrent neural network, a deep neural network, a convolutional neural network, or other suitable neural network that receives a labeled training dataset (1510) and generates an audio output (1512) (e.g., one or more audio stems) for each input audio sample. In various implementations, the labeled training dataset (1510) may include various audio samples and training mixtures as described herein, such as the training datasets (FIG. 3), a self-iterative training dataset or culled dataset (FIG. 4), the training mixtures and datasets described in connection with the training methods disclosed herein (FIGS. 5-13e), or other suitable training datasets.

The training process for generating a trained neural network model includes a forward pass through the neural network (1500) to generate an audio stem or other desired audio output (1512). Each data sample is labeled with the desired output of the neural network (1500), which is compared to the audio output (1512). In some implementations, a cost function is applied to quantify the error in the audio output (1512), and a backward pass through the neural network (1500) can be used to adjust the neural network coefficients to minimize the output error.

The trained neural network (1500) can then be tested for accuracy using a subset of the labeled training data (1510) set aside for validation. The trained neural network (1500) can be implemented as a model in a runtime environment to perform audio source separation as described herein.

In various implementations, the neural network (1500) processes input data (e.g., audio samples) using an input layer (1520). In some examples, the input data may correspond to the audio samples and/or audio input previously described herein.

The input layer (1520) includes a plurality of neurons used to condition the input audio data for input to the neural network (1500), which may include feature extraction, scaling, sampling rate conversion, and the like. Each of the neurons in the input layer (1520) generates an output that is fed to the inputs of one or more hidden layers (1530). The hidden layer (1530) includes a plurality of neurons that process the outputs of the input layer (1520). In some examples, each of the neurons in the hidden layer (1530) generates an output, and these outputs are then collectively propagated through additional hidden layers, each including a plurality of neurons that process the outputs of the previous hidden layers. The outputs of the hidden layer(s) (1530) are fed to an output layer (1540). The output layer (1540) includes one or more neurons used to condition the output from the hidden layer (1530) to generate the desired output. It should be understood that the architecture of the neural network (1500) is representative only and that other architectures are possible, including a neural network having only one hidden layer, a neural network without an input layer and/or output layer, a neural network with recurrent layers, and the like.

In some examples, each of the input layer (1520), the hidden layer(s) (1530), and/or the output layer (1540) includes one or more neurons. In some examples, each of the input layer (1520), the hidden layer(s) (1530), and/or the output layer (1540) can include the same number or a different number of neurons. In some examples, each of the neurons takes a combination of its inputs x (e.g., a weighted sum using a trainable weight matrix W), adds an optional trainable bias b, and applies an activation function f to produce an output a, as represented by the equation a = f(Wx + b). In some examples, the activation function f can be a linear activation function, an activation function with upper and/or lower bounds, a log-sigmoid function, a hyperbolic tangent function, a rectified linear unit function, or the like. In some examples, each of the neurons can have the same or a different activation function.
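As a small worked instance of a = f(Wx + b), the sketch below evaluates one layer of two neurons on a two-element input. The numeric values and the helper names `relu` and `dense` are illustrative assumptions only; the rectified linear unit is used because it is one of the activation functions listed above.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return np.maximum(z, 0.0)


def dense(x, W, b, f):
    """One layer of neurons: weighted sum of inputs, plus bias, then activation."""
    return f(W @ x + b)


# Illustrative values (not from the disclosure): 2 inputs, 2 neurons.
W = np.array([[1.0, -1.0],
              [0.5, 0.5]])
b = np.array([0.0, -1.0])
x = np.array([2.0, 3.0])

# Wx = [-1.0, 2.5]; Wx + b = [-1.0, 1.5]; relu gives [0.0, 1.5]
a = dense(x, W, b, relu)
```

Swapping `relu` for `np.tanh` gives the hyperbolic tangent variant without changing the rest of the layer.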

In some examples, the neural network (1500) may be trained using supervised learning, in which each training data combination pairs input data with ground truth (e.g., expected) output data. Differences between the generated audio output (1512) and the ground truth output data (e.g., the label) are fed back into the neural network (1500) to modify its various trainable weights and biases. In some examples, the differences may be fed back using a backpropagation technique, such as one using a stochastic gradient descent algorithm.

In some examples, a large set of training data combinations may be presented to the neural network (1500) multiple times until the overall cost function (e.g., a mean squared error based on the differences for each training combination) converges to an acceptable level.
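The supervised training loop described in the preceding paragraphs can be sketched on a deliberately tiny problem. The example below is an assumption-laden illustration rather than the disclosed training procedure: it fits a single linear layer to synthetic data with mini-batch stochastic gradient descent and a mean-squared-error cost, and the learning rate, batch size, tolerance, and data are all made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "training combinations": inputs X paired with ground-truth outputs Y.
X = rng.normal(size=(256, 4))
W_true = np.array([[1.0, -2.0, 0.5, 0.0]])
Y = X @ W_true.T                      # shape (256, 1)

# Trainable weights and bias, initialised to zero.
W = np.zeros((1, 4))
b = np.zeros(1)
lr, batch, tol = 0.05, 32, 1e-6       # assumed hyperparameters


def mse(W, b):
    """Overall cost: mean squared error over the whole training set."""
    return float(np.mean((X @ W.T + b - Y) ** 2))


for epoch in range(200):
    if mse(W, b) < tol:               # stop once the cost is acceptable
        break
    idx = rng.permutation(len(X))     # present the data in a new order
    for start in range(0, len(X), batch):
        sel = idx[start:start + batch]
        err = X[sel] @ W.T + b - Y[sel]            # forward pass and difference
        grad_W = 2.0 * (err.T @ X[sel]) / len(sel)  # backward pass (gradients)
        grad_b = 2.0 * err.mean(axis=0)
        W -= lr * grad_W
        b -= lr * grad_b

final_loss = mse(W, b)
```

The same structure (forward pass, cost, backward pass, parameter update, repeated until the cost converges) carries over to a deep network, where an automatic differentiation framework would normally compute the gradients.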

Example implementations include:

1. An audio processing system comprising a deep neural network (DNN) trained to separate one or more audio source signals from a single-track audio mixture.

2. The audio processing system of example 1, wherein the DNN is configured to receive a signal input and generate a signal output without time domain encoding and/or time domain decoding.

3. The audio processing system of examples 1-2, wherein the DNN is configured to apply a windowing function.

4. The audio processing system of examples 1-3, wherein the DNN performs an overlap-add process to smooth banding artifacts.

5. The audio processing system of examples 1-4, wherein audio source separation is performed without applying a mask.

6. The audio processing system of examples 1-5, wherein the DNN model is trained using a sample rate of 48 kHz.

7. The audio processing system of examples 1-6, wherein the signal processing pipeline operates at 48 kHz.

8. The audio processing system of examples 1-7, further comprising a separation strength parameter that controls the strength of the separation process applied to an input audio signal.

9. The audio processing system of examples 1-8, further comprising a speech training dataset including a plurality of labeled speech samples.

10. The audio processing system of examples 1-9, further comprising a non-speech training dataset including a plurality of labeled music and/or noise data samples.

11. The audio processing system of examples 1-10, further comprising a dataset generation module configured to generate labeled audio samples for use in training the DNN model.

12. The audio processing system of examples 1-11, wherein the dataset generation module is a self-iterative dataset generator.

13. The audio processing system of examples 1-12, wherein the dataset generation module is configured to generate labeled audio samples from an input audio mixture and to generate audio source stems output from the DNN.

14. The audio processing system of examples 1-13, further comprising a data loader configured to apply pre/post-mixture augmentation.

15. The audio processing system of examples 1-14, wherein the DNN is trained at frequencies higher than the audible frequencies to recognize individual stems of audio in the lower, audible frequency range.

16. The audio processing system of examples 1-15, wherein the data loader is configured to apply augmentations such as reverberation and filter probability parameters.

17. The audio processing system of examples 1-16, further configured to match the sound qualities of target mixtures unseen during training based on relative signal levels.

18. An example method comprising:

processing audio input data using an inference model trained for source separation to generate source-separated stems;

generating a speech dataset from the source-separated stems;

generating a noise dataset from the source-separated stems; and

training the inference model using the speech dataset and the noise dataset to produce an updated inference model.

19. The method of example 18, further comprising processing the audio input data using the updated inference model.

20. The method of examples 18-19, further comprising iteratively updating the updated inference model.

21. The method of examples 19-20, wherein the training dataset is curated to include samples that approximate the audio sources.

22. The method of examples 19-21, comprising a hierarchical mix bus schema.

23. The method of examples 19-22, wherein the inference model is trained using multiple speech source mixtures.

24. The method of examples 19-23, wherein the inference model is trained using foreground voices, background voices, and/or distant voices.

25. The method of examples 19-24, wherein the inference model is trained at a first sample rate and upscaled to a higher sample rate.

26. The method of examples 19-25, further comprising post-processing the separated audio source stems to remove artifacts introduced by the source separation process.

27. The method of examples 19-26, wherein the source-separated stems include a separated source signal and a remaining complement signal.

28. The method of examples 19-27, wherein the artifacts introduced during the source separation process include clicks, harmonic distortion, ghosting, and/or broadband noise.

29. The method of examples 19-28, wherein the fine-tuning process includes user-guided self-iterative processing.

30. The method of examples 19-29, further comprising facilitating user-guided augmentations for fine-tuning the training.

31. A system comprising: an audio input configured to receive an audio input stream comprising a mixture of audio signals generated from a plurality of audio sources; a trained audio source separation model configured to receive the audio input stream and generate a plurality of generated audio stems, wherein the generated plurality of audio stems correspond to one or more of the plurality of audio sources; and a self-iterative training system configured to update the trained audio source separation model into an updated audio source separation model based at least in part on the generated plurality of audio stems, wherein the updated audio source separation model is configured to reprocess the audio input stream to generate a plurality of enhanced audio stems.

32. The system of example 31, wherein the audio input stream comprises one or more single-track audio mixtures, and the trained audio source separation model comprises a neural network trained to separate one or more audio source signals from the one or more single-track audio mixtures.

33. The system of examples 31-32, wherein the neural network is configured to perform audio source separation without applying a mask.

34. The system of examples 31-33, further comprising a training dataset including labeled source audio data and labeled noise audio data, wherein the trained audio source separation model is trained using the training dataset to generate a general source separation model.

35. The system of examples 31-34, wherein at least a subset of the generated plurality of audio stems is culled based on a threshold metric and added to the training dataset to form a culled dynamically evolving dataset, and wherein the culled dynamically evolving dataset is used to train the updated audio source separation model.

36. The system of examples 31-35, wherein the self-iterative training system is further configured to compute a first quality metric associated with the generated plurality of audio stems, the first quality metric providing a first performance measure of the trained audio source separation model, wherein the self-iterative training system is further configured to compute a second quality metric associated with the enhanced audio stems, the second quality metric providing a second performance measure of the updated audio source separation model, and wherein the second quality metric is greater than the first quality metric.

37. The system of examples 31-36, wherein the trained audio source separation model is trained using a training dataset including a plurality of datasets, each of the plurality of datasets including labeled audio samples configured to train the system to solve a source separation problem.

38. The system of examples 31-37, wherein the plurality of datasets includes a speech training dataset including a plurality of labeled speech samples, and/or a non-speech training dataset including a plurality of labeled music and/or noise data samples.

39. The system of examples 31-38, wherein the self-iterative training system further comprises a self-iterative dataset generation module configured to generate labeled audio samples from the generated plurality of audio stems.

40. The system of examples 31-39, wherein the plurality of enhanced audio stems is generated using a hierarchical branching sequence that includes separating a source signal and a remaining complement signal.

41. 복수의 오디오 소스들로부터 생성된 오디오 신호들의 믹스처를 포함하는 오디오 입력 스트림을 수신하는 단계, 오디오 입력 스트림을 수신하도록 구성된 훈련된 오디오 소스 분리 모델을 사용하여 복수의 오디오 소스들 중 하나 이상의 오디오 소스들에 대응하는 생성된 복수의 오디오 스템들을 생성하는 단계, 자기 반복 훈련 프로세스를 사용하여 훈련된 오디오 소스 분리 모델을 적어도 부분적으로 생성된 복수의 오디오 스템들에 기초하여 업데이트된 오디오 소스 분리 모델로 업데이트하는 단계, 업데이트된 오디오 소스 분리 모델을 사용하여 오디오 입력 스트림을 재처리하여 복수의 향상된 오디오 스템들을 생성하는 단계를 포함하는, 방법.41. A method, comprising: receiving an audio input stream comprising a mixture of audio signals generated from a plurality of audio sources; generating a plurality of generated audio stems corresponding to one or more of the plurality of audio sources using a trained audio source separation model configured to receive the audio input stream; updating the trained audio source separation model using a self-recurring training process with an updated audio source separation model based at least in part on the plurality of generated audio stems; and reprocessing the audio input stream using the updated audio source separation model to generate a plurality of enhanced audio stems.

42. 예 41의 방법에 있어서, 오디오 입력 스트림은 하나 이상의 단일 트랙 오디오 믹스처들을 포함하며, 훈련된 오디오 소스 분리 모델은 하나 이상의 단일 트랙 오디오 믹스처들로부터 하나 이상의 오디오 소스 신호들을 분리하도록 훈련된 신경망을 포함하는, 방법.42. A method according to example 41, wherein the audio input stream comprises one or more single-track audio mixtures, and wherein the trained audio source separation model comprises a neural network trained to separate one or more audio source signals from the one or more single-track audio mixtures.

43. 예 41-42의 방법에 있어서, 신경망은 마스크를 적용하지 않고 오디오 소스 분리를 수행하도록 구성되는, 방법.43. A method according to any one of claims 41 to 42, wherein the neural network is configured to perform audio source separation without applying a mask.

44. 예 41-43의 방법에 있어서, 라벨링된 소스 오디오 데이터 및 라벨링된 노이즈 오디오 데이터를 포함하는 훈련 데이터세트를 제공하는 단계를 더 포함하고, 훈련 데이터세트를 사용하여 훈련된 오디오 소스 분리 모델을 훈련시켜 일반 소스 분리 모델을 생성하는 단계를 더 포함하는, 방법.44. A method according to any one of claims 41 to 43, further comprising the step of providing a training dataset including labeled source audio data and labeled noise audio data, and further comprising the step of training an audio source separation model using the training dataset to generate a general source separation model.

45. 예 41-44의 방법에 있어서, 동적으로 진화하는 데이터세트를 생성하기 위해 생성된 복수의 오디오 스템들의 적어도 서브세트를 훈련 데이터세트에 추가하는 단계, 임계값 메트릭에 기초하여 동적으로 진화하는 데이터세트를 컬링하는 단계, 및 컬링된 동적으로 진화하는 데이터세트를 사용하여 업데이트된 오디오 소스 분리 모델을 훈련시키는 단계를 더 포함하는, 방법.45. A method according to any one of claims 41-44, further comprising the steps of adding at least a subset of the plurality of audio stems generated to generate a dynamically evolving dataset to a training dataset, culling the dynamically evolving dataset based on a threshold metric, and training an updated audio source separation model using the culled dynamically evolving dataset.

46. 예 41-45의 방법에서, 상기 자기 반복 훈련 프로세스는 생성된 복수의 오디오 스템들과 연관된 제 1 품질 메트릭을 계산하는 단계로서, 제 1 품질 메트릭은 훈련된 오디오 소스 분리 모델의 제 1 성능 측정치를 제공하는, 상기 제 1 품질 메트릭을 계산하는 단계, 향상된 오디오 스템들과 연관된 제 2 품질 메트릭을 계산하는 단계로서, 제 2 품질 메트릭은 업데이트된 오디오 소스 분리 모델의 성능 측정치를 제공하는, 상기 제 2 품질 메트릭을 계산하는 단계, 및 제 2 품질 메트릭을 제 1 품질 메트릭과 비교하여 제 2 품질 메트릭이 제 1 품질 메트릭보다 큰 것을 확인하는 단계를 더 포함하는, 방법.46. The method of any of claims 41-45, wherein the self-repetitive training process further comprises: computing a first quality metric associated with the generated plurality of audio stems, wherein the first quality metric provides a first performance measure of the trained audio source separation model; computing a second quality metric associated with the enhanced audio stems, wherein the second quality metric provides a performance measure of the updated audio source separation model; and comparing the second quality metric to the first quality metric to determine that the second quality metric is greater than the first quality metric.

47. The method of any of Examples 41-46, wherein the trained audio source separation model is trained using a training dataset comprising a plurality of datasets, each of the plurality of datasets comprising labeled audio samples configured to train the audio source separation model to solve a different source separation problem.

48. The method of any of Examples 41-47, wherein the plurality of datasets comprises a speech training dataset comprising a plurality of labeled speech samples, and/or a non-speech training dataset comprising a plurality of labeled music and/or noise data samples.

49. The method of any of Examples 41-48, wherein the self-iterative training process further comprises generating, from the generated plurality of audio stems, labeled audio samples for a self-iterative dataset.

50. The method of any of Examples 41-49, further comprising generating the plurality of enhanced audio stems using a hierarchical branching sequence comprising separating a source signal from a remaining complementary signal.
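
One way to picture the hierarchical branching sequence of Example 50 is as a chain of two-way splits, each separating one source and passing the remaining complementary signal to the next branch. The sketch below assumes the complement is the input minus the estimated source, which the disclosure does not state; the separator callables are oracle stand-ins for trained models, and all names and signal values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Signal = List[float]


def subtract(a: Signal, b: Signal) -> Signal:
    """Element-wise difference of two equal-length signals."""
    return [x - y for x, y in zip(a, b)]


def hierarchical_separate(
    mixture: Signal,
    branches: List[Tuple[str, Callable[[Signal], Signal]]],
) -> Dict[str, Signal]:
    """Split off one source per branch; the complement feeds the next branch."""
    stems: Dict[str, Signal] = {}
    remainder = mixture
    for name, estimate_source in branches:
        source = estimate_source(remainder)      # source signal for this branch
        stems[name] = source
        remainder = subtract(remainder, source)  # assumed complementary signal
    stems["residual"] = remainder                # whatever no branch claimed
    return stems


# Toy signals; the mixture is their element-wise sum.
speech = [1.0, 2.0, 3.0]
music = [0.5, 0.5, 0.5]
noise = [0.25, 0.0, 0.25]
mixture = [s + m + n for s, m, n in zip(speech, music, noise)]

# Oracle separators standing in for trained separation models.
stems = hierarchical_separate(
    mixture,
    [("speech", lambda _: speech), ("music", lambda _: music)],
)
```

With these oracle separators the residual equals the noise component; a real model would only approximate each source, so estimation error would accumulate in later branches.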

Where applicable, the various implementations provided by the present disclosure may be implemented using hardware, software, or combinations of hardware and software. Also, where applicable, the various hardware components and/or software components described herein may be combined into composite components comprising software, hardware, and/or both without departing from the spirit of the present disclosure. Where applicable, the various hardware components and/or software components described herein may be separated into sub-components comprising software, hardware, or both without departing from the spirit of the present disclosure.

Software in accordance with the present disclosure, such as non-transitory instructions, program code, and/or data, may be stored on one or more non-transitory machine-readable media. It is also contemplated that software identified herein may be implemented using one or more general-purpose or special-purpose computers and/or computer systems, networked and/or otherwise. Where applicable, the order of the various steps described herein may be changed, and the steps may be combined into composite steps and/or separated into sub-steps to provide the features described herein. The implementations described above illustrate but do not limit the invention. It should also be understood that numerous modifications and variations are possible in accordance with the principles of the present disclosure.