A simple package for Guided Source Separation (GSS)
Paper: https://arxiv.org/abs/2212.05271
Guided source separation is a type of blind source separation (blind = no training required) in which the mask estimation is guided by a diarizer output. The original method was proposed for the CHiME-5 challenge in this paper by Boeddeker et al.
It is a kind of target-speaker extraction method. The inputs to the model are:
- A multi-channel recording, e.g., from an array microphone, of a long, unsegmented, multi-talker session (possibly with overlapping speech)
- An RTTM file containing speaker segment boundaries

The system produces enhanced audio for each of the segments in the RTTM, removing the background speech and noise and "extracting" only the target speaker in the segment.
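As an illustration of the RTTM input format, here is a minimal parser sketch (illustrative only; the package itself handles RTTMs via Lhotse, and the function name is hypothetical):

```python
# Minimal sketch of reading RTTM speaker segments (illustrative only;
# this package uses Lhotse for the actual RTTM handling).

def parse_rttm(lines):
    """Return a list of (recording_id, speaker, start, end) tuples."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # RTTM columns: type, file, channel, onset, duration, ..., speaker name
        reco_id = fields[1]
        start, dur = float(fields[3]), float(fields[4])
        speaker = fields[7]
        segments.append((reco_id, speaker, start, start + dur))
    return segments

rttm = [
    "SPEAKER sessionA 1 0.50 2.30 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER sessionA 1 2.10 1.00 <NA> <NA> spk2 <NA> <NA>",
]
print(parse_rttm(rttm))
```

Each tuple corresponds to one segment that will be enhanced for its target speaker.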
This repository contains a GPU implementation of this method in Python, along with CLI binaries to run the enhancement from the shell. We also provide several example "recipes" for using the method.
The core components of the tool are borrowed from pb_chime5, but GPU support is added by porting most of the work to CuPy.
- All the main components of the pipeline (STFT computation, WPE, mask estimation with CACGMM, and beamforming) are ported to CuPy to use GPUs. For CACGMM, we batch all frequency indices instead of iterating over them.
- We have implemented batch processing of segments (see this issue for details) to maximize GPU memory usage and provide additional speed-up.
- The GSS implementation (see `gss/core`) has been stripped of CHiME-6 dataset-specific peculiarities (such as array naming conventions).
- We use Lhotse for simplified data loading, speaker activity generation, and RTTM representation. We provide examples in the `recipes` directory for how to use the `gss` module for several datasets.
- Inference can be run in a multi-node GPU environment, which makes it several times faster than the original CPU implementation.
- We provide both Python modules and a CLI for the enhancement functions, which can be easily included in recipes from Kaldi, Icefall, ESPNet, etc.
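The frequency-batching idea can be illustrated with plain NumPy as a stand-in for CuPy (all names and shapes here are illustrative, not from the package): instead of looping over the F frequency bins and accumulating statistics for each bin independently, the bins are stacked along a leading axis and processed in a single vectorized call, which is what makes GPU execution efficient.

```python
import numpy as np

# Illustrative sketch (NumPy as a stand-in for CuPy): computing a per-frequency
# spatial covariance statistic for all F bins at once instead of looping.
# Shapes follow the STFT convention (F frequency bins, T frames, D channels).
rng = np.random.default_rng(0)
F, T, D = 4, 10, 3
obs = rng.standard_normal((F, T, D)) + 1j * rng.standard_normal((F, T, D))

# Looped version: one D x D outer-product accumulation per frequency bin.
cov_loop = np.stack([
    np.einsum("td,te->de", obs[f], obs[f].conj()) / T for f in range(F)
])

# Batched version: a single einsum over the leading frequency axis.
cov_batch = np.einsum("ftd,fte->fde", obs, obs.conj()) / T

print(np.allclose(cov_loop, cov_batch))  # True
```

On a GPU, the batched form replaces F small kernel launches with one large one, which is where most of the speed-up comes from.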
As an example, applying GSS on a LibriCSS OV20 session (~10 min) took ~160 s on a single RTX 2080 GPU (with 12G memory). See the `test.pstats` for the profiling output.
Create a new Conda environment:
```bash
conda create -n gss python=3.8
```
Install CuPy as follows (see https://docs.cupy.dev/en/stable/install.html for the appropriate version for your CUDA):

```bash
pip install cupy-cuda102
```

NOTE 1: We recommend not installing the pre-release version (12.0.0rc1 at the time of writing), since there may be some issues with it.

NOTE 2: If you don't have cudatoolkit 10.2 installed, you can use conda, which will install it for you:

```bash
conda install -c conda-forge cupy=10.2
```
```bash
pip install git+http://github.com/desh2608/gss
```
```bash
git clone https://github.com/desh2608/gss.git && cd gss
pip install -e '.[dev]'
pre-commit install  # installs pre-commit hooks with style checks
```
For the simple case of target-speaker extraction given a multi-channel recording and an RTTM file denoting speaker segments, run the following:

```bash
export CUDA_VISIBLE_DEVICES=0
gss enhance recording \
  /path/to/sessionA.wav /path/to/rttm exp/enhanced_segs \
  --recording-id sessionA --min-segment-length 0.1 --max-segment-length 10.0 \
  --max-batch-duration 20.0 --num-buckets 2 -o exp/segments.jsonl.gz
```
See the `recipes` directory for usage examples. The main stages are as follows:

1. Prepare Lhotse manifests. See this list of corpora currently supported in Lhotse. You can also apply GSS to your own dataset by preparing it as Lhotse manifests.
2. If you are using an RTTM file to get segments (e.g., in CHiME-6 Track 2), convert the RTTMs to a Lhotse-style supervision manifest.
3. Create recording-level cut sets by combining the recording with its supervisions. These will be used to get speaker activities.
4. Trim the recording-level cut set into segment-level cuts. These are the segments that will actually be enhanced.
5. (Optional) Split the segments into as many parts as the number of GPU jobs you want to run. In the recipes, we submit the jobs through `qsub`, similar to Kaldi or ESPNet recipes. You can use the parallelization in those toolkits to use a different scheduler such as SLURM.
6. Run the enhancement on GPUs.

The following options can be provided:
- `--channels`: The channels to use for enhancement (comma-separated ints). By default, all channels are used.
- `--bss-iterations`: Number of iterations of the CACGMM inference.
- `--context-duration`: Context (in seconds) to include on both sides of the segment.
- `--min-segment-length`: Any segment shorter than this value will be removed. This is particularly useful when using segments from a diarizer output, since they often contain very small segments which are not relevant for ASR. A recommended setting is 0.1 s.
- `--max-segment-length`: Segments longer than this value will be chunked up. This is to prevent OOM errors, since the segment STFTs are loaded onto the GPU. We use a setting of 15 s in most cases.
- `--max-batch-duration`: Segments from the same speaker will be batched together to increase GPU efficiency. We used 20 s batches for enhancement on GPUs with 12G memory. For GPUs with larger memory, this value can be increased.
- `--max-batch-cuts`: This sets an upper limit on the number of cuts in a batch. To simulate segment-wise enhancement, set this to 1.
- `--num-workers`: Number of workers to use for data loading (default = 1). Use more if you increase `--max-batch-duration`.
- `--num-buckets`: Number of buckets to use for sampling. Batches are drawn from the same bucket (see Lhotse's `DynamicBucketingSampler` for details).
- `--enhanced-manifest/-o`: Path to the manifest file to write the enhanced cut manifest. This is useful when the supervisions need to be propagated to the enhanced segments, e.g., for downstream ASR tasks.
- `--profiler-output`: Optional path to an output stats file for profiling, which can be visualized using SnakeViz.
- `--force-overwrite`: Flag to force enhanced audio files to be overwritten.
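The interplay of `--max-batch-duration` and `--max-batch-cuts` can be sketched as a simple greedy batcher (illustrative only; the real sampling is done by Lhotse's `DynamicBucketingSampler`, and this function name is hypothetical):

```python
def make_batches(durations, max_batch_duration, max_batch_cuts=None):
    """Greedily group segment durations into batches subject to both caps."""
    batches, cur, cur_dur = [], [], 0.0
    for d in durations:
        over_dur = cur and cur_dur + d > max_batch_duration
        over_cuts = max_batch_cuts is not None and len(cur) >= max_batch_cuts
        if over_dur or over_cuts:
            batches.append(cur)
            cur, cur_dur = [], 0.0
        cur.append(d)
        cur_dur += d
    if cur:
        batches.append(cur)
    return batches

# A 20 s duration cap packs two ~7 s segments per batch:
print(make_batches([8.0, 7.0, 6.0, 5.0], max_batch_duration=20.0))
# [[8.0, 7.0], [6.0, 5.0]]

# Setting max_batch_cuts=1 simulates segment-wise enhancement:
print(make_batches([8.0, 7.0], max_batch_duration=20.0, max_batch_cuts=1))
# [[8.0], [7.0]]
```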
You can refer to, e.g., the AMI recipe for how to use this toolkit with multiple GPUs.
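Splitting the segments across GPU jobs (stage 5 above) amounts to partitioning a list of cuts into near-equal parts; a plain-Python sketch (the recipes use Lhotse's own split utilities for this, and this helper is hypothetical):

```python
def split_into_parts(items, num_parts):
    """Partition items into num_parts contiguous chunks of near-equal size."""
    n, rem = divmod(len(items), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        # The first `rem` parts get one extra item each.
        size = n + (1 if i < rem else 0)
        parts.append(items[start:start + size])
        start += size
    return parts

# 10 segments split across 4 GPU jobs:
print(split_into_parts(list(range(10)), 4))
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```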
NOTE: Your GPUs must be in Exclusive_Thread mode, otherwise this library may not work as expected and/or the inference time will greatly increase. This is especially important if you are using `run.pl`.
You can check the compute mode of GPU X using:

```bash
nvidia-smi -i X -q | grep "Compute Mode"
```
We also provide an automated tool for this called `gpu_check`, which takes as arguments the cmd used (e.g., `run.pl`) and the number of jobs:

```bash
$cmd JOB=1:$nj ${exp_dir}/${dset_name}/${dset_part}/log/enhance.JOB.log \
  gss utils gpu_check $nj $cmd \& gss enhance cuts \
    ${exp_dir}/${dset_name}/${dset_part}/cuts.jsonl.gz \
    ${exp_dir}/${dset_name}/${dset_part}/split$nj/cuts_per_segment.JOB.jsonl.gz \
    ${exp_dir}/${dset_name}/${dset_part}/enhanced \
    --bss-iterations $gss_iterations \
    --context-duration 15.0 \
    --use-garbage-class \
    --max-batch-duration 120 \
    ${affix} || exit 1
```
See again the AMI recipe or the CHiME-7 DASR GSS code.
What happens if I set `--max-batch-duration` too large?

The enhancement will still work, but you will see several warnings of the sort: "Out of memory error while processing the batch. Trying again with chunks." Internally, we have a fallback option to chunk up batches into increasingly smaller parts in case an OOM error is encountered (see `gss/core/enhancer.py`). However, this slows down processing, so we recommend reducing the batch size if you see this warning very frequently.
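The chunking fallback described above can be sketched as retry-with-halving (a simplified, hypothetical stand-in for the logic in `gss/core/enhancer.py`; `enhance_fn` and the use of `MemoryError` to simulate OOM are assumptions for illustration):

```python
def enhance_with_fallback(batch, enhance_fn, min_size=1):
    """Try a whole batch; on (simulated) OOM, split it in half and retry."""
    try:
        return enhance_fn(batch)
    except MemoryError:
        if len(batch) <= min_size:
            raise  # cannot split further; give up
        mid = len(batch) // 2
        return (enhance_with_fallback(batch[:mid], enhance_fn, min_size)
                + enhance_with_fallback(batch[mid:], enhance_fn, min_size))

# Simulated enhancer that "fits" at most 2 segments in memory at a time.
def fake_enhance(batch):
    if len(batch) > 2:
        raise MemoryError("out of memory")
    return [seg.upper() for seg in batch]

print(enhance_with_fallback(["a", "b", "c", "d", "e"], fake_enhance))
# ['A', 'B', 'C', 'D', 'E']
```

The extra retries are why frequent OOM warnings slow things down: each failed attempt wastes a full batch's worth of work.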
I am seeing "out of memory error" a lot. What should I do?

Try reducing `--max-batch-duration`. If you are enhancing a large number of very small segments, try providing `--max-batch-cuts` with some small value (e.g., 2 or 3). This is because batching together a large number of small segments requires memory overhead which can cause OOMs.
How to understand the format of output file names?

The enhanced wav files are named as `recoid-spkid-start_end.wav`, i.e., one wav file is generated for each segment in the RTTM. The "start" and "end" are padded to 6 digits; for example, 21.18 seconds is encoded as `002118`. This convention should be fine if your audio duration is under ~2.75 h (9999 s); otherwise, you should change the padding in `gss/core/enhancer.py`.
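The padding convention can be made concrete: times are written in centiseconds, zero-padded to 6 digits. A hypothetical helper (not part of the package) that reproduces the naming:

```python
def segment_wav_name(reco_id, spk_id, start, end):
    """Format recoid-spkid-start_end.wav with 6-digit centisecond padding."""
    start_enc = f"{int(round(start * 100)):06d}"
    end_enc = f"{int(round(end * 100)):06d}"
    return f"{reco_id}-{spk_id}-{start_enc}_{end_enc}.wav"

print(segment_wav_name("sessionA", "spk1", 21.18, 25.04))
# sessionA-spk1-002118_002504.wav
```

Six digits of centiseconds cover up to 9999.99 s, which is where the ~2.75 h limit comes from.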
How to solve the Lhotse AudioDurationMismatch error?

This error is raised when the audio files corresponding to different channels have different durations. This is often the case for multi-array recordings, e.g., CHiME-6. You can bypass this error by setting the `--duration-tolerance` option to some larger value (Lhotse's default is 0.025). For CHiME-6, we had to set this to 3.0.
How should I generate RTTMs required for enhancement?

For examples of how to generate RTTMs for guiding the separation, please refer to my diarizer toolkit.
How can I experiment with additional GSS parameters?

We have only made the most important parameters available in the top-level CLI. To play with other parameters, check out the `gss.enhancer.get_enhancer()` function.
How much speed-up can I expect to obtain?

Enhancing the CHiME-6 dev set required 1.3 hours on 4 GPUs, as opposed to the original implementation, which required 20 hours using 80 CPU jobs. This is an effective speed-up of 292.
Contributions for core improvements or new recipes are welcome. Please run the following before creating a pull request:

```bash
pre-commit install
pre-commit run  # Running linter checks
```
```
@inproceedings{Raj2023GPUacceleratedGS,
  title={GPU-accelerated Guided Source Separation for Meeting Transcription},
  author={Desh Raj and Daniel Povey and Sanjeev Khudanpur},
  year={2023},
  booktitle={InterSpeech}
}
```