# ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, "ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription". The paper was accepted with the highest reviewer scores: 4/4/4.
ChunkFormer is an ASR model designed for processing long audio inputs effectively on low-memory GPUs. It uses a chunk-wise processing mechanism with relative right context and employs the Masked Batch technique to minimize memory waste due to padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.
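As a rough illustration of the chunk-wise mechanism with left and right context (a hypothetical sketch, not part of the chunkformer API), each chunk of frames attends over its own span plus a bounded context window on either side:

```python
# Hypothetical helper illustrating chunk-wise attention windows.
# Not library code: chunkformer implements this inside its encoder.
def chunk_windows(n_frames, chunk_size, left_context, right_context):
    """Yield ((chunk_start, chunk_end), (window_start, window_end)) pairs.

    The chunk is the span of frames being processed; the window is the
    span of frames it may attend to, clipped to the sequence bounds.
    """
    for chunk_start in range(0, n_frames, chunk_size):
        chunk_end = min(chunk_start + chunk_size, n_frames)
        window_start = max(0, chunk_start - left_context)
        window_end = min(n_frames, chunk_end + right_context)
        yield (chunk_start, chunk_end), (window_start, window_end)

for chunk, window in chunk_windows(n_frames=256, chunk_size=64,
                                   left_context=128, right_context=128):
    print(chunk, window)
```

Because each window is bounded regardless of total audio length, peak memory stays flat while the sequence grows.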
- Transcribing Extremely Long Audio: ChunkFormer can transcribe audio recordings up to 16 hours in length with results comparable to existing models. It is currently the first model capable of handling this duration.
- Efficient Decoding on Low-Memory GPUs: ChunkFormer can handle long-form transcription on GPUs with limited memory without losing context or mismatching the training phase.
- Masked Batching Technique: ChunkFormer efficiently removes the need for padding in batches with highly variable lengths. For instance, decoding a batch containing audio clips of 1 hour and 1 second costs only 1 hour + 1 second of computational and memory usage, instead of 2 hours due to padding.
| GPU Memory | Total Batch Duration (minutes) |
|---|---|
| 80GB | 980 |
| 24GB | 240 |
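The padding savings described above can be sketched with simple arithmetic (a minimal illustration, not library code):

```python
def padded_cost(durations):
    # Conventional batching: every clip is padded to the longest one,
    # so cost scales with max duration times batch size.
    return max(durations) * len(durations)

def masked_cost(durations):
    # Masked batching: cost scales only with the audio actually present.
    return sum(durations)

clips = [3600.0, 1.0]  # a 1-hour clip and a 1-second clip, in seconds
print(padded_cost(clips))  # 7200.0 seconds of compute (two hours)
print(masked_cost(clips))  # 3601.0 seconds (one hour + one second)
```

The more the clip lengths vary within a batch, the larger the gap between the two costs.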
```shell
pip install chunkformer
```
```shell
# Clone the repository
git clone https://github.com/your-username/chunkformer.git
cd chunkformer

# Install in development mode
pip install -e .
```
| Language | Model |
|---|---|
| Vietnamese | |
| Vietnamese | |
| English | |
```python
from chunkformer import ChunkFormerModel
import torch

device = "cuda:0"

# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie").to(device)

x, x_len = model._load_audio_and_extract_features("path/to/audio")  # x: (T, F), x_len: int
x = x.unsqueeze(0).to(device)
x_len = torch.tensor([x_len], device=device)

# Extract feature
feature, feature_len = model.encode(
    xs=x,
    xs_lens=x_len,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
)
print("feature: ", feature.shape)
print("feature_len: ", feature_len)
```
ChunkFormer also supports speech classification tasks (e.g., gender, dialect, emotion, age recognition).
```python
from chunkformer import ChunkFormerModel

# Load a pre-trained classification model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("path/to/classification/model")

# Single audio classification
result = model.classify_audio(
    audio_path="path/to/audio.wav",
    chunk_size=-1,  # -1 for full attention
    left_context_size=-1,
    right_context_size=-1,
)
print(result)
```
```python
from chunkformer import ChunkFormerModel

# Load a pre-trained encoder from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True,
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800,  # Total batch duration in seconds
)
for i, transcription in enumerate(transcriptions):
    print(f"Audio {i + 1}: {transcription}")
```
To test the model with a single long-form audio file (accepted extensions: ".mp3", ".wav", ".flac", ".m4a", ".aac"):
```shell
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_file path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```

The `data.tsv` file must have at least one column named `wav`. Optionally, a column named `txt` can be included to compute the Word Error Rate (WER). Output will be saved to the same file.
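A minimal `data.tsv` with both columns could be produced like this (a sketch using the standard library; the file names and transcripts are placeholders):

```python
import csv

# Placeholder rows: "wav" is required, "txt" is optional (enables WER)
rows = [
    {"wav": "audio1.wav", "txt": "hello world"},
    {"wav": "audio2.wav", "txt": "testing the long-form audio"},
]

with open("data.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["wav", "txt"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```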
```shell
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_list path/to/data.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
WER: 0.1234
```

To classify a single audio file:
```shell
chunkformer-decode \
    --model_checkpoint path/to/classification/model \
    --audio_file path/to/audio.wav
```
See the 🚀 Training Guide 🚀 for complete documentation.
If you use this work in your research, please cite:
```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}
```
This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.