# NeMoConformerASR-iOS

On-device speech-to-text for iOS/macOS powered by NVIDIA NeMo Conformer CTC Small (13M params). Pure Swift + CoreML implementation with automatic audio padding, chunking for long audio, and real-time recognition.
Swift library for speech recognition using the NVIDIA NeMo Conformer CTC model on iOS/macOS with CoreML.

## Features
- NVIDIA NeMo Conformer CTC Small model (13M parameters)
- VAD-based smart segmentation for long audio (powered by NeMoVAD-iOS)
- Returns both full text and timestamped segments
- Automatic audio padding for any duration
- Support for 5, 10, 15, and 20 second audio segments
- Pure Swift implementation with CoreML backend
## Requirements

- iOS 16.0+ / macOS 13.0+
- Xcode 15.0+
- Swift 5.9+
## Installation

Add the following to your `Package.swift`:
```swift
dependencies: [
    .package(url: "https://github.com/Otosaku/NeMoConformerASR-iOS.git", from: "1.1.0")
]
```
Note: Version 1.1.0+ includes VAD-based segmentation with timestamped results. For the previous API returning plain text, use version 1.0.0.
Or in Xcode: File → Add Package Dependencies → Enter repository URL.
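For reference, a complete minimal `Package.swift` might look like the following sketch. The target name `MyApp` is a placeholder, and the product name is assumed to match the library name:

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp",  // placeholder target/package name
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        .package(url: "https://github.com/Otosaku/NeMoConformerASR-iOS.git", from: "1.1.0")
    ],
    targets: [
        .target(
            name: "MyApp",
            // Product name assumed to match the library name
            dependencies: [.product(name: "NeMoConformerASR", package: "NeMoConformerASR-iOS")]
        )
    ]
)
```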
## Models

Download the CoreML models from Google Drive:
The archive contains:
- `conformer_encoder.mlmodelc` - Conformer encoder (30 MB)
- `conformer_decoder.mlmodelc` - CTC decoder (0.4 MB)
- `vocabulary.json` - BPE vocabulary (1024 tokens)
## Usage

```swift
import NeMoConformerASR

// Initialize with model paths
let asr = try NeMoConformerASR(
    encoderURL: Bundle.main.url(forResource: "conformer_encoder", withExtension: "mlmodelc")!,
    decoderURL: Bundle.main.url(forResource: "conformer_decoder", withExtension: "mlmodelc")!,
    vocabularyURL: Bundle.main.url(forResource: "vocabulary", withExtension: "json")!,
    computeUnits: .all  // .cpuAndGPU, .cpuOnly, .cpuAndNeuralEngine
)

// Recognize speech (samples must be 16kHz mono Float32)
let result = try asr.recognize(samples: audioSamples)

// Full recognized text
print(result.text)

// Individual segments with timestamps
for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
}

// Audio duration
print("Duration: \(result.audioDuration)s")
```
### Result Types

```swift
public struct ASRResult {
    let text: String             // Full recognized text
    let segments: [ASRSegment]   // Timestamped segments
    let audioDuration: Double    // Total audio duration in seconds
}

public struct ASRSegment {
    let start: Double   // Start time in seconds
    let end: Double     // End time in seconds
    let text: String    // Recognized text for this segment
}
```
### Encoder Embeddings

```swift
// Get encoder embeddings for downstream tasks
let encoded = try asr.encode(samples: audioSamples)
// Returns MLMultiArray with shape [1, 176, encodedFrames]
```
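If you need the raw values for a downstream task, the returned `MLMultiArray` can be read frame by frame. A minimal sketch continuing from the snippet above (loop bounds follow the documented `[1, 176, encodedFrames]` shape):

```swift
import CoreML

// Extract one 176-dimensional embedding vector per encoded frame.
let hiddenDim = encoded.shape[1].intValue   // 176
let frameCount = encoded.shape[2].intValue  // encodedFrames

var frames: [[Float]] = []
for t in 0..<frameCount {
    var vector = [Float](repeating: 0, count: hiddenDim)
    for d in 0..<hiddenDim {
        vector[d] = encoded[[0, d, t] as [NSNumber]].floatValue
    }
    frames.append(vector)
}
```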
## Supported Input Sizes

The model supports the following input sizes (audio is automatically padded):
| Duration | Samples | Mel Frames | Encoded Frames |
|---|---|---|---|
| 5 sec | 80,000 | 501 | 126 |
| 10 sec | 160,000 | 1,001 | 251 |
| 15 sec | 240,000 | 1,501 | 376 |
| 20 sec | 320,000 | 2,001 | 501 |
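The padding itself happens inside the library, but the bucket selection is easy to picture. A hypothetical sketch of the idea (not the library's internal code; bucket sizes mirror the table above at 16,000 samples per second):

```swift
// Supported bucket lengths in samples (5, 10, 15, 20 s at 16 kHz).
let buckets = [80_000, 160_000, 240_000, 320_000]

/// Zero-pads `samples` up to the smallest bucket that fits.
/// Illustrative only; the library performs this step internally.
func padToBucket(_ samples: [Float]) -> [Float] {
    guard let target = buckets.first(where: { $0 >= samples.count }) else {
        // Longer than 20 s: handled by VAD-based segmentation instead.
        return samples
    }
    return samples + [Float](repeating: 0, count: target - samples.count)
}
```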
## Long Audio Processing

For audio longer than 20 seconds, the library uses VAD (Voice Activity Detection) for intelligent segmentation:
- VAD Analysis: Detects speech vs silence regions
- Smart Merging: Merges speech segments with gaps < 0.3s
- Splitting: Splits segments longer than 20s into equal parts
- Filtering: Ignores segments shorter than 0.5s
- Recognition: Processes each segment independently
This approach provides accurate timestamps and avoids cutting words in the middle.
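The merging, splitting, and filtering rules above are plain interval arithmetic. A hedged sketch of the idea (not the library's actual implementation; thresholds taken from the list above):

```swift
import Foundation

struct Interval { var start: Double; var end: Double }

/// Merges speech intervals separated by gaps under 0.3 s, splits
/// anything longer than 20 s into equal parts, and drops intervals
/// shorter than 0.5 s. Illustrative only.
func segment(_ speech: [Interval]) -> [Interval] {
    // 1. Merge near-adjacent intervals (gap < 0.3 s).
    var merged: [Interval] = []
    for iv in speech.sorted(by: { $0.start < $1.start }) {
        if var last = merged.last, iv.start - last.end < 0.3 {
            last.end = max(last.end, iv.end)
            merged[merged.count - 1] = last
        } else {
            merged.append(iv)
        }
    }
    // 2. Split intervals longer than 20 s into equal parts.
    var split: [Interval] = []
    for iv in merged {
        let duration = iv.end - iv.start
        let parts = max(1, Int(ceil(duration / 20.0)))
        let step = duration / Double(parts)
        for p in 0..<parts {
            let s = iv.start + Double(p) * step
            split.append(Interval(start: s, end: s + step))
        }
    }
    // 3. Drop intervals shorter than 0.5 s.
    return split.filter { $0.end - $0.start >= 0.5 }
}
```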
## Example App

The repository includes a complete example app with audio recording and file import.
1. Open `ConformerExample/ConformerExample.xcodeproj` in Xcode
2. Add NeMoConformerASR as a local package:
   - File → Add Package Dependencies
   - Click "Add Local..."
   - Select the `NeMoConformerASR-iOS` folder
3. Download and add models to the project:
   - Download models from the link above
   - Unzip the archive
   - Drag `conformer_encoder.mlmodelc`, `conformer_decoder.mlmodelc`, and `vocabulary.json` into the `ConformerExample/Resources` folder in Xcode
   - Make sure "Copy items if needed" is checked
   - Verify files are added to "Copy Bundle Resources" in Build Phases
4. Build and run on device or simulator
### Features

- Record Audio: Tap to record from microphone, automatically converts to 16kHz mono
- Import Audio: Import any audio file (mp3, wav, m4a, etc.), automatically converts format
- Results: Shows recognized text, audio duration, and processing time
- Segments View: Displays individual speech segments with timestamps for long audio
## Model Details

- Model: nvidia/stt_en_conformer_ctc_small
- Parameters: 13.15M
- Architecture: Conformer encoder (16 layers) + CTC decoder
- Hidden dim: 176
- Attention heads: 4
- Vocabulary: 1024 BPE tokens + 1 blank
## Audio Requirements

- Sample rate: 16,000 Hz
- Channels: Mono
- Format: Float32
The example app handles conversion from any audio format automatically.
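In your own app you can do the same conversion with `AVAudioConverter`. A minimal sketch, assuming the input is readable by `AVAudioFile` (error handling trimmed for brevity):

```swift
import AVFoundation

/// Loads any audio file and converts it to 16 kHz mono Float32 samples.
/// Illustrative sketch; the bundled example app does this conversion for you.
func loadSamples(from url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                     sampleRate: 16_000,
                                     channels: 1,
                                     interleaved: false)!
    let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat)!

    // Read the whole file into a source buffer.
    let sourceBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                        frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: sourceBuffer)

    // Allocate an output buffer sized for the resampled audio.
    let ratio = targetFormat.sampleRate / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(sourceBuffer.frameLength) * ratio) + 1
    let outputBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity)!

    // Feed the source buffer once, then signal end of stream.
    var fed = false
    _ = converter.convert(to: outputBuffer, error: nil) { _, outStatus in
        if fed {
            outStatus.pointee = .endOfStream
            return nil
        }
        fed = true
        outStatus.pointee = .haveData
        return sourceBuffer
    }
    return Array(UnsafeBufferPointer(start: outputBuffer.floatChannelData![0],
                                     count: Int(outputBuffer.frameLength)))
}
```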
## Related Projects

- NeMoFeatureExtractor-iOS - Mel spectrogram extraction
- NeMoVAD-iOS - Voice Activity Detection for smart segmentation
## License

MIT License
## Acknowledgments

- NVIDIA NeMo - Original model and training