# NeMoConformerASR-iOS

On-device speech-to-text for iOS/macOS powered by NVIDIA NeMo Conformer CTC Small (13M params). Pure Swift + CoreML implementation with automatic audio padding, chunking for long audio, and real-time recognition.
Swift library for speech recognition using the NVIDIA NeMo Conformer CTC model on iOS/macOS with CoreML.

## Features
- NVIDIA NeMo Conformer CTC Small model (13M parameters)
- VAD-based smart segmentation for long audio (powered by NeMoVAD-iOS)
- Returns both full text and timestamped segments
- Automatic audio padding for any duration
- Support for 5, 10, 15, and 20 second audio segments
- Pure Swift implementation with CoreML backend
## Requirements

- iOS 16.0+ / macOS 13.0+
- Xcode 15.0+
- Swift 5.9+
## Installation

Add the following to your `Package.swift`:
```swift
dependencies: [
    .package(url: "https://github.com/Otosaku/NeMoConformerASR-iOS.git", from: "1.1.0")
]
```
Note: Version 1.1.0+ includes VAD-based segmentation with timestamped results. For the previous API returning plain text, use version 1.0.0.
Or in Xcode: File → Add Package Dependencies → Enter repository URL.
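For reference, a complete minimal `Package.swift` might look like the following sketch. The target name `MyApp` is a placeholder, and the product name is assumed to match the library name:

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp",  // placeholder target/package name
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        .package(url: "https://github.com/Otosaku/NeMoConformerASR-iOS.git", from: "1.1.0")
    ],
    targets: [
        .target(
            name: "MyApp",
            // Product name assumed to match the library name
            dependencies: [.product(name: "NeMoConformerASR", package: "NeMoConformerASR-iOS")]
        )
    ]
)
```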
## Models

Download the CoreML models from Google Drive:
The archive contains:
- `conformer_encoder.mlmodelc` - Conformer encoder (30 MB)
- `conformer_decoder.mlmodelc` - CTC decoder (0.4 MB)
- `vocabulary.json` - BPE vocabulary (1024 tokens)
## Usage

```swift
import NeMoConformerASR

// Initialize with model paths
let asr = try NeMoConformerASR(
    encoderURL: Bundle.main.url(forResource: "conformer_encoder", withExtension: "mlmodelc")!,
    decoderURL: Bundle.main.url(forResource: "conformer_decoder", withExtension: "mlmodelc")!,
    vocabularyURL: Bundle.main.url(forResource: "vocabulary", withExtension: "json")!,
    computeUnits: .all  // .cpuAndGPU, .cpuOnly, .cpuAndNeuralEngine
)

// Recognize speech (samples must be 16kHz mono Float32)
let result = try asr.recognize(samples: audioSamples)

// Full recognized text
print(result.text)

// Individual segments with timestamps
for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
}

// Audio duration
print("Duration: \(result.audioDuration)s")
```
### Result Types

```swift
public struct ASRResult {
    let text: String             // Full recognized text
    let segments: [ASRSegment]   // Timestamped segments
    let audioDuration: Double    // Total audio duration in seconds
}

public struct ASRSegment {
    let start: Double   // Start time in seconds
    let end: Double     // End time in seconds
    let text: String    // Recognized text for this segment
}
```
### Encoder Embeddings

```swift
// Get encoder embeddings for downstream tasks
let encoded = try asr.encode(samples: audioSamples)
// Returns MLMultiArray with shape [1, 176, encodedFrames]
```
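If you need the raw values for a downstream task, the returned `MLMultiArray` can be read frame by frame. A minimal sketch continuing from the snippet above (loop bounds follow the documented `[1, 176, encodedFrames]` shape):

```swift
import CoreML

// Extract one 176-dimensional embedding vector per encoded frame.
let hiddenDim = encoded.shape[1].intValue   // 176
let frameCount = encoded.shape[2].intValue  // encodedFrames

var frames: [[Float]] = []
for t in 0..<frameCount {
    var vector = [Float](repeating: 0, count: hiddenDim)
    for d in 0..<hiddenDim {
        vector[d] = encoded[[0, d, t] as [NSNumber]].floatValue
    }
    frames.append(vector)
}
```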
## Supported Input Sizes

The model supports the following input sizes (audio is automatically padded):
| Duration | Samples | Mel Frames | Encoded Frames |
|---|---|---|---|
| 5 sec | 80,000 | 501 | 126 |
| 10 sec | 160,000 | 1,001 | 251 |
| 15 sec | 240,000 | 1,501 | 376 |
| 20 sec | 320,000 | 2,001 | 501 |
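The padding itself happens inside the library, but the bucket selection is easy to picture. A hypothetical sketch of the idea (not the library's internal code; bucket sizes mirror the table above at 16,000 samples per second):

```swift
// Supported bucket lengths in samples (5, 10, 15, 20 s at 16 kHz).
let buckets = [80_000, 160_000, 240_000, 320_000]

/// Zero-pads `samples` up to the smallest bucket that fits.
/// Illustrative only; the library performs this step internally.
func padToBucket(_ samples: [Float]) -> [Float] {
    guard let target = buckets.first(where: { $0 >= samples.count }) else {
        // Longer than 20 s: handled by VAD-based segmentation instead.
        return samples
    }
    return samples + [Float](repeating: 0, count: target - samples.count)
}
```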
## Long Audio Processing

For audio longer than 20 seconds, the library uses VAD (Voice Activity Detection) for intelligent segmentation:
- VAD Analysis: Detects speech vs silence regions
- Smart Merging: Merges speech segments with gaps < 0.3s
- Splitting: Splits segments longer than 20s into equal parts
- Filtering: Ignores segments shorter than 0.5s
- Recognition: Processes each segment independently
This approach provides accurate timestamps and avoids cutting words in the middle.
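The merging, splitting, and filtering rules above are plain interval arithmetic. A hedged sketch of the idea (not the library's actual implementation; thresholds taken from the list above):

```swift
import Foundation

struct Interval { var start: Double; var end: Double }

/// Merges speech intervals separated by gaps under 0.3 s, splits
/// anything longer than 20 s into equal parts, and drops intervals
/// shorter than 0.5 s. Illustrative only.
func segment(_ speech: [Interval]) -> [Interval] {
    // 1. Merge near-adjacent intervals (gap < 0.3 s).
    var merged: [Interval] = []
    for iv in speech.sorted(by: { $0.start < $1.start }) {
        if var last = merged.last, iv.start - last.end < 0.3 {
            last.end = max(last.end, iv.end)
            merged[merged.count - 1] = last
        } else {
            merged.append(iv)
        }
    }
    // 2. Split intervals longer than 20 s into equal parts.
    var split: [Interval] = []
    for iv in merged {
        let duration = iv.end - iv.start
        let parts = max(1, Int(ceil(duration / 20.0)))
        let step = duration / Double(parts)
        for p in 0..<parts {
            let s = iv.start + Double(p) * step
            split.append(Interval(start: s, end: s + step))
        }
    }
    // 3. Drop intervals shorter than 0.5 s.
    return split.filter { $0.end - $0.start >= 0.5 }
}
```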
## Example App

The repository includes a complete example app with audio recording and file import.
1. Open `ConformerExample/ConformerExample.xcodeproj` in Xcode
2. Add NeMoConformerASR as a local package:
   - File → Add Package Dependencies
   - Click "Add Local..."
   - Select the `NeMoConformerASR-iOS` folder
3. Download and add models to the project:
   - Download models from the link above
   - Unzip the archive
   - Drag `conformer_encoder.mlmodelc`, `conformer_decoder.mlmodelc`, and `vocabulary.json` into the `ConformerExample/Resources` folder in Xcode
   - Make sure "Copy items if needed" is checked
   - Verify files are added to "Copy Bundle Resources" in Build Phases
4. Build and run on device or simulator
### Features

- Record Audio: Tap to record from microphone, automatically converts to 16kHz mono
- Import Audio: Import any audio file (mp3, wav, m4a, etc.), automatically converts format
- Results: Shows recognized text, audio duration, and processing time
- Segments View: Displays individual speech segments with timestamps for long audio
## Model Details

- Model: nvidia/stt_en_conformer_ctc_small
- Parameters: 13.15M
- Architecture: Conformer encoder (16 layers) + CTC decoder
- Hidden dim: 176
- Attention heads: 4
- Vocabulary: 1024 BPE tokens + 1 blank
## Audio Requirements

- Sample rate: 16,000 Hz
- Channels: Mono
- Format: Float32
The example app handles conversion from any audio format automatically.
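In your own app you can do the same conversion with `AVAudioConverter`. A minimal sketch, assuming the input is readable by `AVAudioFile` (error handling trimmed for brevity):

```swift
import AVFoundation

/// Loads any audio file and converts it to 16 kHz mono Float32 samples.
/// Illustrative sketch; the bundled example app does this conversion for you.
func loadSamples(from url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                     sampleRate: 16_000,
                                     channels: 1,
                                     interleaved: false)!
    let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat)!

    // Read the whole file into a source buffer.
    let sourceBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                        frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: sourceBuffer)

    // Allocate an output buffer sized for the resampled audio.
    let ratio = targetFormat.sampleRate / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(sourceBuffer.frameLength) * ratio) + 1
    let outputBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity)!

    // Feed the source buffer once, then signal end of stream.
    var fed = false
    _ = converter.convert(to: outputBuffer, error: nil) { _, outStatus in
        if fed {
            outStatus.pointee = .endOfStream
            return nil
        }
        fed = true
        outStatus.pointee = .haveData
        return sourceBuffer
    }
    return Array(UnsafeBufferPointer(start: outputBuffer.floatChannelData![0],
                                     count: Int(outputBuffer.frameLength)))
}
```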
## Related Projects

- NeMoFeatureExtractor-iOS - Mel spectrogram extraction
- NeMoVAD-iOS - Voice Activity Detection for smart segmentation
## License

MIT License
## Acknowledgments

- NVIDIA NeMo - Original model and training