NeMoConformerASR-iOS

Swift library for speech recognition using NVIDIA NeMo Conformer CTC model on iOS/macOS with CoreML.

Features

  • NVIDIA NeMo Conformer CTC Small model (13M parameters)
  • VAD-based smart segmentation for long audio (powered by NeMoVAD-iOS)
  • Returns both full text and timestamped segments
  • Automatic audio padding for any duration
  • Support for 5, 10, 15, and 20 second audio segments
  • Pure Swift implementation with CoreML backend

Requirements

  • iOS 16.0+ / macOS 13.0+
  • Xcode 15.0+
  • Swift 5.9+

Installation

Swift Package Manager

Add the following to your Package.swift:

dependencies: [
    .package(url: "https://github.com/Otosaku/NeMoConformerASR-iOS.git", from: "1.1.0")
]

Note: Version 1.1.0+ includes VAD-based segmentation with timestamped results. For the previous API returning plain text, use version 1.0.0.

Or in Xcode: File → Add Package Dependencies → Enter repository URL.
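
If you manage dependencies in Package.swift, the library also needs to appear in your target's dependencies. A minimal sketch; the target name "MyApp" is hypothetical, and the product name is assumed to match the module NeMoConformerASR:

targets: [
    .target(
        name: "MyApp",  // hypothetical target
        dependencies: [
            .product(name: "NeMoConformerASR", package: "NeMoConformerASR-iOS")
        ]
    )
]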

Download Models

Download the CoreML models from Google Drive:

Download Models (30 MB)

The archive contains:

  • conformer_encoder.mlmodelc - Conformer encoder (30 MB)
  • conformer_decoder.mlmodelc - CTC decoder (0.4 MB)
  • vocabulary.json - BPE vocabulary (1024 tokens)

Usage

Basic Recognition

import NeMoConformerASR

// Initialize with model paths
let asr = try NeMoConformerASR(
    encoderURL: Bundle.main.url(forResource: "conformer_encoder", withExtension: "mlmodelc")!,
    decoderURL: Bundle.main.url(forResource: "conformer_decoder", withExtension: "mlmodelc")!,
    vocabularyURL: Bundle.main.url(forResource: "vocabulary", withExtension: "json")!,
    computeUnits: .all  // .cpuAndGPU, .cpuOnly, .cpuAndNeuralEngine
)

// Recognize speech (samples must be 16kHz mono Float32)
let result = try asr.recognize(samples: audioSamples)

// Full recognized text
print(result.text)

// Individual segments with timestamps
for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
}

// Audio duration
print("Duration: \(result.audioDuration)s")

ASRResult Structure

public struct ASRResult {
    let text: String            // Full recognized text
    let segments: [ASRSegment]  // Timestamped segments
    let audioDuration: Double   // Total audio duration in seconds
}

public struct ASRSegment {
    let start: Double  // Start time in seconds
    let end: Double    // End time in seconds
    let text: String   // Recognized text for this segment
}
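
As a usage sketch (not part of the library), the timestamped segments can be turned into simple transcript lines:

import Foundation

// Illustrative helper: format segments as "start - end  text" lines.
func transcriptLines(_ result: ASRResult) -> [String] {
    result.segments.map { seg in
        String(format: "%6.2fs - %6.2fs  %@", seg.start, seg.end, seg.text)
    }
}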

Get Encoder Output

// Get encoder embeddings for downstream tasks
let encoded = try asr.encode(samples: audioSamples)
// Returns MLMultiArray with shape [1, 176, encodedFrames]
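
A minimal sketch of reading values back out of that MLMultiArray, continuing from encoded above; it assumes Float32 storage, and the choice of frame index 0 is arbitrary:

import CoreML

let hiddenDim = encoded.shape[1].intValue   // 176
let frameCount = encoded.shape[2].intValue

// Feature vector for the first encoded frame (indices: [batch, dim, frame])
var frame = [Float](repeating: 0, count: hiddenDim)
for d in 0..<hiddenDim {
    frame[d] = encoded[[0, NSNumber(value: d), 0]].floatValue
}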

Supported Input Durations

The model supports the following input sizes (audio is automatically padded):

Duration | Samples | Mel Frames | Encoded Frames
---------|---------|------------|---------------
5 sec    | 80,000  | 501        | 126
10 sec   | 160,000 | 1,001      | 251
15 sec   | 240,000 | 1,501      | 376
20 sec   | 320,000 | 2,001      | 501
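
As an illustration, the automatic padding amounts to picking the smallest supported size that fits the audio and zero-filling the remainder. The library does this internally; this helper is hypothetical:

// Supported segment sizes in samples at 16 kHz (5/10/15/20 s)
let supportedSampleCounts = [80_000, 160_000, 240_000, 320_000]

// Zero-pad audio up to the smallest supported size that fits it.
func padToSupportedSize(_ samples: [Float]) -> [Float]? {
    guard let target = supportedSampleCounts.first(where: { $0 >= samples.count }) else {
        return nil  // longer than 20 s: handled by VAD-based segmentation instead
    }
    return samples + [Float](repeating: 0, count: target - samples.count)
}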

Long Audio Processing

For audio longer than 20 seconds, the library uses VAD (Voice Activity Detection) for intelligent segmentation:

  1. VAD Analysis: Detects speech vs silence regions
  2. Smart Merging: Merges speech segments with gaps < 0.3s
  3. Splitting: Splits segments longer than 20s into equal parts
  4. Filtering: Ignores segments shorter than 0.5s
  5. Recognition: Processes each segment independently

This approach provides accurate timestamps and avoids cutting words in the middle.
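
A minimal sketch of the merge/split/filter steps under the thresholds listed above, assuming a VAD pass has already produced (start, end) speech intervals in seconds; the Interval type and planSegments helper are illustrative, not the library's API:

import Foundation

// Speech interval in seconds, as produced by a VAD pass.
struct Interval { var start: Double; var end: Double }

func planSegments(_ vadIntervals: [Interval],
                  maxGap: Double = 0.3,
                  maxLength: Double = 20.0,
                  minLength: Double = 0.5) -> [Interval] {
    // Merge intervals separated by gaps shorter than maxGap
    var merged: [Interval] = []
    for iv in vadIntervals.sorted(by: { $0.start < $1.start }) {
        if var last = merged.last, iv.start - last.end < maxGap {
            last.end = max(last.end, iv.end)
            merged[merged.count - 1] = last
        } else {
            merged.append(iv)
        }
    }
    // Split anything longer than maxLength into equal parts
    var split: [Interval] = []
    for iv in merged {
        let length = iv.end - iv.start
        let parts = max(1, Int(ceil(length / maxLength)))
        let step = length / Double(parts)
        for i in 0..<parts {
            let s = iv.start + Double(i) * step
            split.append(Interval(start: s, end: s + step))
        }
    }
    // Drop segments shorter than minLength
    return split.filter { $0.end - $0.start >= minLength }
}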

Example Project

The repository includes a complete example app with audio recording and file import.

Running the Example

  1. Open ConformerExample/ConformerExample.xcodeproj in Xcode

  2. Add NeMoConformerASR as a local package:

    • File → Add Package Dependencies
    • Click "Add Local..."
    • Select the NeMoConformerASR-iOS folder
  3. Download and add models to the project:

    • Download models from the link above
    • Unzip the archive
    • Drag conformer_encoder.mlmodelc, conformer_decoder.mlmodelc, and vocabulary.json into the ConformerExample/Resources folder in Xcode
    • Make sure "Copy items if needed" is checked
    • Verify files are added to "Copy Bundle Resources" in Build Phases
  4. Build and run on device or simulator

Example Features

  • Record Audio: Tap to record from the microphone; audio is automatically converted to 16 kHz mono (see the sketch after this list)
  • Import Audio: Import any audio file (mp3, wav, m4a, etc.); the format is converted automatically
  • Results: Shows recognized text, audio duration, and processing time
  • Segments View: Displays individual speech segments with timestamps for long audio
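
For reference, a hypothetical sketch of the Record Audio flow using AVAudioEngine, tapping the microphone and converting each buffer to 16 kHz mono Float32; the example app's actual implementation may differ:

import AVFoundation

// Illustrative recorder: taps the mic and accumulates 16 kHz mono Float32 samples.
final class MicRecorder {
    private let engine = AVAudioEngine()
    private(set) var samples: [Float] = []

    func start() throws {
        let input = engine.inputNode
        let inputFormat = input.outputFormat(forBus: 0)
        let target = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                   sampleRate: 16_000, channels: 1, interleaved: false)!
        let converter = AVAudioConverter(from: inputFormat, to: target)!

        input.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { [weak self] buffer, _ in
            let capacity = AVAudioFrameCount(
                Double(buffer.frameLength) * target.sampleRate / inputFormat.sampleRate) + 1
            guard let out = AVAudioPCMBuffer(pcmFormat: target, frameCapacity: capacity) else { return }
            var fed = false
            _ = converter.convert(to: out, error: nil) { _, status in
                if fed { status.pointee = .noDataNow; return nil }
                fed = true
                status.pointee = .haveData
                return buffer
            }
            let ptr = out.floatChannelData![0]
            // Note: appends on the audio thread for brevity; a real app should synchronize.
            self?.samples.append(contentsOf: UnsafeBufferPointer(start: ptr, count: Int(out.frameLength)))
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}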

Model Information

  • Model: nvidia/stt_en_conformer_ctc_small
  • Parameters: 13.15M
  • Architecture: Conformer encoder (16 layers) + CTC decoder
  • Hidden dim: 176
  • Attention heads: 4
  • Vocabulary: 1024 BPE tokens + 1 blank

Audio Requirements

  • Sample rate: 16,000 Hz
  • Channels: Mono
  • Format: Float32

The example app handles conversion from any audio format automatically.
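
For apps that don't reuse the example code, here is a hedged sketch of one way to meet these requirements with AVFoundation, decoding an arbitrary audio file and converting it to 16 kHz mono Float32; the helper name is illustrative:

import AVFoundation

// Illustrative helper: decode any audio file into samples for recognize(samples:).
func loadSamples16kMono(from url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    let target = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                               sampleRate: 16_000, channels: 1, interleaved: false)!
    guard let converter = AVAudioConverter(from: file.processingFormat, to: target) else {
        throw NSError(domain: "AudioConversion", code: -1, userInfo: nil)
    }

    // Read the whole file into a source buffer
    let sourceFrames = AVAudioFrameCount(file.length)
    let source = AVAudioPCMBuffer(pcmFormat: file.processingFormat, frameCapacity: sourceFrames)!
    try file.read(into: source)

    // Allocate an output buffer sized for the resampled audio
    let ratio = target.sampleRate / file.processingFormat.sampleRate
    let output = AVAudioPCMBuffer(pcmFormat: target,
                                  frameCapacity: AVAudioFrameCount(Double(sourceFrames) * ratio) + 1)!

    var fed = false
    var error: NSError?
    _ = converter.convert(to: output, error: &error) { _, status in
        if fed { status.pointee = .endOfStream; return nil }
        fed = true
        status.pointee = .haveData
        return source
    }
    if let error { throw error }

    let channel = output.floatChannelData![0]
    return Array(UnsafeBufferPointer(start: channel, count: Int(output.frameLength)))
}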

Dependencies

  • NeMoVAD-iOS - voice activity detection used for long-audio segmentation

License

MIT License

Acknowledgments

  • NVIDIA NeMo - source of the stt_en_conformer_ctc_small model
