# Sherpa-ONNX Server Setup
Sherpa-ONNX provides a local speech recognition server. Running it alongside the app enables private, offline transcription via the "Sherpa (local)" source option.
## 1. Download Server Binary
For macOS (Universal -- works on Intel and Apple Silicon):
```bash
mkdir -p ~/sherpa-onnx/bin && cd ~/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.23/sherpa-onnx-v1.12.23-osx-universal2-shared.tar.bz2
tar xf sherpa-onnx-v1.12.23-osx-universal2-shared.tar.bz2
cp sherpa-onnx-v1.12.23-osx-universal2-shared/bin/sherpa-onnx-online-websocket-server bin/
cp -r sherpa-onnx-v1.12.23-osx-universal2-shared/lib .
```

Note: macOS will quarantine the downloaded binary. Remove the quarantine attribute before running it:

```bash
xattr -r -d com.apple.quarantine ~/sherpa-onnx/
```

For other platforms, check the releases page.
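As a quick sanity check (a sketch, assuming the shared libraries resolve relative to the binary with the `bin/` and `lib/` layout above), run the server with no arguments; it should print its usage text rather than a library-loading error:

```bash
# Sanity check: with no arguments the server should print usage/help text
# instead of starting. A "dyld: Library not loaded" error means the lib/
# directory is missing or was not copied next to bin/ as shown above.
~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server
```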
## 2. Choose a Model
### Nemotron Streaming (~600 MB) -- recommended

NVIDIA's cache-aware streaming model (600M parameters, int8-quantized). Average WER 7.2%, with punctuation and capitalization. Trained on 285k hours of speech. See the model card for details.

**Download:**
```bash
cd ~/sherpa-onnx && mkdir -p models && cd models
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14.tar.bz2
tar xf sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14.tar.bz2
```

**Start server:**
```bash
~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --max-batch-size=1 \
  --loop-interval-ms=10 \
  --tokens=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/tokens.txt \
  --encoder=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/encoder.int8.onnx \
  --decoder=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/decoder.int8.onnx \
  --joiner=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/joiner.int8.onnx
```

### Zipformer Small (~55 MB) -- lightweight alternative

The fastest option with the lowest resource usage. No punctuation or capitalization, and lower accuracy.
**Download:**
```bash
cd ~/sherpa-onnx && mkdir -p models && cd models
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06.tar.bz2
```

**Start server:**
```bash
~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --max-batch-size=1 \
  --loop-interval-ms=10 \
  --tokens=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/tokens.txt \
  --encoder=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/encoder.onnx \
  --decoder=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/decoder.onnx \
  --joiner=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/joiner.onnx
```

For all available models, see the list of online transducer models.
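The two start commands differ only in the model directory and the `.int8` file suffix, so a small launcher script can cut the repetition. This is a hypothetical convenience wrapper, not part of sherpa-onnx; the `MODEL_DIR` variable and the int8-detection logic are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical launcher (not shipped with sherpa-onnx): starts the websocket
# server for whichever model directory is passed as $1, preferring the
# int8-quantized files when the directory contains them.
set -euo pipefail

MODEL_DIR="$1"   # e.g. ~/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06
SUFFIX=""
[ -f "$MODEL_DIR/encoder.int8.onnx" ] && SUFFIX=".int8"

exec ~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --max-batch-size=1 \
  --loop-interval-ms=10 \
  --tokens="$MODEL_DIR/tokens.txt" \
  --encoder="$MODEL_DIR/encoder$SUFFIX.onnx" \
  --decoder="$MODEL_DIR/decoder$SUFFIX.onnx" \
  --joiner="$MODEL_DIR/joiner$SUFFIX.onnx"
```

Usage would be, for example, `./start-sherpa.sh ~/sherpa-onnx/models/<model-dir>`.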
## 3. Server Flags
- `--port=6006` -- WebSocket port; must match the URL in the app (default `ws://localhost:6006`)
- `--max-batch-size=1` -- process requests immediately instead of batching (reduces latency for a single user)
- `--loop-interval-ms=10` -- server polling interval in milliseconds (lower = less latency)
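To confirm the server is up before pointing the app at it, a plain TCP check is enough. A minimal sketch using BSD netcat (bundled with macOS); the port must match whatever you passed to `--port`:

```bash
# Succeeds only if something is accepting TCP connections on port 6006.
nc -z localhost 6006 && echo "sherpa-onnx server is listening" \
                     || echo "nothing listening on port 6006"
```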