# Sherpa-ONNX Server Setup
Sherpa-ONNX provides a local speech recognition server. Running it alongside the app enables private, offline transcription via the "Sherpa (local)" source option.
## 1. Download Server Binary
For macOS (Universal -- works on Intel and Apple Silicon):
```bash
mkdir -p ~/sherpa-onnx/bin && cd ~/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.23/sherpa-onnx-v1.12.23-osx-universal2-shared.tar.bz2
tar xf sherpa-onnx-v1.12.23-osx-universal2-shared.tar.bz2
cp sherpa-onnx-v1.12.23-osx-universal2-shared/bin/sherpa-onnx-online-websocket-server bin/
cp -r sherpa-onnx-v1.12.23-osx-universal2-shared/lib .
```

Note: macOS will quarantine the downloaded binary. Remove the quarantine attribute before running it:

```bash
xattr -r -d com.apple.quarantine ~/sherpa-onnx/
```

For other platforms, check the releases page.
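As a quick sanity check (a sketch, assuming the shared libraries resolve relative to the binary with the `bin/` and `lib/` layout above), run the server with no arguments; it should print its usage text rather than a library-loading error:

```bash
# Sanity check: with no arguments the server should print usage/help text
# instead of starting. A "dyld: Library not loaded" error means the lib/
# directory is missing or was not copied next to bin/ as shown above.
~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server
```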
## 2. Choose a Model
### Nemotron Streaming (~600 MB) -- recommended

NVIDIA's cache-aware streaming model (600M parameters, int8-quantized). Average WER 7.2%, with punctuation and capitalization. Trained on 285k hours of speech. See the model card for details.

**Download:**
```bash
cd ~/sherpa-onnx && mkdir -p models && cd models
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14.tar.bz2
tar xf sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14.tar.bz2
```

**Start server:**
```bash
~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --max-batch-size=1 \
  --loop-interval-ms=10 \
  --tokens=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/tokens.txt \
  --encoder=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/encoder.int8.onnx \
  --decoder=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/decoder.int8.onnx \
  --joiner=$HOME/sherpa-onnx/models/sherpa-onnx-nemotron-speech-streaming-en-0.6b-int8-2026-01-14/joiner.int8.onnx
```

### Zipformer Small (~55 MB) -- lightweight alternative

The fastest option with the lowest resource usage. No punctuation or capitalization, and lower accuracy.
**Download:**
```bash
cd ~/sherpa-onnx && mkdir -p models && cd models
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06.tar.bz2
tar xf sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06.tar.bz2
```

**Start server:**
```bash
~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --max-batch-size=1 \
  --loop-interval-ms=10 \
  --tokens=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/tokens.txt \
  --encoder=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/encoder.onnx \
  --decoder=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/decoder.onnx \
  --joiner=$HOME/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06/joiner.onnx
```

For all available models, see the list of online transducer models.
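The two start commands differ only in the model directory and the `.int8` file suffix, so a small launcher script can cut the repetition. This is a hypothetical convenience wrapper, not part of sherpa-onnx; the `MODEL_DIR` variable and the int8-detection logic are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical launcher (not shipped with sherpa-onnx): starts the websocket
# server for whichever model directory is passed as $1, preferring the
# int8-quantized files when the directory contains them.
set -euo pipefail

MODEL_DIR="$1"   # e.g. ~/sherpa-onnx/models/sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06
SUFFIX=""
[ -f "$MODEL_DIR/encoder.int8.onnx" ] && SUFFIX=".int8"

exec ~/sherpa-onnx/bin/sherpa-onnx-online-websocket-server \
  --port=6006 \
  --max-batch-size=1 \
  --loop-interval-ms=10 \
  --tokens="$MODEL_DIR/tokens.txt" \
  --encoder="$MODEL_DIR/encoder$SUFFIX.onnx" \
  --decoder="$MODEL_DIR/decoder$SUFFIX.onnx" \
  --joiner="$MODEL_DIR/joiner$SUFFIX.onnx"
```

Usage would be, for example, `./start-sherpa.sh ~/sherpa-onnx/models/<model-dir>`.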
## 3. Server Flags
- `--port=6006` -- WebSocket port; must match the URL in the app (default `ws://localhost:6006`)
- `--max-batch-size=1` -- process requests immediately instead of batching (reduces latency for a single user)
- `--loop-interval-ms=10` -- server polling interval in milliseconds (lower = less latency)
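To confirm the server is up before pointing the app at it, a plain TCP check is enough. A minimal sketch using BSD netcat (bundled with macOS); the port must match whatever you passed to `--port`:

```bash
# Succeeds only if something is accepting TCP connections on port 6006.
nc -z localhost 6006 && echo "sherpa-onnx server is listening" \
                     || echo "nothing listening on port 6006"
```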