Local Offline Text-to-Speech with the Sherpa-ONNX-TTS Skill
Voice output is a powerful way to make your AI assistant feel more natural and responsive. But most TTS solutions require sending your text to cloud APIs, which introduces latency, costs, and privacy concerns. What if you could generate high-quality speech entirely offline, right on your local machine?
Enter the sherpa-onnx-tts skill: a Clawdbot integration that uses the sherpa-onnx runtime to convert text to speech without any network calls. It's fast, private, and works across macOS, Linux, and Windows.
Who Needs This?
- Privacy-focused users who don't want text sent to external servers
- Offline use cases where network connectivity is limited or unreliable
- Developers building voice-enabled agents that need predictable, low-latency speech
- Cost-conscious teams who want to avoid per-character API fees
Installation
First, install the skill from ClawdHub:
    clawdhub install sherpa-onnx-tts
The skill requires two components: the sherpa-onnx runtime and a voice model. The installer downloads both automatically, but here's what happens under the hood:
- Runtime - The sherpa-onnx binary for your OS downloads to ~/.clawdbot/tools/sherpa-onnx-tts/runtime/
- Voice model - The default Piper English (US) "lessac" voice downloads to ~/.clawdbot/tools/sherpa-onnx-tts/models/
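If you want to confirm that both downloads landed where expected, a small check like the one below can help. This is only a sketch: `check_sherpa_install` is a hypothetical helper name, not part of the skill, and it assumes nothing beyond the two directory locations listed above.

```shell
# Hypothetical helper (not shipped with the skill): reports whether the
# runtime/ and models/ directories exist under the install location.
# The default base path is the one named in this article.
check_sherpa_install() (
  base="${1:-$HOME/.clawdbot/tools/sherpa-onnx-tts}"
  status=0
  for sub in runtime models; do
    if [ -d "$base/$sub" ]; then
      echo "ok: $base/$sub"
    else
      echo "missing: $base/$sub"
      status=1
    fi
  done
  # The function body runs in a subshell, so this only sets its exit status.
  exit "$status"
)
```

Call `check_sherpa_install` with no arguments to inspect the default location, or pass a different base directory as the first argument.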
Configuration
Add the skill to your Clawdbot config (~/.clawdbot/clawdbot.json):
    {
      "skills": {
        "entries": {
          "sherpa-onnx-tts": {
            "env": {
              "SHERPA_ONNX_RUNTIME_DIR": "~/.clawdbot/tools/sherpa-onnx-tts/runtime",
              "SHERPA_ONNX_MODEL_DIR": "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-piper-en_US-lessac-high"
            }
          }
        }
      }
    }
The key environment variables:
| Variable | Purpose |
|---|---|
| SHERPA_ONNX_RUNTIME_DIR | Path to the sherpa-onnx binaries |
| SHERPA_ONNX_MODEL_DIR | Path to your chosen voice model |
| SHERPA_ONNX_MODEL_FILE | (Optional) Specific .onnx file to use if the model directory contains more than one |
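To make the relationship between SHERPA_ONNX_MODEL_DIR and SHERPA_ONNX_MODEL_FILE concrete, here is one plausible way the lookup could work. This is a sketch under assumptions, not the skill's actual logic: `resolve_model_file` is a hypothetical name, and treating SHERPA_ONNX_MODEL_FILE as either an absolute path or a filename relative to the model directory is my guess.

```shell
# Hypothetical sketch of model-file resolution (the skill may do this differently).
#   $1 = model directory (what SHERPA_ONNX_MODEL_DIR points to)
#   $2 = optional model file (what SHERPA_ONNX_MODEL_FILE would hold)
# Assumption: a relative $2 is resolved against the model directory.
resolve_model_file() (
  model_dir="$1"
  model_file="${2:-}"
  if [ -n "$model_file" ]; then
    case "$model_file" in
      /*) path="$model_file" ;;
      *)  path="$model_dir/$model_file" ;;
    esac
    [ -f "$path" ] || { echo "not found: $path" >&2; exit 1; }
    echo "$path"
    exit 0
  fi
  # No explicit file given: accept the directory only if it holds exactly one .onnx.
  count=0
  found=""
  for f in "$model_dir"/*.onnx; do
    [ -f "$f" ] || continue   # skips the literal pattern when nothing matches
    count=$((count + 1))
    found="$f"
  done
  case "$count" in
    1) echo "$found" ;;
    0) echo "no .onnx file in $model_dir" >&2; exit 1 ;;
    *) echo "$count .onnx files in $model_dir; set SHERPA_ONNX_MODEL_FILE" >&2; exit 1 ;;
  esac
)
```

For example, `resolve_model_file "$SHERPA_ONNX_MODEL_DIR" "$SHERPA_ONNX_MODEL_FILE"` prints the file that would be used, or explains on stderr why none could be chosen.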
Usage Examples
Basic Speech Generation
Generate a WAV file from text:
    sherpa-onnx-tts -o hello.wav "Hello! I'm speaking entirely offline."
This creates a high-quality WAV file that you can play back immediately.
Piping Output
Combine with other tools for real-time playback:
    sherpa-onnx-tts -o - "Breaking news: your build succeeded!" | afplay -
(On Linux, use aplay instead of afplay.)
Longer Text with Pauses
The model handles punctuation naturally, so periods, commas, and question marks all produce appropriate pauses:
    sherpa-onnx-tts -o reminder.wav "Don't forget: your meeting starts at 3pm. Want me to set a reminder?"
Choosing Different Voices
The sherpa-onnx project offers dozens of free voice models. Browse the tts-models releases for options including:
- Multiple languages - German, French, Spanish, Chinese, and more
- Voice varieties - Male, female, different accents
- Quality tiers - Low, medium, and high quality options
To use a different voice, download the model archive, extract it, and update your SHERPA_ONNX_MODEL_DIR to point to the new model folder.
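The extract-and-point step can be scripted roughly as follows. Treat this as a sketch: `install_voice` is a hypothetical helper, and it assumes the archive you downloaded unpacks into a single top-level folder holding the model files. No download URL is included, so fetch the archive yourself from the releases page first.

```shell
# Hypothetical helper: extracts an already-downloaded voice archive into the
# models directory and prints the folder to use as SHERPA_ONNX_MODEL_DIR.
# Assumption: the archive contains one top-level folder with the model files.
#   $1 = path to the archive, $2 = optional models directory
install_voice() (
  archive="$1"
  models_dir="${2:-$HOME/.clawdbot/tools/sherpa-onnx-tts/models}"
  mkdir -p "$models_dir" || exit 1
  # Take the first archive entry and keep only its leading path component.
  top=$(tar -tf "$archive" | sed -n -e '1s|^\./||' -e '1s|/.*||' -e '1p')
  [ -n "$top" ] || { echo "could not determine top-level folder in $archive" >&2; exit 1; }
  # tar detects the compression format from the archive itself.
  tar -xf "$archive" -C "$models_dir" || exit 1
  echo "$models_dir/$top"
)
```

Running `install_voice path/to/archive` prints the extracted folder; copy that path into the SHERPA_ONNX_MODEL_DIR entry in ~/.clawdbot/clawdbot.json.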
Pro Tips
- Model size vs. quality: High-quality models (~50 MB) sound noticeably better than low-quality ones (~15 MB). The latency difference is minimal on modern hardware.
- Cache common phrases: If you have frequently spoken phrases (like greetings or confirmations), pre-generate them and cache the WAV files.
- Speed control: Some models support adjusting speech rate via the --speed flag; check sherpa-onnx-tts --help for available options.
- Integration with Clawdbot TTS: If you configure this skill properly, Clawdbot can use it as a TTS provider, giving your agent a voice in local-only deployments.
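The caching tip above could look something like this in practice. It is a minimal sketch, not a feature of the skill: `speak_cached`, `TTS_CMD`, and `TTS_CACHE_DIR` are names invented here, the cache location is arbitrary, and the only CLI usage it relies on is the `-o FILE "text"` form shown earlier. The `cksum` key is short and convenient but not collision-proof, so swap in a stronger hash if that matters to you.

```shell
# Hypothetical caching wrapper. TTS_CMD and TTS_CACHE_DIR are invented for this
# sketch; the default command is the sherpa-onnx-tts CLI used in this article.
#   $1 = text to speak; prints the path of the cached WAV file.
speak_cached() (
  text="$1"
  cache_dir="${TTS_CACHE_DIR:-$HOME/.cache/sherpa-onnx-tts}"
  tts_cmd="${TTS_CMD:-sherpa-onnx-tts}"
  mkdir -p "$cache_dir" || exit 1
  # Key on the voice and the text so switching models does not reuse old audio.
  key=$(printf '%s\n%s' "${SHERPA_ONNX_MODEL_DIR:-}" "$text" | cksum | tr ' ' '-')
  wav="$cache_dir/$key.wav"
  if [ ! -s "$wav" ]; then
    # Generate under a temporary name so a failed run leaves no partial entry.
    # Any tool output goes to stderr so stdout carries only the file path.
    "$tts_cmd" -o "$wav.tmp" "$text" >&2 || { rm -f "$wav.tmp"; exit 1; }
    mv "$wav.tmp" "$wav" || exit 1
  fi
  echo "$wav"
)
```

On macOS, `afplay "$(speak_cached "Build succeeded")"` would synthesize the phrase the first time and replay the stored file on later calls.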
Troubleshooting
- "Model file not found" - Ensure SHERPA_ONNX_MODEL_DIR points to the folder containing the .onnx file, not its parent
- Slow first run - The model loads into memory on first use; subsequent calls are faster
- Windows users - Run via Node.js:
    node sherpa-onnx-tts -o tts.wav "Your text"
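For the "Model file not found" case, a quick way to spot a path that points one level too high is to look for .onnx files in the directory and in its immediate children. The helper below is a hypothetical diagnostic sketch (`suggest_model_dir` is not part of the skill) and only inspects those two levels.

```shell
# Hypothetical diagnostic: prints the directory that actually holds a .onnx
# file, checking the given directory first and then its immediate children.
#   $1 = the value you currently have in SHERPA_ONNX_MODEL_DIR
suggest_model_dir() (
  dir="$1"
  for f in "$dir"/*.onnx; do
    if [ -f "$f" ]; then echo "$dir"; exit 0; fi
  done
  for f in "$dir"/*/*.onnx; do
    if [ -f "$f" ]; then dirname "$f"; exit 0; fi
  done
  echo "no .onnx file at or one level below $dir" >&2
  exit 1
)
```

If the printed path differs from your current setting, SHERPA_ONNX_MODEL_DIR is probably pointing at the parent folder.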
Conclusion
The sherpa-onnx-tts skill brings high-quality, offline text-to-speech to Clawdbot without cloud dependencies. Whether you're building privacy-first agents, working in air-gapped environments, or just want predictable TTS performance, this skill delivers.