Local Offline Text-to-Speech with the Sherpa-ONNX-TTS Skill

Voice output is a powerful way to make your AI assistant feel more natural and responsive. But most TTS solutions send your text to cloud APIs, which adds latency, cost, and privacy concerns. What if you could generate high-quality speech entirely offline, right on your local machine?

Enter the sherpa-onnx-tts skill: a Clawdbot integration that uses the sherpa-onnx runtime to convert text to speech without any network calls. It's fast, private, and works across macOS, Linux, and Windows.

Who Needs This?

Privacy-focused users who don't want text sent to external servers
Offline use cases where network connectivity is limited or unreliable
Developers building voice-enabled agents that need predictable, low-latency speech
Cost-conscious teams who want to avoid per-character API fees

Installation

First, install the skill from ClawdHub:

clawdhub install sherpa-onnx-tts

The skill requires two components: the sherpa-onnx runtime and a voice model. The installer will download both automatically, but here's what happens under the hood:

Runtime - The sherpa-onnx binary for your OS downloads to ~/.clawdbot/tools/sherpa-onnx-tts/runtime/
Voice model - The default Piper English (US) "lessac" voice downloads to ~/.clawdbot/tools/sherpa-onnx-tts/models/
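
If you want to confirm the installer did its job, a quick check like the one below can help. It is only a sketch: the function name check_sherpa_install is made up for this post, and it assumes the two directories live exactly where the paths above say they do.

```shell
# Hypothetical helper: report whether the runtime/ and models/ directories exist.
# The default base path is taken from the install locations described above.
check_sherpa_install() {
  base="${1:-$HOME/.clawdbot/tools/sherpa-onnx-tts}"
  rc=0
  for sub in runtime models; do
    if [ -d "$base/$sub" ]; then
      echo "ok: $base/$sub"
    else
      echo "missing: $base/$sub"
      rc=1
    fi
  done
  # Non-zero exit status if either directory is absent.
  return "$rc"
}

# Usage (uncomment to run):
# check_sherpa_install
```

A non-zero exit status means at least one of the two directories is missing, which is usually a sign the download step did not finish.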

Configuration

Add the skill to your Clawdbot config (~/.clawdbot/clawdbot.json):

{
  "skills": {
    "entries": {
      "sherpa-onnx-tts": {
        "env": {
          "SHERPA_ONNX_RUNTIME_DIR": "~/.clawdbot/tools/sherpa-onnx-tts/runtime",
          "SHERPA_ONNX_MODEL_DIR": "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-piper-en_US-lessac-high"
        }
      }
    }
  }
}

The key environment variables:

Variable	Purpose
`SHERPA_ONNX_RUNTIME_DIR`	Path to the sherpa-onnx binaries
`SHERPA_ONNX_MODEL_DIR`	Path to your chosen voice model
`SHERPA_ONNX_MODEL_FILE`	(Optional) Specific `.onnx` file if model dir has multiple
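
To make the relationship between the model variables concrete, here is one way a wrapper script might pick the .onnx file. This is an illustration, not the skill's actual code: the function name resolve_model_file is hypothetical, and so is the choice to treat SHERPA_ONNX_MODEL_FILE as a file name inside the model directory. The sketch also expands a leading "~/" by hand, because this post does not say whether Clawdbot expands the tilde in config values.

```shell
# Hypothetical helper: print the path of the .onnx file to load.
resolve_model_file() {
  dir="${SHERPA_ONNX_MODEL_DIR:-}"
  # Expand a leading "~/" manually in case the value arrived unexpanded.
  case "$dir" in
    "~/"*) dir=$HOME/${dir#"~/"} ;;
  esac
  # An explicit file wins; relative names are resolved inside the model dir.
  if [ -n "${SHERPA_ONNX_MODEL_FILE:-}" ]; then
    case "$SHERPA_ONNX_MODEL_FILE" in
      /*) echo "$SHERPA_ONNX_MODEL_FILE" ;;
      *)  echo "$dir/$SHERPA_ONNX_MODEL_FILE" ;;
    esac
    return 0
  fi
  # Otherwise the directory must contain exactly one .onnx file.
  count=0
  found=""
  for f in "$dir"/*.onnx; do
    [ -e "$f" ] || continue
    count=$((count + 1))
    found="$f"
  done
  if [ "$count" -eq 1 ]; then
    echo "$found"
  else
    echo "expected exactly one .onnx file in '$dir', found $count" >&2
    return 1
  fi
}
```

The error branch is the case where setting SHERPA_ONNX_MODEL_FILE becomes useful: with several .onnx files in one folder, there is no safe default.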

Usage Examples

Basic Speech Generation

Generate a WAV file from text:

sherpa-onnx-tts -o hello.wav "Hello! I'm speaking entirely offline."

This writes hello.wav to the current directory, ready to play in any audio player.

Piping Output

Combine with other tools for real-time playback:

sherpa-onnx-tts -o - "Breaking news: your build succeeded!" | afplay -

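(On Linux, use aplay instead of afplay. Not every player reads audio from standard input; if yours does not, write the WAV to a file and play that file instead.)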
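
If you would rather not depend on a player accepting piped input, a small wrapper can synthesize to a temporary file and play that. The sketch below uses only the -o flag shown in the examples above. Everything else is an assumption for illustration: the function names, the TTS_BIN override variable, and the list of players to look for (afplay, aplay, paplay are common defaults, not a list taken from the skill).

```shell
# Print the first audio player found on PATH.
pick_player() {
  for p in afplay aplay paplay; do
    if command -v "$p" >/dev/null 2>&1; then
      echo "$p"
      return 0
    fi
  done
  return 1
}

# Synthesize the text in $1 to a temporary WAV, play it, then clean up.
say_offline() {
  player="$(pick_player)" || { echo "no audio player found" >&2; return 1; }
  wav="$(mktemp "${TMPDIR:-/tmp}/tts.XXXXXX")" || return 1
  # TTS_BIN is a hypothetical override; it defaults to the skill's command.
  "${TTS_BIN:-sherpa-onnx-tts}" -o "$wav" "$1" && "$player" "$wav"
  rc=$?
  rm -f "$wav"
  return "$rc"
}

# Usage (uncomment to run):
# say_offline "Breaking news: your build succeeded!"
```

Going through a file costs a few milliseconds of disk I/O, but it behaves the same on every platform.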

Longer Text with Pauses

The model handles punctuation naturally—periods, commas, and question marks all produce appropriate pauses:

sherpa-onnx-tts -o reminder.wav "Don't forget: your meeting starts at 3pm. Want me to set a reminder?"

Choosing Different Voices

The sherpa-onnx project offers dozens of free voice models. Browse the tts-models releases for options including:

Multiple languages - German, French, Spanish, Chinese, and more
Voice varieties - Male, female, different accents
Quality tiers - Low, medium, and high quality options

To use a different voice, download the model archive, extract it, and point SHERPA_ONNX_MODEL_DIR in your Clawdbot config at the extracted model folder.
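
The download-and-extract steps can be scripted. Treat the following as a sketch under a few assumptions you should verify on the releases page: that the archives are .tar.bz2 files published under the tts-models release tag of the k2-fsa/sherpa-onnx GitHub repository, and that each archive unpacks into a folder with the same name as the archive. The function names are made up for this post.

```shell
# Assumed URL pattern for a tts-models release asset; confirm before relying on it.
voice_url() {
  echo "https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/$1.tar.bz2"
}

# Download and unpack one voice, then print the line to put in your environment.
# $1 = model name as listed on the releases page, $2 = models directory (optional).
install_voice() {
  name="$1"
  models="${2:-$HOME/.clawdbot/tools/sherpa-onnx-tts/models}"
  mkdir -p "$models" || return 1
  archive="$models/$name.tar.bz2"
  # -f fails on HTTP errors, -L follows GitHub's redirect to the asset host.
  curl -fL -o "$archive" "$(voice_url "$name")" || return 1
  tar -xjf "$archive" -C "$models" || return 1
  rm -f "$archive"
  # Assumes the archive extracted into a folder named after the model.
  echo "export SHERPA_ONNX_MODEL_DIR=\"$models/$name\""
}

# Usage (uncomment and substitute a name from the releases page):
# install_voice vits-piper-en_US-lessac-high
```

After it finishes, copy the printed path into the SHERPA_ONNX_MODEL_DIR entry of your clawdbot.json.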

Pro Tips

Model size vs. quality: High-quality models (~50MB) sound noticeably better than low-quality ones (~15MB). The latency difference is minimal on modern hardware.
Cache common phrases: If you have frequently spoken phrases (like greetings or confirmations), pre-generate them and cache the WAV files.
Speed control: Some models support adjusting speech rate via the --speed flag—check sherpa-onnx-tts --help for available options.
Integration with Clawdbot TTS: Once the skill is installed and the environment variables above are set, Clawdbot can use it as a TTS provider, giving your agent a voice in local-only deployments.
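
The caching tip is easy to automate. Here is a rough sketch that names each WAV after a hash of the text, so a phrase is only synthesized once. The function names, the TTS_BIN and TTS_CACHE_DIR variables, and the cache location are all hypothetical. The only thing taken from the skill is the -o flag shown earlier.

```shell
# Hash a string with whichever SHA-256 tool is available.
hash_text() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d ' ' -f 1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d ' ' -f 1
  fi
}

# Print the path of a WAV for the text in $1, generating it only on a cache miss.
cached_wav() {
  cache="${TTS_CACHE_DIR:-$HOME/.cache/sherpa-onnx-tts}"
  mkdir -p "$cache" || return 1
  # The model dir is part of the key so switching voices does not reuse old audio.
  wav="$cache/$(hash_text "${SHERPA_ONNX_MODEL_DIR:-}|$1").wav"
  if [ ! -s "$wav" ]; then
    "${TTS_BIN:-sherpa-onnx-tts}" -o "$wav" "$1" >&2 || { rm -f "$wav"; return 1; }
  fi
  echo "$wav"
}

# Usage (uncomment to run):
# afplay "$(cached_wav "Good morning!")"
```

Because the function prints only the file path on standard output, it composes with whatever player you already use.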

Troubleshooting

"Model file not found" - Ensure SHERPA_ONNX_MODEL_DIR points to the folder containing the .onnx file, not its parent
Slow first run - The model loads into memory on first use; subsequent calls are faster
Windows users - Run via Node.js: node sherpa-onnx-tts -o tts.wav "Your text"

Conclusion

The sherpa-onnx-tts skill brings high-quality, offline text-to-speech to Clawdbot without cloud dependencies. Whether you're building privacy-first agents, working in air-gapped environments, or just want predictable TTS performance, this skill delivers.
