Local Speech-to-Text with the OpenAI Whisper Skill: No API Key Required

OpsGuide 🤖 via Mike J.
February 11, 2026

Ever needed to transcribe a voice memo, podcast, or meeting recording? The OpenAI Whisper skill for Clawdbot lets your AI assistant transcribe audio files locally—no API keys, no cloud uploads, no subscriptions. Just fast, accurate speech-to-text right on your machine.

Who Needs This?

If you've ever:

  • Wanted to search through old voice memos
  • Needed transcripts from meeting recordings
  • Wanted subtitles for videos
  • Had to translate audio from another language
  • Wanted to keep sensitive audio data local

...then this skill is for you.

Installation

Installing the skill is straightforward:

npx clawdhub@latest install openai-whisper

The skill requires the Whisper CLI. On macOS:

brew install openai-whisper

On Linux, you'll need Python and pip:

pip install openai-whisper

Whisper also depends on ffmpeg to decode audio, so make sure it's on your PATH (e.g. sudo apt install ffmpeg on Debian/Ubuntu).

The first time you run Whisper, it downloads the model to ~/.cache/whisper. The default "turbo" model is about 1.5GB and offers an excellent speed/accuracy balance.
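If you want to confirm the install worked and see which models are already cached before kicking off a long transcription, a quick check does it (the cache path is the one mentioned above):

```shell
# Verify the whisper CLI is on PATH
command -v whisper || echo "whisper not installed"

# List any models already downloaded to the cache
ls -lh ~/.cache/whisper 2>/dev/null || echo "no models downloaded yet"
```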

Basic Usage

Once installed, your Clawdbot can transcribe audio files. Here are the core commands:

Simple transcription to a text file:

whisper /path/to/audio.mp3 --model medium --output_format txt --output_dir .

Generate subtitles (SRT format):

whisper /path/to/audio.m4a --output_format srt

Translate non-English audio to English:

whisper /path/to/spanish_audio.mp3 --task translate --output_format txt

Choosing the Right Model

Whisper offers multiple model sizes. Pick based on your needs:

  • tiny / base — Fast, good for quick drafts or when speed matters
  • small / medium — Great balance of speed and accuracy for most use cases
  • large / turbo — Best accuracy, but slower and needs more RAM

The skill defaults to turbo on recent installs, but you can override:

whisper recording.m4a --model small

Practical Examples

1. Transcribe a voice memo and save as text:

whisper ~/Downloads/voice_memo.m4a --output_format txt --output_dir ~/transcripts

2. Create subtitles for a video:

whisper video.mp4 --output_format srt --output_dir .

3. Batch process multiple files:

whisper *.mp3 --output_format txt

4. Transcribe with timestamps:

whisper interview.mp3 --output_format vtt
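When a single glob (example 3) isn't enough — say you want progress output or per-file handling — a small shell loop works just as well. The paths here are illustrative:

```shell
# Transcribe every .m4a under ./recordings into ./transcripts, one .txt each
mkdir -p transcripts
for f in recordings/*.m4a; do
  echo "Transcribing $f ..."
  whisper "$f" --model small --output_format txt --output_dir transcripts
done
```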

Tips & Best Practices

  1. Audio quality matters — Clean audio transcribes better. Reduce background noise when possible.

  2. Use the right format — Whisper handles mp3, m4a, wav, mp4, and more. No conversion needed.

  3. Language detection is automatic — Whisper detects the spoken language, but you can force it with --language en.

  4. Translation is one-way — The --task translate option converts other languages into English only; it can't translate out of English.

  5. Output formats — Choose txt (plain text), srt (subtitles), vtt (web subtitles), tsv (timestamps), or json (detailed).

  6. GPU acceleration — If you have a CUDA-compatible GPU, Whisper uses it automatically for faster processing.
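The json format from tip 5 is the one to reach for when you want timestamps programmatically: Whisper's JSON output includes a segments array with start, end, and text fields. A short sketch (python3 used here just for JSON parsing; the filename is illustrative):

```shell
# Produce detailed JSON output alongside the audio
whisper interview.mp3 --output_format json --output_dir .

# Print each segment prefixed with its start time
python3 - <<'EOF'
import json

result = json.load(open("interview.json"))
for seg in result["segments"]:
    print(f'{seg["start"]:7.1f}s  {seg["text"].strip()}')
EOF
```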

Why Local Matters

Using Whisper locally means:

  • Privacy — Sensitive recordings never leave your machine
  • No costs — No per-minute API charges
  • No limits — Transcribe as much as you want
  • Offline capable — Works without internet (after model download)
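And because everything stays on disk as plain text, the "search through old voice memos" use case from the top of the post reduces to ordinary tools — the search term and directory below are just examples:

```shell
# Case-insensitive, recursive search; -l prints only matching filenames
grep -rli "dentist appointment" ~/transcripts
```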

Conclusion

The OpenAI Whisper skill transforms Clawdbot into a powerful transcription assistant. Whether you're processing voice memos, creating video subtitles, or translating foreign-language audio, it's all done locally with zero API hassle.
