Overview
Voice Commands is a new feature introduced in v1.2.58. It appears as a dedicated page in the sidebar with a microphone icon. The system captures audio from your microphone, runs it through a local Whisper.cpp speech-to-text engine, and compares the transcribed text against a list of command presets you define. When a match is found, the linked action fires automatically.
Because everything runs locally via Whisper.cpp, there is no network latency, no API key is required, and your audio never leaves your PC.
Whisper.cpp Engine
VMSC integrates with Whisper.cpp, the high-performance C++ port of OpenAI’s Whisper model. The engine is not shipped with VMSC itself. Instead, you install it with a single click from within the Voice Commands page.
One-Click Install
Click the Install Whisper Engine button on the Voice Commands page. VMSC automatically downloads the correct binary from the official GitHub Releases page. Two variants are offered:
- CUDA — GPU-accelerated inference for NVIDIA GPUs. Significantly faster for medium and large models.
- CPU — Works on any machine. Recommended if you do not have an NVIDIA GPU or if you only plan to use smaller models.
The installer places the binary in VMSC’s application data directory and verifies its checksum before activating it.
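The checksum step can be pictured with a short sketch. The text above does not say which hash algorithm VMSC uses or where the expected value comes from, so SHA-256 and the function names `sha256_of_file` and `verify_download` are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large binaries are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_hex: str) -> bool:
    """True only if the file's digest equals the published value.

    SHA-256 is an assumption; the algorithm VMSC uses is not documented here.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

An installer built this way would activate the binary only when `verify_download` returns `True`, and would otherwise discard the file and report the failure.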
Model Management
After the engine is installed, open the Model Management panel to download one or more Whisper models. VMSC supports 9 models across two categories, English-only and multilingual:
| Model | Size | Languages | Notes |
|---|---|---|---|
| tiny | ~75 MB | English only | Fastest, lowest accuracy |
| tiny (multilingual) | ~75 MB | Multilingual | Same speed, multilingual support |
| base | ~142 MB | English only | Good balance for simple commands |
| base (multilingual) | ~142 MB | Multilingual | Recommended starting point |
| small | ~466 MB | English only | Noticeably better accuracy |
| small (multilingual) | ~466 MB | Multilingual | Great accuracy, moderate speed |
| medium | ~1.5 GB | English only | High accuracy, slower on CPU |
| medium (multilingual) | ~1.5 GB | Multilingual | Best multilingual accuracy below large |
| large-v3-turbo | ~3.1 GB | Multilingual | Highest accuracy, CUDA recommended |
Models are downloaded to a shared cache directory. You can have multiple models downloaded at once and switch between them without restarting.
Audio Capture Pipeline
VMSC captures microphone audio using the Web Audio API’s ScriptProcessorNode. The raw PCM data is downsampled from your device’s native sample rate to 16 kHz mono (the format Whisper expects), then encoded into a WAV buffer.
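The downsample-and-encode step can be sketched as follows. VMSC itself runs this inside the Web Audio pipeline, so the Python below is only an illustration of the technique. The linear-interpolation resampler, the 16-bit sample width, and the function names `downsample` and `encode_wav` are assumptions, not details taken from VMSC.

```python
import io
import struct
import wave

TARGET_RATE = 16_000  # 16 kHz, the rate stated in the text above


def downsample(samples: list[float], source_rate: int,
               target_rate: int = TARGET_RATE) -> list[float]:
    """Resample mono float samples by linear interpolation.

    A production pipeline would usually low-pass filter first to limit
    aliasing; that step is omitted to keep the sketch short.
    """
    if source_rate == target_rate or not samples:
        return list(samples)
    ratio = source_rate / target_rate
    out_len = int(len(samples) / ratio)
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * ratio
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def encode_wav(samples: list[float], rate: int = TARGET_RATE) -> bytes:
    """Pack float samples in [-1.0, 1.0] as a 16-bit mono PCM WAV file."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit (assumed)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

For a device running at 48 kHz, `encode_wav(downsample(samples, 48_000))` yields a WAV buffer with one third as many frames as the input.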
Silence Detection
The capture pipeline includes an energy-based silence detector. Audio is processed in 5-second cycles. At the end of each cycle, the engine checks whether the buffer contains speech above the energy threshold. If the buffer is silent, it is discarded to save processing time. If speech is detected, the WAV buffer is sent to the Whisper engine for transcription.
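A minimal sketch of an energy-based check of this kind is shown below. The RMS measure, the threshold of 0.01, the 100 ms frame size, and the names `rms_energy` and `contains_speech` are all assumptions for illustration. The text above states only that the detector is energy-based and runs once per 5-second cycle.

```python
import math

SAMPLE_RATE = 16_000   # samples per second after downsampling
CYCLE_SECONDS = 5      # cycle length stated in the text above


def rms_energy(samples: list[float]) -> float:
    """Root-mean-square energy of a block of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def contains_speech(samples: list[float], threshold: float = 0.01,
                    frame_size: int = 1600) -> bool:
    """True if any frame in the buffer is louder than the threshold.

    threshold and frame_size (100 ms at 16 kHz) are illustrative values.
    """
    for start in range(0, len(samples), frame_size):
        if rms_energy(samples[start:start + frame_size]) > threshold:
            return True
    return False
```

Checking short frames rather than the whole buffer is a deliberate choice in this sketch: a one-second command inside a five-second cycle would otherwise be averaged down by four seconds of silence and might fall below the threshold.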
Audio Input Device
Use the Audio Input dropdown at the top of the Voice Commands page to select which microphone or virtual audio device to capture from. The list is populated from the system’s available audio input devices and refreshes when devices are connected or disconnected.
Command Presets
A command preset defines a trigger phrase and the action that should fire when that phrase is detected in the transcript. You can create as many presets as you need.
Creating a Preset
- Click + New Command on the Voice Commands page.
- Enter a trigger phrase (e.g., “start the music”).
- Select a linked action from the action picker (any VMSC action is available).
- Optionally set a cooldown in seconds to prevent rapid repeated triggers.
- Toggle the preset enabled or disabled.
Fuzzy Matching
Trigger phrases use fuzzy matching with text normalization. Before comparison, both the transcript and the trigger phrase are:
- Converted to lowercase
- Stripped of punctuation and extra whitespace
- Compared using a similarity algorithm that tolerates minor transcription errors
This means a trigger phrase of “start the music” will still match a transcript of “Start the music!” or even “start da music”, depending on the confidence threshold.
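The normalization and comparison steps above can be sketched as follows. The similarity algorithm VMSC uses is not named in this page, so Python's `difflib.SequenceMatcher` stands in for it here, and the `min_similarity` cutoff of 0.8 is an arbitrary illustrative value rather than a VMSC default.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse extra whitespace."""
    lowered = text.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())


def similarity(transcript: str, trigger: str) -> float:
    """Similarity in [0.0, 1.0] between the two normalized strings."""
    return SequenceMatcher(None, normalize(transcript),
                           normalize(trigger)).ratio()


def matches(transcript: str, trigger: str,
            min_similarity: float = 0.8) -> bool:
    """True if the transcript is close enough to the trigger phrase.

    min_similarity is a hypothetical cutoff chosen for this sketch.
    """
    return similarity(transcript, trigger) >= min_similarity
```

With this stand-in metric, both “Start the music!” and “start da music” match the trigger “start the music”. Note that “stop the music” scores exactly as high as “start da music” against the same trigger, which shows why similar-sounding trigger phrases can collide under fuzzy matching.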
Cooldown
Each preset can have an optional cooldown period (in seconds). After the preset fires, it will not fire again until the cooldown expires. This prevents accidental double-triggers when you repeat yourself or the engine transcribes overlapping audio cycles.
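The cooldown behavior can be sketched as a small gate object kept per preset. The class name `CooldownGate`, its `try_fire` method, and the injectable `clock` parameter are hypothetical names introduced for this sketch.

```python
import time
from typing import Callable, Optional


class CooldownGate:
    """Allows a preset to fire at most once per cooldown period."""

    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the gate can be tested
        self._last_fired: Optional[float] = None

    def try_fire(self) -> bool:
        """Return True and start the cooldown if firing is allowed now."""
        now = self._clock()
        if (self._last_fired is not None
                and now - self._last_fired < self.cooldown_seconds):
            return False  # still cooling down, suppress this trigger
        self._last_fired = now
        return True
```

A monotonic clock is used instead of wall-clock time so that system clock adjustments cannot shorten or extend the cooldown. With a cooldown of 0 the gate never blocks, which corresponds to leaving the optional cooldown unset.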
Live Transcript
The bottom half of the Voice Commands page displays a live transcript of everything the engine hears. Each transcript entry shows:
- Timestamp — when the audio cycle was processed.
- Transcribed text — the raw output from Whisper.
- Confidence score — a percentage indicating how confident the engine is in the transcription.
- MATCHED badge — a green badge appears next to entries that triggered a command preset.
The transcript panel includes Save, Copy, and Clear buttons. Save exports the transcript as a timestamped text file.
Settings
The Voice Commands page exposes several configuration options at the top of the panel:
| Setting | Description |
|---|---|
| Audio Input Device | Select which microphone to capture from. |
| Language | Choose from 9 supported languages: English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese. The selection is passed to the Whisper engine as a hint, which improves accuracy for that language. |
| Confidence Threshold | A slider (0–100%) that sets the minimum confidence score required for a transcript to be considered for command matching. Lower values catch more speech but may produce false positives. |
| Whisper Model | Select which downloaded model to use for inference. |
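The role of the Confidence Threshold setting can be sketched as a gate applied before command matching. The `TranscriptEntry` type, the 0.0 to 1.0 confidence scale, and the inclusive comparison are assumptions made for this sketch. The table above states only that the slider sets a minimum score between 0 and 100%.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEntry:
    text: str
    confidence: float  # assumed to be a fraction between 0.0 and 1.0


def passes_threshold(entry: TranscriptEntry,
                     threshold_percent: float) -> bool:
    """True if the entry is confident enough to be checked against presets.

    Whether the real comparison is inclusive is an assumption.
    """
    return entry.confidence * 100 >= threshold_percent
```

Under this model, a threshold of 0% sends every transcript to command matching, while raising it discards low-confidence transcripts before any trigger phrase is compared.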
Tips & Best Practices
- Start with the base (multilingual) model — it offers the best speed-to-accuracy ratio for short command phrases.
- Keep trigger phrases 2–5 words long. Longer phrases are harder to match consistently.
- Use distinct trigger phrases that do not overlap with each other. Avoid phrases that are substrings of other phrases.
- Set a cooldown of 3–5 seconds on presets that trigger disruptive actions (e.g., playing a loud sound).
- If you stream in a noisy environment, raise the confidence threshold to reduce false positives.
- For GPU users, the large-v3-turbo model with CUDA delivers near-instant transcription even for complex sentences.