Overview

Voice Commands is a new feature introduced in v1.2.58. It appears as a dedicated page in the sidebar with a microphone icon. The system captures audio from your microphone, runs it through a local Whisper.cpp speech-to-text engine, and compares the transcribed text against a list of command presets you define. When a match is found, the linked action fires automatically.

Because transcription runs locally via Whisper.cpp, there is no network latency and no API key is required. Your audio never leaves your PC.

Whisper.cpp Engine

VMSC integrates with Whisper.cpp, the high-performance C++ port of OpenAI’s Whisper model. The engine binary is not shipped with VMSC itself. Instead, you install it with a single click from within the Voice Commands page.

One-Click Install

Click the Install Whisper Engine button on the Voice Commands page. VMSC automatically downloads the correct binary from the official GitHub Releases page. Two variants are offered:

CUDA — GPU-accelerated inference for NVIDIA GPUs. Significantly faster for medium and large models.
CPU — Works on any machine. Recommended if you do not have an NVIDIA GPU or if you only plan to use smaller models.

The installer places the binary in VMSC’s application data directory and verifies the checksum before activating.
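
As an illustration of what a checksum check before activation can look like, here is a minimal sketch. It assumes a SHA-256 digest and a published expected value; the hash algorithm VMSC actually uses is not documented here, and the names sha256_of_file and verify_download are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_hex: str) -> bool:
    """True only if the file's digest equals the expected value.

    Assumption: the expected digest is a hex string; surrounding whitespace
    and letter case are ignored.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

An installer built this way would only activate the binary when verify_download returns True, and would otherwise delete the file and report the failure.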

Model Management

After the engine is installed, open the Model Management panel to download one or more Whisper models. VMSC supports 9 models in two categories, English-only and multilingual:

Model	Size	Languages	Notes
tiny	~75 MB	English only	Fastest, lowest accuracy
tiny (multilingual)	~75 MB	Multilingual	Same speed, multilingual support
base	~142 MB	English only	Good balance for simple commands
base (multilingual)	~142 MB	Multilingual	Recommended starting point
small	~466 MB	English only	Noticeably better accuracy
small (multilingual)	~466 MB	Multilingual	Great accuracy, moderate speed
medium	~1.5 GB	English only	High accuracy, slower on CPU
medium (multilingual)	~1.5 GB	Multilingual	Best multilingual accuracy below large
large-v3-turbo	~3.1 GB	Multilingual	Highest accuracy, CUDA recommended

Models are downloaded to a shared cache directory. You can have multiple models downloaded at once and switch between them without restarting.

Audio Capture Pipeline

VMSC captures microphone audio using the Web Audio API’s ScriptProcessorNode. The raw PCM data is downsampled from your device’s native sample rate to 16 kHz mono (the format Whisper expects), then encoded into a WAV buffer.
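
The conversion steps above (mix to mono, downsample to 16 kHz, wrap in a WAV container) can be sketched as follows. This is an illustrative Python sketch, not VMSC's code: the real pipeline runs inside the Web Audio API, and the linear-interpolation resampler, the 16-bit sample width, and all function names here are assumptions.

```python
import io
import struct
import wave


def to_mono(frames: list[tuple[float, ...]]) -> list[float]:
    """Average the channels of each frame into one mono sample."""
    return [sum(frame) / len(frame) for frame in frames]


def downsample(samples: list[float], src_rate: int, dst_rate: int = 16_000) -> list[float]:
    """Resample by linear interpolation (no anti-aliasing filter; sketch only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    ratio = src_rate / dst_rate
    out = []
    for i in range(int(len(samples) / ratio)):
        pos = i * ratio
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def encode_wav(samples: list[float], rate: int = 16_000) -> bytes:
    """Pack float samples in [-1.0, 1.0] as 16-bit mono PCM in a WAV container."""
    pcm = struct.pack(
        f"<{len(samples)}h",
        *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

For example, one second of 48 kHz input passed through downsample(samples, 48_000) yields 16,000 samples, which encode_wav then turns into an in-memory WAV buffer.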

Silence Detection

The capture pipeline includes an energy-based silence detector. Audio is processed in 5-second cycles. At the end of each cycle, the engine checks whether the buffer contains speech above the energy threshold. If the buffer is silent, it is discarded to save processing time. If speech is detected, the WAV buffer is sent to the Whisper engine for transcription.

The 5-second cycle length is a deliberate trade-off between responsiveness and transcription accuracy. Shorter windows risk splitting words across cycles; longer windows add latency.
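
To make the silence check concrete, here is a minimal sketch of an energy-based detector over one 5-second cycle. The RMS measure, the frame size, the threshold value, and the minimum count of loud frames are all assumptions chosen for illustration; VMSC's actual values are not documented here.

```python
import math

SAMPLE_RATE = 16_000                              # Hz, the rate Whisper expects
CYCLE_SECONDS = 5                                 # cycle length described above
SAMPLES_PER_CYCLE = SAMPLE_RATE * CYCLE_SECONDS   # 80,000 samples per cycle
ENERGY_THRESHOLD = 0.01                           # hypothetical RMS threshold


def rms(samples: list[float]) -> float:
    """Root-mean-square energy of a block of samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def contains_speech(
    samples: list[float],
    frame_size: int = 400,          # 25 ms at 16 kHz (assumption)
    threshold: float = ENERGY_THRESHOLD,
    min_frames: int = 5,            # ignore isolated clicks (assumption)
) -> bool:
    """True if at least `min_frames` frames have RMS energy above `threshold`."""
    loud = 0
    for start in range(0, len(samples), frame_size):
        if rms(samples[start:start + frame_size]) > threshold:
            loud += 1
            if loud >= min_frames:
                return True
    return False
```

A cycle for which contains_speech returns False would be discarded; otherwise the buffer would go on to transcription.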

Audio Input Device

Use the Audio Input dropdown at the top of the Voice Commands page to select which microphone or virtual audio device to capture from. The list is populated from the system’s available audio input devices and refreshes when devices are connected or disconnected.

Command Presets

A command preset defines a trigger phrase and the action that should fire when that phrase is detected in the transcript. You can create as many presets as you need.

Creating a Preset

1. Click + New Command on the Voice Commands page.
2. Enter a trigger phrase (e.g., “start the music”).
3. Select a linked action from the action picker (any VMSC action is available).
4. Optionally set a cooldown in seconds to prevent rapid repeated triggers.
5. Use the toggle to enable or disable the preset.

Fuzzy Matching

Trigger phrases use fuzzy matching with text normalization. Before comparison, both the transcript and the trigger phrase are:

Converted to lowercase
Stripped of punctuation and extra whitespace

The normalized strings are then compared using a similarity algorithm that tolerates minor transcription errors.

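This means a trigger phrase of “start the music” will still match a transcript of “Start the music!”, and may even match “start da music” if the similarity algorithm considers it close enough. Separately, a transcript is only considered for matching when its confidence score meets the confidence threshold.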
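
The normalization and comparison steps can be sketched as below. The similarity algorithm VMSC uses is not specified, so this sketch substitutes Python's difflib.SequenceMatcher as a stand-in. The min_similarity parameter is hypothetical and is not the same thing as the Confidence Threshold setting.

```python
import re
import string
from difflib import SequenceMatcher

# Removes ASCII punctuation only; curly quotes and similar would need extra handling.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower().translate(_PUNCT_TABLE)).strip()


def similarity(transcript: str, trigger: str) -> float:
    """Similarity in [0.0, 1.0] between the normalized strings."""
    return SequenceMatcher(None, normalize(transcript), normalize(trigger)).ratio()


def matches(transcript: str, trigger: str, min_similarity: float = 0.8) -> bool:
    """True if the normalized strings are at least `min_similarity` alike."""
    return similarity(transcript, trigger) >= min_similarity
```

With this stand-in, “Start the music!” normalizes to exactly the trigger phrase, while “start da music” scores a little above 0.8, so whether it matches depends on where the cut-off is set.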

Cooldown

Each preset can have an optional cooldown period (in seconds). After the preset fires, it will not fire again until the cooldown expires. This prevents accidental double-triggers when you repeat yourself or the engine transcribes overlapping audio cycles.
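
A per-preset cooldown of this kind can be sketched with a monotonic clock. The Cooldown class and its try_fire method are hypothetical names used for illustration only.

```python
import time


class Cooldown:
    """Allows an action to fire at most once per `seconds` interval."""

    def __init__(self, seconds: float, clock=time.monotonic):
        self.seconds = seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_fired = None      # None means the preset has never fired

    def try_fire(self) -> bool:
        """Return True and start the cooldown if the preset may fire now."""
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self.seconds:
            return False             # still cooling down; ignore this match
        self._last_fired = now
        return True
```

A monotonic clock is used rather than wall-clock time so that system clock adjustments cannot shorten or extend the cooldown.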

Live Transcript

The bottom half of the Voice Commands page displays a live transcript of everything the engine hears. Each transcript entry shows:

Timestamp — when the audio cycle was processed.
Transcribed text — the raw output from Whisper.
Confidence score — a percentage indicating how confident the engine is in the transcription.
MATCHED badge — a green badge appears next to entries that triggered a command preset.

The transcript panel includes Save, Copy, and Clear buttons. Save exports the transcript as a timestamped text file.

Settings

The Voice Commands page exposes several configuration options at the top of the panel:

Setting	Description
Audio Input Device	Select which microphone to capture from.
Language	Choose from 9 supported languages: English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese. The selection is passed to the Whisper engine as a language hint, which improves accuracy for that language.
Confidence Threshold	A slider (0–100%) that sets the minimum confidence score required for a transcript to be considered for command matching. Lower values catch more speech but may produce false positives.
Whisper Model	Select which downloaded model to use for inference.

The Voice Commands system runs as a global singleton. Only one instance of the audio capture and Whisper engine runs at a time, regardless of how many command presets exist. Switching models or audio devices restarts the singleton.
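
The singleton behaviour described above can be sketched as follows. This is only an illustration of the pattern; EngineConfig, VoiceEngine, and their members are hypothetical names, not VMSC's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineConfig:
    model: str          # e.g. the selected Whisper model
    input_device: str   # e.g. the selected audio input device


class VoiceEngine:
    """Hypothetical stand-in for the capture-plus-Whisper pipeline."""

    _instance: Optional["VoiceEngine"] = None

    def __init__(self, config: EngineConfig):
        self.config = config
        self.running = True

    def stop(self) -> None:
        self.running = False

    @classmethod
    def get(cls, config: EngineConfig) -> "VoiceEngine":
        """Return the single running engine, restarting it if the config changed."""
        current = cls._instance
        if current is not None and current.config == config:
            return current           # same model and device: reuse the instance
        if current is not None:
            current.stop()           # model or device changed: stop the old one
        cls._instance = cls(config)
        return cls._instance
```

Command presets would all share whatever VoiceEngine.get returns, which is why adding presets does not start additional capture or inference instances.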

Tips & Best Practices

Start with the base (multilingual) model — it offers the best speed-to-accuracy ratio for short command phrases.
Keep trigger phrases 2–5 words long. Longer phrases are harder to match consistently.
Use distinct trigger phrases that do not overlap with each other. Avoid phrases that are substrings of other phrases.
Set a cooldown of 3–5 seconds on presets that trigger disruptive actions (e.g., playing a loud sound).
If you stream in a noisy environment, raise the confidence threshold to reduce false positives.
For GPU users, the large-v3-turbo model with CUDA delivers fast transcription even for complex sentences, though results still arrive once per 5-second capture cycle.
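
To check a set of trigger phrases for the overlap problem mentioned in the tips above, a small helper like the following could be used. It is a hypothetical utility, not part of VMSC, and it compares whole words rather than raw characters so that “art” is not flagged as overlapping with “start”.

```python
import re


def _words(phrase: str) -> list[str]:
    """Lowercase the phrase, drop punctuation, and split it into words."""
    return re.sub(r"[^\w\s]", "", phrase.lower()).split()


def find_overlaps(phrases: list[str]) -> list[tuple[str, str]]:
    """Return (shorter, longer) pairs where the shorter phrase's words
    appear contiguously inside the longer phrase."""
    tokenized = [(phrase, _words(phrase)) for phrase in phrases]
    overlaps = []
    for i, (shorter, short_words) in enumerate(tokenized):
        n = len(short_words)
        for j, (longer, long_words) in enumerate(tokenized):
            if i == j or n == 0 or n > len(long_words):
                continue
            if n == len(long_words) and i > j:
                continue  # identical phrases are reported once, not twice
            windows = (long_words[k:k + n] for k in range(len(long_words) - n + 1))
            if any(window == short_words for window in windows):
                overlaps.append((shorter, longer))
    return overlaps
```

Any pair it returns is a candidate for renaming, since a transcript containing the longer phrase would also contain the shorter one.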