Voice Activity Detection (VAD) slicing tool powered by Silero VAD. Detects speech segments in WAV audio, splits them into separate files, and generates an interactive HTML report.
- VAD Detection: Silero VAD via ONNX Runtime (CGO)
- Adaptive VAD: Dynamic noise floor estimation, auto-adjusting threshold/min_speech, RMS energy post-filter (--adaptive)
- Audio Slicing: Split WAV by speech segments
- HTML Report: Interactive waveform, VAD confidence chart, per-segment audio players with playback speed control
- Streaming API: Feed audio chunks incrementally via Process()/Flush()
- CLI + HTTP Server: Command-line tool and web upload interface
make init # Install onnxruntime and download the Silero VAD model
make run-server # Build and start the HTTP server (localhost:8080)
make run-demo INPUT=audio.wav # Build and run the CLI analysis
make test # Run the tests
make clean # Remove build artifacts
# Install ONNX Runtime (macOS)
brew install onnxruntime
# Download Silero VAD model
pip3 install silero-vad
cp $(python3 -c "import silero_vad; import os; print(os.path.join(os.path.dirname(silero_vad.__file__),'data','silero_vad.onnx'))") .
# Run CLI (standard mode)
CGO_CFLAGS="-I/opt/homebrew/include/onnxruntime" \
CGO_LDFLAGS="-L/opt/homebrew/lib" \
go run ./cmd/demo --input audio.wav --model silero_vad.onnx
# Run CLI (adaptive mode — auto-adjusts to background noise)
CGO_CFLAGS="-I/opt/homebrew/include/onnxruntime" \
CGO_LDFLAGS="-L/opt/homebrew/lib" \
go run ./cmd/demo --input audio.wav --model silero_vad.onnx --adaptive
# Or run HTTP server
CGO_CFLAGS="-I/opt/homebrew/include/onnxruntime" \
CGO_LDFLAGS="-L/opt/homebrew/lib" \
go run ./cmd/server --model silero_vad.onnx
In noisy environments (e.g. shopping malls, exhibitions), standard VAD with a fixed threshold cannot balance sensitivity against false positives:
- Fixed low threshold → captures background chatter as speech
- Fixed high threshold → misses soft-spoken users in quiet moments
The --adaptive flag enables runtime adaptation with three mechanisms:
Noise floor estimation: computes the average RMS energy of the quietest 10% of audio frames. This captures the background noise level rather than speech energy, giving an accurate baseline even when the audio mixes speech and noise.
Threshold adaptation: maps the estimated noise floor to the VAD threshold and minimum speech duration:
| Noise Floor (dB) | Threshold | Min Speech (ms) |
|---|---|---|
| ≤ -50 | 0.5 | 250 |
| -50 to -40 | 0.5 → 0.7 | 250 → 400 |
| -40 to -35 | 0.7 → 0.8 | 400 → 500 |
| > -35 | 0.85 | 600 |
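One way to read the table above is as a piecewise mapping, sketched here in Go. The `vadParams` type and the `paramsForNoiseFloor` function are hypothetical names, and linear interpolation inside the two middle bands is an assumption based on the "→" ranges; the package may compute the values differently.

```go
package main

import "fmt"

// vadParams holds the two values the table adjusts (hypothetical type).
type vadParams struct {
	Threshold   float64 // speech probability threshold
	MinSpeechMS float64 // minimum speech duration in milliseconds
}

// lerp linearly interpolates between a and b for t in [0, 1].
func lerp(a, b, t float64) float64 { return a + (b-a)*t }

// paramsForNoiseFloor maps a noise floor in dB to VAD parameters following
// the table. Interpolating linearly within a band is an assumption.
func paramsForNoiseFloor(db float64) vadParams {
	switch {
	case db <= -50:
		return vadParams{Threshold: 0.5, MinSpeechMS: 250}
	case db <= -40:
		t := (db + 50) / 10 // 0 at -50 dB, 1 at -40 dB
		return vadParams{Threshold: lerp(0.5, 0.7, t), MinSpeechMS: lerp(250, 400, t)}
	case db <= -35:
		t := (db + 40) / 5 // 0 at -40 dB, 1 at -35 dB
		return vadParams{Threshold: lerp(0.7, 0.8, t), MinSpeechMS: lerp(400, 500, t)}
	default:
		return vadParams{Threshold: 0.85, MinSpeechMS: 600}
	}
}

func main() {
	for _, db := range []float64{-60, -45, -37.5, -30} {
		p := paramsForNoiseFloor(db)
		fmt.Printf("%6.1f dB -> threshold %.2f, min speech %.0f ms\n",
			db, p.Threshold, p.MinSpeechMS)
	}
}
```

Note that the table steps from 0.8 to 0.85 (and from 500 ms to 600 ms) just above -35 dB, so the sketch keeps that jump rather than smoothing it.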
RMS energy post-filter: discards detected segments whose average energy is below noiseFloor + 6 dB, filtering out distant or unintended speech.
import (
	"log"

	"github.com/LiusCraft/smart-vad/vad"
)
// Batch mode: run adaptive detection over a complete PCM buffer (pcm)
adaptCfg := vad.AdaptiveConfig{
	DetectorConfig: vad.Config{
		ModelPath:  "silero_vad.onnx",
		SampleRate: 16000,
	},
}
adaptDetector, err := vad.NewAdaptiveDetector(adaptCfg)
if err != nil {
	log.Fatal(err)
}
result, err := adaptDetector.Detect(pcm)
if err != nil {
	log.Fatal(err)
}
// Streaming mode: reset state, feed chunks in order, then flush
adaptDetector.Reset()
adaptDetector.Process(chunk1)
adaptDetector.Process(chunk2)
streamResult := adaptDetector.Flush()
├── vad/ # VAD detection package (streaming + batch)
├── slice/ # Audio slicing and WAV export
├── html/ # HTML report generation (embedded templates)
├── template/ # HTML templates (embedded via //go:embed)
├── cmd/demo/ # CLI entry point
└── cmd/server/ # HTTP server
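import "github.com/LiusCraft/smart-vad/vad"
// detector is an already-initialized detector from the vad package
// Batch mode: analyze a complete PCM buffer in one call
result, err := detector.Detect(pcm)
// Streaming mode: reset state, feed chunks in order, then flush
detector.Reset()
detector.Process(chunk1)
detector.Process(chunk2)
streamResult := detector.Flush()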
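For streaming, the audio has to be cut into chunks before it is fed to Process(). The helper below is a small sketch of that step only. `chunkPCM` is a hypothetical function, the `[]float32` sample type is an assumption (the package's actual input type is not shown here), and the chunk size of 512 is purely illustrative.

```go
package main

import "fmt"

// chunkPCM is a hypothetical helper that splits pcm into consecutive chunks
// of at most size samples. The final chunk may be shorter. The returned
// chunks share memory with pcm; no samples are copied.
func chunkPCM(pcm []float32, size int) [][]float32 {
	if size <= 0 {
		return nil
	}
	var chunks [][]float32
	for start := 0; start < len(pcm); start += size {
		end := start + size
		if end > len(pcm) {
			end = len(pcm)
		}
		chunks = append(chunks, pcm[start:end])
	}
	return chunks
}

func main() {
	pcm := make([]float32, 1200)
	for i, c := range chunkPCM(pcm, 512) {
		// In real use, each chunk would be passed to the detector's
		// Process method here, followed by one Flush call at the end.
		fmt.Printf("chunk %d: %d samples\n", i, len(c))
	}
}
```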