Skip to content

feat(vscode): multi-provider speech synthesis for AI responses#8839

Open
Ghenghis wants to merge 29 commits intoKilo-Org:mainfrom
AiDave71:feat/azure-voice-studio
Open

feat(vscode): multi-provider speech synthesis for AI responses#8839
Ghenghis wants to merge 29 commits intoKilo-Org:mainfrom
AiDave71:feat/azure-voice-studio

Conversation

@Ghenghis
Copy link
Copy Markdown

@Ghenghis Ghenghis commented Apr 13, 2026

Summary

Adds a Speech tab to Settings with 6 text-to-speech providers, all with free tiers:

  • Browser (default) — Web Speech API, offline, no setup
  • Azure Cognitive Services — 500K chars/month free, SSML, 125+ voices
  • Google Cloud TTS — 4M chars/month free, Neural2 + Studio voices
  • OpenAI TTS — $5 free credit, 10 voices
  • ElevenLabs — 10K chars/month free, expressive voices
  • Amazon Polly — 5M chars/month free (12 months), SSML

Architecture

  • SpeechProvider interface + SpeechProviderRegistry (matches upstream provider pattern)
  • Provider-agnostic playback with LRU cache (32 entries)
  • 25-rule text filter with sentiment detection
  • Per-provider capabilities gating (SSML, styles, emphasis, pronunciations)
  • Auto-speak, interrupt-on-type, voice favorites & presets

Key files

  • webview-ui/src/types/voice.ts — Core type definitions
  • webview-ui/src/data/speech-providers.ts — Registry
  • webview-ui/src/utils/speech-providers/ — 6 provider implementations
  • webview-ui/src/utils/speech-playback.ts — Unified playback engine
  • webview-ui/src/components/settings/SpeechTab.tsx — Settings UI

Test plan

  • 95 unit tests passing (bun:test): registry, browser-provider, azure-provider, text-filter
  • ESLint: 0 errors across 14 speech files
  • esbuild: 5 bundles, 0 errors
  • VSIX built and installed in VS Code
  • Manual: enable speech, test Browser provider (no API key needed)
  • Manual: test Azure/Google/OpenAI/ElevenLabs/Polly with free-tier keys

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

const [speechSettings, setSpeechSettings] = createSignal<SpeechSettings | null>(null)
let lastSpokenMessageId = ""

onMount(() => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Speech settings never refresh after the initial load

AppContent requests speechSettingsLoaded once on mount, but SpeechTab only sends updateSetting messages and the extension never pushes a refreshed settings payload back. In practice, toggling enabled, autoSpeak, or interruptOnType in the current webview will not change auto-speak behavior until the webview is reloaded.

region: ss.azure.region,
apiKey: ss.azure.apiKey,
voiceId: ss.azure.voiceId,
pitch: ss.tuning.pitch + sentiment.pitchModifier,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: sentimentIntensity has no effect on synthesis

The new slider is persisted in settings, but the auto-speak path always applies the full detectSentiment() modifiers here. Changing kilo-code.new.speech.sentimentIntensity never scales these deltas, so the user-facing control does nothing.

ensureAudioReady()

_abortController = new AbortController()
const cacheKey = SynthesisCache.hash(text, opts.voiceId, opts.style ?? "default", opts.pitch ?? 0, opts.rate ?? 1.0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Cache key omits several tuning inputs

The synthesis cache only keys on text, voice, style, pitch, and rate. Changing styleDegree, emphasis, pronunciations, or audioFormat can still reuse stale audio from a previous request, so preview and auto-speak will not reliably reflect the current settings.


// 6. Remove diff hunks (@@ ... @@, +/- lines)
result = result.replace(/^@@\s.*@@.*$/gm, "")
result = result.replace(/^[+-]{1,3}\s.*$/gm, "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: This strips normal markdown bullet lists, not just diff hunks

/^[+-]{1,3}\s.*$/gm matches ordinary - item and + item list lines. Because assistant responses in this UI are commonly formatted as bullet lists, auto-speak will drop large chunks of normal prose before it ever reaches Azure TTS.

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot bot commented Apr 13, 2026

Code Review Summary

Status: 10 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 9
SUGGESTION 1

Fix these issues in Kilo Cloud

Issue Details (click to expand)

WARNING

File Line Issue
packages/kilo-vscode/webview-ui/src/App.tsx 235 Speech settings are only loaded once, so changing speech toggles in the same webview does not affect auto-speak until reload.
packages/kilo-vscode/webview-ui/src/App.tsx 287 sentimentIntensity is persisted but never applied when computing pitch/rate modifiers.
packages/kilo-vscode/webview-ui/src/utils/speech-playback.ts 29 The synthesis cache key omits tuning fields like styleDegree, emphasis, pronunciations, and audioFormat, which can replay stale audio.
packages/kilo-vscode/webview-ui/src/utils/speech-text-filter.ts 56 The diff-line regex also matches normal markdown bullets, causing valid assistant prose to be dropped before synthesis.
packages/kilo-vscode/webview-ui/src/components/settings/SpeechTab.tsx 364 Switching away from Azure still falls back to speech.azure.voiceId, so previews use an invalid voice id until the user manually reselects one.
packages/kilo-vscode/webview-ui/src/components/settings/SpeechTab.tsx 1065 The audio format select always uses Azure-specific values instead of the active provider’s advertised formats.
packages/kilo-vscode/webview-ui/src/utils/speech-providers/polly-provider.ts 99 Polly requests use an X-Api-Key header instead of AWS SigV4 signing, so synthesis calls will be rejected.
packages/app/e2e/fixtures.ts 156 The new seedModel override is never used because localStorage still hard-codes kilo/mistralai/codestral-2508, so env-configured e2e runs can seed the wrong model.
script/changelog.ts 51 Changelog generation now shells out to a global kilo binary, which makes bun script/changelog.ts fail in a fresh checkout that only has repo dependencies installed.

SUGGESTION

File Line Issue
README.md 69 For markdown documentation, use markdown image syntax like ![Image Name](./path.png) instead of HTML <img> tags.
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

None.

Files Reviewed (27 files)
  • docs/plans/2026-04-13-multi-provider-speech-design.md - 0 issues
  • docs/plans/2026-04-13-multi-provider-speech-implementation.md - 0 issues
  • packages/kilo-vscode/eslint.config.mjs - 0 issues
  • packages/kilo-vscode/package.json - 0 issues
  • packages/kilo-vscode/src/KiloProvider.ts - 0 issues
  • packages/kilo-vscode/src/webview-html-utils.ts - 0 issues
  • packages/kilo-vscode/tests/unit/azure-provider.test.ts - 0 issues
  • packages/kilo-vscode/tests/unit/browser-provider.test.ts - 0 issues
  • packages/kilo-vscode/tests/unit/speech-provider-registry.test.ts - 0 issues
  • packages/kilo-vscode/tests/unit/speech-text-filter.test.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/App.tsx - 2 issues
  • packages/kilo-vscode/webview-ui/src/components/settings/SpeechTab.tsx - 2 issues
  • packages/kilo-vscode/webview-ui/src/data/speech-providers.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/types/voice.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/utils/speech-playback.ts - 1 issue
  • packages/kilo-vscode/webview-ui/src/utils/speech-providers/azure-provider.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/utils/speech-providers/browser-provider.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/utils/speech-providers/elevenlabs-provider.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/utils/speech-providers/google-provider.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/utils/speech-providers/openai-provider.ts - 0 issues
  • packages/kilo-vscode/webview-ui/src/utils/speech-providers/polly-provider.ts - 1 issue
  • packages/kilo-vscode/src/agent-manager/run/service.ts - 0 issues
  • packages/opencode/src/cli/cmd/tui/plugin/runtime.ts - 0 issues
  • script/version.ts - 0 issues
  • script/changelog.ts - 1 issue
  • packages/app/e2e/fixtures.ts - 1 issue
  • README.md - 1 issue

Reviewed by gpt-5.4-20260305 · 4,072,210 tokens

@Ghenghis Ghenghis changed the title feat: Azure Voice Studio — Speech synthesis for AI responses feat(vscode): multi-provider speech synthesis for AI responses Apr 13, 2026
await speak(previewText(), p, {
region: getRegion() || undefined,
apiKey: getApiKey(),
voiceId: voiceId ?? s.azure.voiceId,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Non-Azure providers still fall back to an Azure voice ID

When the user switches providers, handleProviderChange() clears the selected voice but leaves the persisted fallback in speech.azure.voiceId. Preview then sends en-GB-MaisieNeural (or another Azure-specific id) to Google/OpenAI/ElevenLabs/Polly until the user manually picks a voice, and those providers do not recognize that id.

description="Higher quality sounds better but uses more bandwidth and API quota"
>
<Select
options={AUDIO_FORMATS}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Audio format options are not provider-specific

This select always uses the Azure AUDIO_FORMATS values even when the active provider advertises a different capabilities.audioFormats set. For example, Google expects MP3/OGG_OPUS/LINEAR16, so choosing one of these Azure-only values will generate invalid synth requests.

method: "POST",
headers: {
"Content-Type": "application/json",
"X-Api-Key": apiKey,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Polly requests cannot authenticate with this header

Amazon Polly does not accept a raw access key in X-Api-Key; browser-side calls must be SigV4-signed or proxied through a backend that signs them. As written, every synthesis request here will be rejected, so the Polly provider is effectively nonfunctional.

Ghenghis and others added 25 commits April 13, 2026 21:00
Extended AzureVoice interface with description and styles fields.
Organized with en-GB first (Maisie as default voice).
Removed EDGE_TTS references -- Azure-only edition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- VoicePreset, SpeechSettings, PronunciationEntry interfaces
- DEFAULT_SPEECH_SETTINGS with en-GB-MaisieNeural default
- Speech message types added to WebviewMessage and ExtensionMessage unions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- tts-azure.ts: Azure REST API synthesis with SSML builder
  (prosody, styles, emphasis, custom pronunciations)
- speech-playback.ts: Web Audio API playback with LRU cache (32 entries),
  volume control, abort/cancel support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Section 1: Connection + Global (collapsible) - API key, region,
  enable/auto-speak toggles, volume, interaction mode, sentiment
Section 2: Voice Browser + Favorites - search, locale filter,
  125+ voice cards with star/preview, favorites chips bar
Section 3: Voice Fine-Tuning (collapsible) - pitch, rate, volume,
  style chips, emphasis, pauses, pronunciations, presets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added Speech tab between Context and Experimental tabs
with speech-bubble icon.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- sendSpeechSettings(): reads all speech config from VS Code settings
- validateAzureKey(): tests Azure TTS endpoint with a probe synthesis
- Wired into init, reset, and message handler paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- 24 speech configuration properties under kilo-code.new.speech.*
- Covers connection, global, tuning, favorites, and presets
- Default voice: en-GB-MaisieNeural
- Updated displayName to "Kilo Code: Azure Voice Edition"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Watches session busy→idle transition to speak last assistant reply
- Strips markdown/code blocks/URLs for natural speech
- Interrupts playback on keydown when interruptOnType enabled
- Stops speech on session switch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix eqeqeq warnings (== → === for null comparisons)
- Compact KiloProvider speech methods to stay within max-lines
- Add eslint-disable for complexity in message handler

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port 25-rule speech-text-filter.ts with 5-layer guardrails from source,
update App.tsx to use filterTextForSpeech + detectSentiment instead of
inline regex, add Azure TTS endpoint to CSP connect-src, compact switch
cases in KiloProvider to stay under max-lines lint rule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design for refactoring Azure-only speech into multi-provider architecture
with Browser (free/offline) as default and 5 additional providers with
free tiers (Azure, Google, OpenAI, ElevenLabs, Amazon Polly).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15-task plan covering provider interface, 6 providers (Browser, Azure,
Google, OpenAI, ElevenLabs, Polly), registry pattern, SpeechTab refactor,
CSP/config updates, tests, and PR submission.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define SpeechVoice, SynthesisOptions, and SpeechProvider interfaces
for multi-provider speech architecture. Add SpeechProviderRegistry
with register/get/list/listByTier operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement BrowserProvider wrapping window.speechSynthesis with guards
for non-browser environments. Free, offline, no API key required.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement AzureProvider that wraps tts-azure.ts and azure-voices.ts,
mapping AzureVoice to SpeechVoice with full SSML/style capabilities
and testConnection support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Neural2 and Studio voices across en-US, en-GB, en-AU, en-IN locales
with SSML support and 4M chars/month free tier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10 voices (alloy, ash, ballad, coral, echo, fable, nova, onyx, sage,
shimmer) with mp3/opus/aac/flac output and Bearer auth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10 voices with actual ElevenLabs voice IDs, xi-api-key auth, and
10K chars/month free tier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
20 neural voices across en-GB, en-US, en-AU, en-NZ, en-ZA, en-IE,
en-IN with SSML/emphasis/pronunciation support. Notes SigV4 needed
for production.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hard-coded Azure TTS with provider-agnostic speak() that
accepts a SpeechProvider, delegates synthesis to provider.synthesize(),
and handles both Blob results (Web Audio) and void results (Browser).
Cache key now includes provider.id. stop() calls provider.stop() in
addition to stopping any active AudioBufferSourceNode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add provider dropdown with optgroups (free / free-tier), per-provider
config sections (API key, region, test button), and capability-gated
tuning controls (styles, emphasis, pronunciations, audio formats).
Voice browser now renders voices from the active provider instead of
hard-coded Azure list. Extract ProviderConfigSection and ApiKeyRow
sub-components to keep complexity manageable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e-up

Expand connect-src CSP to allow Google TTS, OpenAI, ElevenLabs, and
Amazon Polly endpoints. Add package.json config keys for provider
selection and per-provider API credentials. Update SpeechSettings
interface and DEFAULT_SPEECH_SETTINGS with provider field and optional
per-provider config blocks. Wire sendSpeechSettings() to read and
transmit all new provider settings to the webview.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ghenghis and others added 2 commits April 13, 2026 21:04
Replace hard-coded Azure API key/region checks in the busy-to-idle
auto-speak effect with provider-agnostic flow: resolve provider from
settings, check requiresApiKey against the correct per-provider key,
and pass the provider to speak(). Add getApiKeyForProvider helper to
map provider ID to the right credential field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Ghenghis Ghenghis force-pushed the feat/azure-voice-studio branch from d6cfe12 to 2b07e5c Compare April 14, 2026 04:06
Showcase the multi-provider speech synthesis feature with 6 screenshots
demonstrating provider selection, voice browser, fine-tuning controls,
and both free (Browser) and premium (Azure) configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant