feat(vscode): multi-provider speech synthesis for AI responses#8839
Ghenghis wants to merge 29 commits into Kilo-Org:main
sync to main
    const [speechSettings, setSpeechSettings] = createSignal<SpeechSettings | null>(null)
    let lastSpokenMessageId = ""

    onMount(() => {
WARNING: Speech settings never refresh after the initial load
AppContent requests speechSettingsLoaded once on mount, but SpeechTab only sends updateSetting messages and the extension never pushes a refreshed settings payload back. In practice, toggling enabled, autoSpeak, or interruptOnType in the current webview will not change auto-speak behavior until the webview is reloaded.
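One possible way to close this gap is for the extension to echo the full settings back after every update. The sketch below is only an illustration: the message names `updateSetting` and `speechSettingsLoaded` come from this comment, while the message shapes, the `createSpeechSettingsHandler` helper, and the `post` callback are hypothetical stand-ins for the real extension plumbing.

```typescript
// Hypothetical message shapes; only the two message names come from the review comment.
type SpeechSettings = { enabled: boolean; autoSpeak: boolean; interruptOnType: boolean }
type WebviewMessage = { type: "updateSetting"; key: keyof SpeechSettings; value: boolean }
type ExtensionMessage = { type: "speechSettingsLoaded"; settings: SpeechSettings }

// Returns a handler that applies each update and pushes the refreshed payload back.
function createSpeechSettingsHandler(
  initial: SpeechSettings,
  post: (message: ExtensionMessage) => void,
): (message: WebviewMessage) => void {
  let current: SpeechSettings = { ...initial }
  return (message) => {
    // Apply the single changed key on top of the current settings.
    current = { ...current, [message.key]: message.value }
    // Echo the whole object so the webview can replace its signal in one step.
    post({ type: "speechSettingsLoaded", settings: { ...current } })
  }
}
```

With something like this in place, the existing `speechSettingsLoaded` listener in the webview would see every toggle without a reload.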
    region: ss.azure.region,
    apiKey: ss.azure.apiKey,
    voiceId: ss.azure.voiceId,
    pitch: ss.tuning.pitch + sentiment.pitchModifier,
WARNING: sentimentIntensity has no effect on synthesis
The new slider is persisted in settings, but the auto-speak path always applies the full detectSentiment() modifiers here. Changing kilo-code.new.speech.sentimentIntensity never scales these deltas, so the user-facing control does nothing.
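A minimal sketch of how the slider could scale the deltas before they are added to the tuning values. It assumes the intensity is a fraction between 0 and 1, which the PR does not confirm; `pitchModifier` appears in the diff above, while `rateModifier` and the `scaleSentiment` helper are hypothetical.

```typescript
// pitchModifier is taken from the diff; rateModifier is an assumed sibling field.
interface SentimentModifiers {
  pitchModifier: number
  rateModifier: number
}

// Scales sentiment deltas by a 0..1 intensity (assumed range), clamping bad input.
function scaleSentiment(modifiers: SentimentModifiers, intensity: number): SentimentModifiers {
  // Non-finite input falls back to full strength, matching the current behavior.
  const factor = Number.isFinite(intensity) ? Math.min(1, Math.max(0, intensity)) : 1
  return {
    pitchModifier: modifiers.pitchModifier * factor,
    rateModifier: modifiers.rateModifier * factor,
  }
}
```

The call site would then read roughly `pitch: ss.tuning.pitch + scaleSentiment(sentiment, intensity).pitchModifier`, so an intensity of 0 disables sentiment shaping entirely.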
    ensureAudioReady()

    _abortController = new AbortController()
    const cacheKey = SynthesisCache.hash(text, opts.voiceId, opts.style ?? "default", opts.pitch ?? 0, opts.rate ?? 1.0)
WARNING: Cache key omits several tuning inputs
The synthesis cache only keys on text, voice, style, pitch, and rate. Changing styleDegree, emphasis, pronunciations, or audioFormat can still reuse stale audio from a previous request, so preview and auto-speak will not reliably reflect the current settings.
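As a rough illustration, the key could be derived from every input that changes the audio. The sketch below does not reproduce `SynthesisCache.hash`; `buildCacheKey`, the `CacheKeyInput` shape, the `PronunciationEntry` field names, and the defaults for `styleDegree`, `emphasis`, and `audioFormat` are all assumptions. The defaults for `style`, `pitch`, and `rate` mirror the diff.

```typescript
// Field names below are hypothetical; the PR's PronunciationEntry may differ.
interface PronunciationEntry {
  word: string
  replacement: string
}

interface CacheKeyInput {
  providerId: string
  text: string
  voiceId: string
  style?: string
  styleDegree?: number
  pitch?: number
  rate?: number
  emphasis?: string
  audioFormat?: string
  pronunciations?: PronunciationEntry[]
}

// Serializes every synthesis-affecting input in a fixed order.
function buildCacheKey(input: CacheKeyInput): string {
  // Order is preserved because replacement order may affect the spoken text.
  const pronunciations = (input.pronunciations ?? []).map((p) => [p.word, p.replacement])
  return JSON.stringify([
    input.providerId,
    input.text,
    input.voiceId,
    input.style ?? "default",
    input.styleDegree ?? 1, // assumed default
    input.pitch ?? 0,
    input.rate ?? 1,
    input.emphasis ?? "none", // assumed default
    input.audioFormat ?? "default", // assumed default
    pronunciations,
  ])
}
```

The resulting string can be used directly as a map key or passed through the existing hash function if a shorter key is preferred.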
    // 6. Remove diff hunks (@@ ... @@, +/- lines)
    result = result.replace(/^@@\s.*@@.*$/gm, "")
    result = result.replace(/^[+-]{1,3}\s.*$/gm, "")
WARNING: This strips normal markdown bullet lists, not just diff hunks
`/^[+-]{1,3}\s.*$/gm` matches ordinary `- item` and `+ item` list lines. Because assistant responses in this UI are commonly formatted as bullet lists, auto-speak will drop large chunks of normal prose before it ever reaches Azure TTS.
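One way to narrow the filter is to drop `+`/`-` lines only while inside a hunk, that is, after a `@@ ... @@` header has been seen. The `stripDiffHunks` helper below is a hypothetical sketch of that idea, not the filter shipped in this PR.

```typescript
// Removes unified-diff hunks but leaves ordinary markdown bullets alone.
function stripDiffHunks(text: string): string {
  const kept: string[] = []
  let inHunk = false
  for (const line of text.split("\n")) {
    // A hunk header starts a region in which diff lines are dropped.
    if (/^@@\s.*@@/.test(line)) {
      inHunk = true
      continue
    }
    // Inside a hunk, added, removed, and context lines all belong to the diff.
    if (inHunk && /^[+\- ]/.test(line)) continue
    // Any other line (including a blank one) ends the hunk and is kept.
    inHunk = false
    kept.push(line)
  }
  return kept.join("\n")
}
```

A bullet list that follows a hunk with no blank line in between would still be swallowed, so this trades the false positives described above for a much rarer edge case.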
Code Review Summary
Status: 10 Issues Found | Recommendation: Address before merge
Other Observations (not in diff)
Issues found in unchanged code that cannot receive inline comments: None.
Files Reviewed (27 files)
Reviewed by gpt-5.4-20260305 · 4,072,210 tokens
    await speak(previewText(), p, {
      region: getRegion() || undefined,
      apiKey: getApiKey(),
      voiceId: voiceId ?? s.azure.voiceId,
WARNING: Non-Azure providers still fall back to an Azure voice ID
When the user switches providers, handleProviderChange() clears the selected voice but leaves the persisted fallback in speech.azure.voiceId. Preview then sends en-GB-MaisieNeural (or another Azure-specific id) to Google/OpenAI/ElevenLabs/Polly until the user manually picks a voice, and those providers do not recognize that id.
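A possible shape for the fallback is to accept a voice ID only if the active provider actually lists it, and otherwise use that provider's first voice. In the sketch below, `resolveVoiceId`, the `ProviderLike` shape, and the per-provider `lastUsed` map are hypothetical; the voice names in the usage note are taken from the OpenAI commit in this PR.

```typescript
// Minimal stand-ins for the PR's SpeechVoice / SpeechProvider types (assumed shapes).
interface SpeechVoice {
  id: string
}
interface ProviderLike {
  id: string
  voices: SpeechVoice[]
}

// Picks a voice the active provider recognizes, never a foreign provider's ID.
function resolveVoiceId(
  provider: ProviderLike,
  selected: string | undefined,
  lastUsed: Record<string, string | undefined>,
): string | undefined {
  const known = (id: string | undefined): boolean =>
    id !== undefined && provider.voices.some((voice) => voice.id === id)
  if (known(selected)) return selected
  // Fall back to the voice last used with this specific provider, if any.
  const remembered = lastUsed[provider.id]
  if (known(remembered)) return remembered
  // Otherwise use the provider's first voice, or undefined if it has none.
  return provider.voices[0]?.id
}
```

For a provider whose voices are `alloy` and `nova`, passing `en-GB-MaisieNeural` as the selected voice would resolve to `alloy` instead of being sent through unchanged.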
      description="Higher quality sounds better but uses more bandwidth and API quota"
    >
      <Select
        options={AUDIO_FORMATS}
WARNING: Audio format options are not provider-specific
This select always uses the Azure AUDIO_FORMATS values even when the active provider advertises a different capabilities.audioFormats set. For example, Google expects MP3/OGG_OPUS/LINEAR16, so choosing one of these Azure-only values will generate invalid synth requests.
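A sketch of how the select could be driven by the active provider instead. The `capabilities.audioFormats` field is named in this comment; the `FormatOption` shape and the `audioFormatOptions` and `coerceAudioFormat` helpers are hypothetical.

```typescript
// Assumed shapes: only the audioFormats capability name comes from the review comment.
interface ProviderCapabilities {
  audioFormats?: string[]
}
interface FormatOption {
  value: string
  label: string
}

// Builds select options from whatever the active provider advertises.
function audioFormatOptions(capabilities: ProviderCapabilities): FormatOption[] {
  return (capabilities.audioFormats ?? []).map((format) => ({ value: format, label: format }))
}

// Keeps the stored format only if the provider supports it; otherwise uses the first one.
function coerceAudioFormat(
  current: string | undefined,
  capabilities: ProviderCapabilities,
): string | undefined {
  const formats = capabilities.audioFormats ?? []
  if (current !== undefined && formats.indexOf(current) !== -1) return current
  return formats[0]
}
```

Running `coerceAudioFormat` on provider change would also repair a persisted Azure value such as `audio-24khz-48kbitrate-mono-mp3` before it reaches a Google request.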
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Api-Key": apiKey,
WARNING: Polly requests cannot authenticate with this header
Amazon Polly does not accept a raw access key in X-Api-Key; browser-side calls must be SigV4-signed or proxied through a backend that signs them. As written, every synthesis request here will be rejected, so the Polly provider is effectively nonfunctional.
Extended AzureVoice interface with description and styles fields. Organized with en-GB first (Maisie as default voice). Removed EDGE_TTS references -- Azure-only edition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- VoicePreset, SpeechSettings, PronunciationEntry interfaces
- DEFAULT_SPEECH_SETTINGS with en-GB-MaisieNeural default
- Speech message types added to WebviewMessage and ExtensionMessage unions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- tts-azure.ts: Azure REST API synthesis with SSML builder (prosody, styles, emphasis, custom pronunciations)
- speech-playback.ts: Web Audio API playback with LRU cache (32 entries), volume control, abort/cancel support
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
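For readers unfamiliar with the pattern, a 32-entry LRU cache can be sketched on top of `Map`, which iterates in insertion order. This is an illustrative `LruCache` class under that assumption, not the implementation in speech-playback.ts.

```typescript
// Illustrative LRU cache: the least recently used entry is evicted first.
class LruCache<V> {
  private readonly entries = new Map<string, V>()
  private readonly capacity: number

  constructor(capacity: number = 32) {
    this.capacity = capacity
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined
    const value = this.entries.get(key) as V
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key)
    this.entries.set(key, value)
    return value
  }

  set(key: string, value: V): void {
    this.entries.delete(key)
    this.entries.set(key, value)
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
  }

  get size(): number {
    return this.entries.size
  }
}
```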
Section 1: Connection + Global (collapsible)
- API key, region, enable/auto-speak toggles, volume, interaction mode, sentiment
Section 2: Voice Browser + Favorites
- search, locale filter, 125+ voice cards with star/preview, favorites chips bar
Section 3: Voice Fine-Tuning (collapsible)
- pitch, rate, volume, style chips, emphasis, pauses, pronunciations, presets
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added Speech tab between Context and Experimental tabs with speech-bubble icon. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- sendSpeechSettings(): reads all speech config from VS Code settings
- validateAzureKey(): tests Azure TTS endpoint with a probe synthesis
- Wired into init, reset, and message handler paths
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- 24 speech configuration properties under kilo-code.new.speech.*
- Covers connection, global, tuning, favorites, and presets
- Default voice: en-GB-MaisieNeural
- Updated displayName to "Kilo Code: Azure Voice Edition"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Watches session busy→idle transition to speak last assistant reply
- Strips markdown/code blocks/URLs for natural speech
- Interrupts playback on keydown when interruptOnType enabled
- Stops speech on session switch
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix eqeqeq warnings (== → === for null comparisons)
- Compact KiloProvider speech methods to stay within max-lines
- Add eslint-disable for complexity in message handler
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port 25-rule speech-text-filter.ts with 5-layer guardrails from source, update App.tsx to use filterTextForSpeech + detectSentiment instead of inline regex, add Azure TTS endpoint to CSP connect-src, compact switch cases in KiloProvider to stay under max-lines lint rule. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design for refactoring Azure-only speech into multi-provider architecture with Browser (free/offline) as default and 5 additional providers with free tiers (Azure, Google, OpenAI, ElevenLabs, Amazon Polly). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
15-task plan covering provider interface, 6 providers (Browser, Azure, Google, OpenAI, ElevenLabs, Polly), registry pattern, SpeechTab refactor, CSP/config updates, tests, and PR submission. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define SpeechVoice, SynthesisOptions, and SpeechProvider interfaces for multi-provider speech architecture. Add SpeechProviderRegistry with register/get/list/listByTier operations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
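The registry operations named in this commit could look roughly like the sketch below. The method names (`register`, `get`, `list`, `listByTier`) come from the commit message and the tier labels follow the provider dropdown groups described elsewhere in this PR; the `SpeechProvider` fields and the duplicate-registration check are assumptions.

```typescript
// Tier labels follow the "free / free-tier" groups mentioned in the PR; fields are assumed.
type ProviderTier = "free" | "free-tier"

interface SpeechProvider {
  id: string
  name: string
  tier: ProviderTier
  requiresApiKey: boolean
}

class SpeechProviderRegistry {
  private readonly providers = new Map<string, SpeechProvider>()

  // Rejecting duplicates is an assumption; the real registry may overwrite instead.
  register(provider: SpeechProvider): void {
    if (this.providers.has(provider.id)) {
      throw new Error(`Speech provider already registered: ${provider.id}`)
    }
    this.providers.set(provider.id, provider)
  }

  get(id: string): SpeechProvider | undefined {
    return this.providers.get(id)
  }

  // Returns providers in registration order.
  list(): SpeechProvider[] {
    return Array.from(this.providers.values())
  }

  listByTier(tier: ProviderTier): SpeechProvider[] {
    return this.list().filter((provider) => provider.tier === tier)
  }
}
```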
Implement BrowserProvider wrapping window.speechSynthesis with guards for non-browser environments. Free, offline, no API key required. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement AzureProvider that wraps tts-azure.ts and azure-voices.ts, mapping AzureVoice to SpeechVoice with full SSML/style capabilities and testConnection support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…code Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Neural2 and Studio voices across en-US, en-GB, en-AU, en-IN locales with SSML support and 4M chars/month free tier. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10 voices (alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer) with mp3/opus/aac/flac output and Bearer auth. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10 voices with actual ElevenLabs voice IDs, xi-api-key auth, and 10K chars/month free tier. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
20 neural voices across en-GB, en-US, en-AU, en-NZ, en-ZA, en-IE, en-IN with SSML/emphasis/pronunciation support. Notes SigV4 needed for production. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hard-coded Azure TTS with provider-agnostic speak() that accepts a SpeechProvider, delegates synthesis to provider.synthesize(), and handles both Blob results (Web Audio) and void results (Browser). Cache key now includes provider.id. stop() calls provider.stop() in addition to stopping any active AudioBufferSourceNode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add provider dropdown with optgroups (free / free-tier), per-provider config sections (API key, region, test button), and capability-gated tuning controls (styles, emphasis, pronunciations, audio formats). Voice browser now renders voices from the active provider instead of hard-coded Azure list. Extract ProviderConfigSection and ApiKeyRow sub-components to keep complexity manageable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e-up Expand connect-src CSP to allow Google TTS, OpenAI, ElevenLabs, and Amazon Polly endpoints. Add package.json config keys for provider selection and per-provider API credentials. Update SpeechSettings interface and DEFAULT_SPEECH_SETTINGS with provider field and optional per-provider config blocks. Wire sendSpeechSettings() to read and transmit all new provider settings to the webview. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hard-coded Azure API key/region checks in the busy-to-idle auto-speak effect with provider-agnostic flow: resolve provider from settings, check requiresApiKey against the correct per-provider key, and pass the provider to speak(). Add getApiKeyForProvider helper to map provider ID to the right credential field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
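The credential lookup described here might be sketched as follows. `getApiKeyForProvider` and `requiresApiKey` are named in the commit message; the provider ID strings, the per-provider settings blocks, and the `canAutoSpeak` helper are hypothetical.

```typescript
// Provider IDs and the settings layout are assumptions for illustration.
type ProviderId = "browser" | "azure" | "google" | "openai" | "elevenlabs" | "polly"

interface ProviderConfig {
  apiKey?: string
}

interface SpeechSettings {
  provider: ProviderId
  azure?: ProviderConfig
  google?: ProviderConfig
  openai?: ProviderConfig
  elevenlabs?: ProviderConfig
  polly?: ProviderConfig
}

// Maps a provider ID to its own credential field; the browser provider has none.
function getApiKeyForProvider(settings: SpeechSettings, id: ProviderId): string | undefined {
  if (id === "browser") return undefined
  const key = settings[id]?.apiKey?.trim()
  // Treat empty or whitespace-only keys as missing.
  return key ? key : undefined
}

// Auto-speak may proceed when no key is needed or the matching key is present.
function canAutoSpeak(
  settings: SpeechSettings,
  provider: { id: ProviderId; requiresApiKey: boolean },
): boolean {
  return !provider.requiresApiKey || getApiKeyForProvider(settings, provider.id) !== undefined
}
```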
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d6cfe12 to 2b07e5c
Showcase the multi-provider speech synthesis feature with 6 screenshots demonstrating provider selection, voice browser, fine-tuning controls, and both free (Browser) and premium (Azure) configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds a Speech tab to Settings with 6 text-to-speech providers, all with free tiers:
Architecture
- SpeechProvider interface + SpeechProviderRegistry (matches upstream provider pattern)
Key files
- webview-ui/src/types/voice.ts: Core type definitions
- webview-ui/src/data/speech-providers.ts: Registry
- webview-ui/src/utils/speech-providers/: 6 provider implementations
- webview-ui/src/utils/speech-playback.ts: Unified playback engine
- webview-ui/src/components/settings/SpeechTab.tsx: Settings UI
Test plan
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>