Conversation
Walkthrough
This PR adds Audio Speech Recognition (ASR/transcription) support to the volcengine relay adapter by intercepting audio transcription requests and routing them through Doubao's submit/query ASR API.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Adapter as Volcengine Adapter
    participant Upload as Relay Upload Helper
    participant ASR as Volcengine ASR API
    Client->>Adapter: AudioTranscription Request + Audio File
    Adapter->>Upload: Upload Audio File
    Upload-->>Adapter: Audio URL + Format
    Adapter->>Adapter: Generate Request ID, Marshal Submit Payload
    Adapter->>ASR: POST /submit (Audio URL, Metadata)
    ASR-->>Adapter: Submit Response (Request ID, Status Header)
    Adapter->>Adapter: Store Request ID & Body in Context
    Adapter-->>Client: Return (nil, nil) - Defer Response
    loop Poll Until Terminal Status
        Adapter->>ASR: GET /query (Request ID)
        ASR-->>Adapter: Query Response (Status, Result/Message)
    end
    alt Success
        Adapter->>Adapter: Convert Result to Text or Verbose JSON
        Adapter-->>Client: Transcription Response + Usage
    else Timeout/Error
        Adapter-->>Client: Error Response
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@relay/channel/volcengine/asr.go`:
- Around line 116-121: The code currently passes OpenAI ISO-639-1 codes straight
through (request.Language → audio.Language) which Doubao rejects; modify the
block that unmarshals request.Language (the common.Unmarshal usage) to map
two-letter ISO codes to BCP-47 locales using a small lookup table for common
languages (e.g., "en"→"en-US", "es"→"es-MX", "de"→"de-DE"), set audio.Language
to the mapped value, and return an error or leave audio.Language unset when the
ISO code is not in the map (i.e., reject unmapped values) so Doubao receives
only valid BCP-47 codes.
- Around line 215-223: The switch handling responseFormat must properly support
"srt" and "vtt" instead of defaulting to JSON: detect when responseFormat ==
"srt" or "vtt", iterate over result.Result.Utterances (which contain Text,
StartTime, EndTime) and render subtitle output (SRT with numeric cue indices and
"HH:MM:SS,mmm" timestamps; VTT with "WEBVTT" header and "HH:MM:SS.mmm"
timestamps), set the appropriate Content-Type ("text/plain; charset=utf-8" for
both or "text/vtt" for vtt), and return the generated bytes via c.Data; if you
prefer not to implement rendering, explicitly return a 400-level error (e.g.,
c.JSON with http.StatusBadRequest) for unsupported formats. Keep
convertToVerboseJSON and the JSON/default branch intact for "text" and "json"
respectively, and reference result.Result.Utterances when building the subtitle
output.
- Line 107: detectAudioFormat currently defaults unknown extensions to "mp3",
causing incorrect metadata; change detectAudioFormat to return (string, error)
and return an error for any extension not in the allowed set
{"raw","wav","mp3","ogg"}, then update the call in convertASRRequest (where
audioFormat := detectAudioFormat(...)) to receive (audioFormat, err) and
propagate the error (return it) instead of proceeding; ensure error text clearly
states "unsupported audio format" and includes the original filename/extension.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: add10d9a-f8d5-4503-b6b6-4d0db12b6a8b
📒 Files selected for processing (3):
- relay/channel/volcengine/adaptor.go
- relay/channel/volcengine/asr.go
- relay/channel/volcengine/constants.go
```go
	return nil, fmt.Errorf("failed to upload audio file: %w", err)
}

audioFormat := detectAudioFormat(fileHeader.Filename)
```
🧩 Analysis chain

🏁 Scripts executed (against QuantumNous/new-api):

```shell
cat -n relay/channel/volcengine/asr.go | head -150
cat -n relay/channel/volcengine/asr.go | tail -n +300 | head -50
# Check for error handling patterns in relay/channel implementations
rg -A 5 "detectAudioFormat" relay/channel/volcengine/asr.go
# Check if detectAudioFormat is called elsewhere
rg "detectAudioFormat" relay/channel/volcengine/
# Check error handling patterns in the convertASRRequest function
cat -n relay/channel/volcengine/asr.go | sed -n '87,143p'
# Look for similar error handling patterns in relay/channel implementations
rg -B 2 -A 2 "return nil, fmt.Errorf" relay/channel/volcengine/asr.go | head -30
```
Return error for unsupported audio formats instead of silently defaulting to MP3.
At line 107, detectAudioFormat() silently defaults unknown extensions to mp3, which sends incorrect metadata upstream. Doubao API only supports raw, wav, mp3, and ogg per official documentation. When users upload unsupported formats (.flac, .aiff, .wma, etc.), failing fast with an error is better than allowing the request to proceed with wrong format metadata.
The convertASRRequest() function already establishes error handling patterns. Update detectAudioFormat() to return an error for unsupported formats and check the error at line 107:
Suggested fix:

```diff
-	audioFormat := detectAudioFormat(fileHeader.Filename)
+	audioFormat, err := detectAudioFormat(fileHeader.Filename)
+	if err != nil {
+		return nil, err
+	}
@@
-func detectAudioFormat(filename string) string {
+func detectAudioFormat(filename string) (string, error) {
 	ext := strings.TrimPrefix(strings.ToLower(filepath.Ext(filename)), ".")
 	switch ext {
 	case "mp3", "wav", "ogg", "raw":
-		return ext
+		return ext, nil
 	case "pcm":
-		return "raw"
+		return "raw", nil
 	default:
-		return "mp3"
+		return "", fmt.Errorf("unsupported audio format: %s", ext)
 	}
 }
```
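A quick sanity check of the proposed helper. This is a standalone sketch (the real function lives in the adapter package alongside `convertASRRequest`); the behavior shown matches the suggested diff above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// detectAudioFormat, as proposed in the fix above: accept only the
// formats Doubao supports, normalize pcm to raw, and reject the rest.
func detectAudioFormat(filename string) (string, error) {
	ext := strings.TrimPrefix(strings.ToLower(filepath.Ext(filename)), ".")
	switch ext {
	case "mp3", "wav", "ogg", "raw":
		return ext, nil
	case "pcm":
		return "raw", nil
	default:
		return "", fmt.Errorf("unsupported audio format: %q (file %s)", ext, filename)
	}
}

func main() {
	// Mixed-case extensions are normalized; unknown ones now fail fast.
	for _, name := range []string{"a.MP3", "b.pcm", "c.flac"} {
		format, err := detectAudioFormat(name)
		fmt.Println(name, format, err)
	}
}
```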
```go
// Pass through OpenAI language parameter (ISO-639-1) to Doubao language format
if request.Language != nil {
	var lang string
	if err := common.Unmarshal(request.Language, &lang); err == nil && lang != "" {
		audio.Language = lang
	}
}
```
🧩 Analysis chain

🏁 Scripts executed (against QuantumNous/new-api):

```shell
head -150 relay/channel/volcengine/asr.go | tail -50
# Check the full context around lines 116-121
cat -n relay/channel/volcengine/asr.go | sed -n '100,140p'
# Check if there's any language mapping utility already in the codebase
rg -A 5 -B 5 "language.*map|iso.*639|locale" relay/channel/volcengine/
rg -l "language.*map|iso.*639" --type go
# Search for language mapping in relay common utilities
rg -A 3 "ISO.*639|iso.*lang|langmap|LanguageMap" relay/
# Check the volcengine channel for any hidden conversion
fd . relay/channel/volcengine --type f
rg "DoubaoASRAudio|type.*Audio" relay/channel/volcengine/asr.go -A 5
rg "func.*init|func.*Init" relay/channel/volcengine/asr.go
```

🌐 Web query: OpenAI transcription API language parameter format (ISO-639-1)

💡 Result: The `language` parameter in OpenAI's Audio Transcriptions API (used by whisper-1, gpt-4o-transcribe, and similar models) is a string in ISO-639-1 two-letter format, e.g. "en" for English (`--form 'language=en'` via curl). Specifying it improves transcription accuracy and latency by skipping auto-detection; some models also accept select ISO-639-3 codes.

Citations:
- https://developers.openai.com/api/reference/typescript/resources/audio/subresources/transcriptions/
- https://developers.openai.com/docs/api-reference/audio/createTranscription
- https://developers.openai.com/api/reference/python/resources/audio
- https://openpublicapis.com/api/whisper-openai

🌐 Web query: Volcengine Doubao ASR API language parameter format and supported languages

💡 Result: The Doubao ASR API takes a `language` string in the audio configuration object of the JSON payload. If omitted, it defaults to Chinese Mandarin (zh-CN), English (en-US), and several Chinese dialects (Shanghai, Minnan, Sichuan, Shaanxi, Cantonese). Supported languages use BCP-47 style codes:

- zh-CN (Chinese Mandarin), en-US (English), ja-JP (Japanese), id-ID (Indonesian), es-MX (Spanish)
- pt-BR (Portuguese), de-DE (German), fr-FR (French), ko-KR (Korean), fil-PH (Filipino)
- ms-MY (Malay), th-TH (Thai), ar-SA (Arabic), it-IT (Italian), bn-BD (Bengali)
- el-GR (Greek), nl-NL (Dutch), ru-RU (Russian), tr-TR (Turkish), vi-VN (Vietnamese)
- pl-PL (Polish), ro-RO (Romanian), ne-NP (Nepali), uk-UA (Ukrainian), yue-CN (Cantonese)

This applies to both streaming ASR (wss://openspeech.bytedance.com/api/v3/sauc/bigmodel) and file-based ASR (https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit). Example: `"language": "de-DE"` for German audio.

Citations:
- https://www.volcengine.com/docs/6561/1354869
- https://www.volcengine.com/docs/6561/1354868
- https://www.volcengine.com/docs/6561/1354871
Add ISO-639-1 to BCP-47 language code mapping for Volcengine Doubao.
OpenAI transcription accepts ISO-639-1 codes (en, es, de, etc.), but Volcengine Doubao ASR requires BCP-47 locale codes (en-US, es-MX, de-DE, etc.). The current passthrough at lines 116-121 sends OpenAI's two-letter codes directly to Doubao without conversion, causing upstream rejections. Implement a mapping table for common languages or reject unmapped values.
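A minimal sketch of such a mapping table. The locale choices follow Doubao's documented list above; treat the exact set (and the choice of default region per language, e.g. "en"→"en-US") as an assumption to verify against the official docs:

```go
package main

import "fmt"

// isoToBCP47 maps OpenAI-style ISO-639-1 codes to the BCP-47 locales
// documented for Doubao's file ASR API. Unmapped codes yield ok == false
// so the caller can reject them instead of forwarding an invalid value.
var isoToBCP47 = map[string]string{
	"zh": "zh-CN", "en": "en-US", "ja": "ja-JP", "ko": "ko-KR",
	"es": "es-MX", "pt": "pt-BR", "de": "de-DE", "fr": "fr-FR",
	"ru": "ru-RU", "it": "it-IT", "nl": "nl-NL", "tr": "tr-TR",
	"vi": "vi-VN", "th": "th-TH", "ar": "ar-SA", "id": "id-ID",
}

// mapLanguage converts an ISO-639-1 code to a Doubao locale, reporting
// whether the code is supported.
func mapLanguage(iso string) (string, bool) {
	locale, ok := isoToBCP47[iso]
	return locale, ok
}

func main() {
	if locale, ok := mapLanguage("de"); ok {
		fmt.Println(locale) // de-DE
	}
	if _, ok := mapLanguage("xx"); !ok {
		fmt.Println("rejected unmapped code")
	}
}
```

In the adapter this would slot into the `request.Language` block: set `audio.Language` only when `ok` is true, and return an error (or leave the field unset) otherwise.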
```go
switch responseFormat {
case "text":
	c.Data(http.StatusOK, "text/plain; charset=utf-8", []byte(resultText))
case "verbose_json":
	verboseResp := convertToVerboseJSON(result)
	c.JSON(http.StatusOK, verboseResp)
default: // "json", "srt", "vtt" fallback to json
	c.JSON(http.StatusOK, dto.AudioResponse{Text: resultText})
}
```
🧩 Analysis chain

🏁 Scripts executed (against QuantumNous/new-api):

```shell
# Locate and read the ASR implementation
find . -path "*/relay/channel/volcengine/asr.go" -type f
cat -n relay/channel/volcengine/asr.go | head -250 | tail -80
cat -n relay/channel/volcengine/asr.go | head -50
# Inspect the data structures backing the response
rg -A 20 "func convertToVerboseJSON" relay/channel/volcengine/asr.go
rg -B 5 "func pollASRResult" relay/channel/volcengine/asr.go | head -30
rg -A 30 "type DoubaoASRQueryResponse struct" relay/channel/volcengine/asr.go
rg -A 10 "type AudioRequest struct" --type go
# Check whether srt/vtt is handled anywhere else in the codebase
rg -i "srt|vtt" --type go | head -30
rg "response_format" relay/channel/ --type go -A 3 -B 1 | head -50
# Compare with the OpenAI channel's audio handling
cat relay/channel/openai/audio.go
```
Implement proper handling for srt and vtt response formats instead of silently falling back to JSON.
The API accepts response_format values including "srt" and "vtt" (as shown in the comment at line 221), but the code silently returns JSON for these formats. Callers requesting subtitle output will receive the wrong content type and payload format.
The data needed to render these formats is already available—the result.Result.Utterances array contains Text, StartTime, and EndTime for each segment (matching the structure already being used in convertToVerboseJSON). Either generate the proper subtitle formats from this data or return a 4xx status for unsupported formats.
Official documentation: https://www.volcengine.com/docs/6561/1354868
Model ID: volc.seedasr.auc
📝 Description
📸 Proof of Work
Request:
Summary by CodeRabbit
- `volc.seedasr.auc` for audio transcription tasks.