
Support Doubao (ASR) speech recognition 2.0 #4245

Open
feitianbubu wants to merge 1 commit into QuantumNous:main from
feitianbubu:pr/49fdb06b12b7df7a17d0facd5a83546a4d81034c

Conversation

@feitianbubu
Contributor

@feitianbubu feitianbubu commented Apr 14, 2026

⚠️ PR Notice

Official documentation: https://www.volcengine.com/docs/6561/1354868
Model ID: volc.seedasr.auc

📝 Description

  1. Convert the OpenAI ASR interface format to the Doubao ASR format
  2. Convert Doubao's asynchronous speech-recognition results into a synchronous interface, returned in OpenAI format

📸 Proof of Work

Request:

curl http://localhost:30001/v1/audio/transcriptions \
  --request POST \
  --header 'Content-Type: multipart/form-data' \
  --form 'file=@test.wav' \
  --form 'model=volc.seedasr.auc'

Summary by CodeRabbit

  • New Features
    • Added audio transcription support for the Volcengine channel, enabling conversion of audio files to text.
    • Added new model volc.seedasr.auc for audio transcription tasks.

@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

Walkthrough

This PR adds automatic speech recognition (ASR/transcription) support to the volcengine relay adapter by intercepting AudioTranscription requests, uploading audio files, submitting ASR jobs, polling for results, and returning transcription responses in text or verbose JSON format.

Changes

Cohort / File(s) Summary
ASR Implementation
relay/channel/volcengine/asr.go
New file implementing complete ASR flow: audio upload via multipart form, ASR job submission with polling, format detection, response serialization in text/verbose JSON modes, and token estimation based on audio duration.
Adapter Integration
relay/channel/volcengine/adaptor.go
Modified ConvertAudioRequest, GetRequestURL, SetupRequestHeader, DoRequest, and DoResponse methods to route AudioTranscription mode through new ASR handlers, including request body caching and header setup for ASR endpoints.
Model Configuration
relay/channel/volcengine/constants.go
Added new model identifier "volc.seedasr.auc" to ModelList for ASR capability registration.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Adapter as Volcengine Adapter
    participant Upload as Relay Upload Helper
    participant ASR as Volcengine ASR API
    
    Client->>Adapter: AudioTranscription Request + Audio File
    Adapter->>Upload: Upload Audio File
    Upload-->>Adapter: Audio URL + Format
    Adapter->>Adapter: Generate Request ID, Marshal Submit Payload
    Adapter->>ASR: POST /submit (Audio URL, Metadata)
    ASR-->>Adapter: Submit Response (Request ID, Status Header)
    Adapter->>Adapter: Store Request ID & Body in Context
    Adapter-->>Client: Return (nil, nil) - Defer Response
    
    loop Poll Until Terminal Status
        Adapter->>ASR: GET /query (Request ID)
        ASR-->>Adapter: Query Response (Status, Result/Message)
    end
    
    alt Success
        Adapter->>Adapter: Convert Result to Text or Verbose JSON
        Adapter-->>Client: Transcription Response + Usage
    else Timeout/Error
        Adapter-->>Client: Error Response
    end
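The polling loop in the diagram above can be sketched in Go. This is a minimal illustration, not the PR's actual code: `pollUntilDone`, the status strings, and the interval parameter are hypothetical stand-ins for the adapter's real /query call and terminal statuses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntilDone repeatedly invokes query until it reports a terminal
// status ("success" or "failed") or maxAttempts is exhausted. The query
// closure stands in for the real GET /query request to the ASR API.
func pollUntilDone(query func() (string, error), interval time.Duration, maxAttempts int) (string, error) {
	for i := 0; i < maxAttempts; i++ {
		status, err := query()
		if err != nil {
			return "", err
		}
		switch status {
		case "success":
			return status, nil
		case "failed":
			return "", errors.New("asr job failed")
		}
		time.Sleep(interval)
	}
	return "", errors.New("asr job timed out")
}

func main() {
	// Simulate a job that reports "processing" twice before succeeding.
	attempts := 0
	status, err := pollUntilDone(func() (string, error) {
		attempts++
		if attempts < 3 {
			return "processing", nil
		}
		return "success", nil
	}, time.Millisecond, 10)
	fmt.Println(status, err)
}
```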

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • creamlike1024

Poem

🐰 Hop hop, transcription's here,
Audio floats through the air so clear,
Poll and wait, the words take shape,
From speech to text, there's no escape!
Volcengine's ears now hear so well, 🎤✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '支持豆包(ASR)语音识别2.0' clearly and specifically describes the main change: adding support for Doubao ASR (speech recognition) 2.0 functionality, which aligns with the changeset implementing ASR request/response conversion for the volcengine channel.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@relay/channel/volcengine/asr.go`:
- Around line 116-121: The code currently passes OpenAI ISO-639-1 codes straight
through (request.Language → audio.Language) which Doubao rejects; modify the
block that unmarshals request.Language (the common.Unmarshal usage) to map
two-letter ISO codes to BCP-47 locales using a small lookup table for common
languages (e.g., "en"→"en-US", "es"→"es-MX", "de"→"de-DE"), set audio.Language
to the mapped value, and return an error or leave audio.Language unset when the
ISO code is not in the map (i.e., reject unmapped values) so Doubao receives
only valid BCP-47 codes.
- Around line 215-223: The switch handling responseFormat must properly support
"srt" and "vtt" instead of defaulting to JSON: detect when responseFormat ==
"srt" or "vtt", iterate over result.Result.Utterances (which contain Text,
StartTime, EndTime) and render subtitle output (SRT with numeric cue indices and
"HH:MM:SS,mmm" timestamps; VTT with "WEBVTT" header and "HH:MM:SS.mmm"
timestamps), set the appropriate Content-Type ("text/plain; charset=utf-8" for
both or "text/vtt" for vtt), and return the generated bytes via c.Data; if you
prefer not to implement rendering, explicitly return a 400-level error (e.g.,
c.JSON with http.StatusBadRequest) for unsupported formats. Keep
convertToVerboseJSON and the JSON/default branch intact for "text" and "json"
respectively, and reference result.Result.Utterances when building the subtitle
output.
- Line 107: detectAudioFormat currently defaults unknown extensions to "mp3",
causing incorrect metadata; change detectAudioFormat to return (string, error)
and return an error for any extension not in the allowed set
{"raw","wav","mp3","ogg"}, then update the call in convertASRRequest (where
audioFormat := detectAudioFormat(...)) to receive (audioFormat, err) and
propagate the error (return it) instead of proceeding; ensure error text clearly
states "unsupported audio format" and includes the original filename/extension.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: add10d9a-f8d5-4503-b6b6-4d0db12b6a8b

📥 Commits

Reviewing files that changed from the base of the PR and between 8c8661d and e4734eb.

📒 Files selected for processing (3)
  • relay/channel/volcengine/adaptor.go
  • relay/channel/volcengine/asr.go
  • relay/channel/volcengine/constants.go

return nil, fmt.Errorf("failed to upload audio file: %w", err)
}

audioFormat := detectAudioFormat(fileHeader.Filename)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n relay/channel/volcengine/asr.go | head -150

Repository: QuantumNous/new-api

Length of output: 5094


🏁 Script executed:

cat -n relay/channel/volcengine/asr.go | tail -n +300 | head -50

Repository: QuantumNous/new-api

Length of output: 1726


🏁 Script executed:

# Check for error handling patterns in relay/channel implementations
rg -A 5 "detectAudioFormat" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 420


🏁 Script executed:

# Check if detectAudioFormat is called elsewhere
rg "detectAudioFormat" relay/channel/volcengine/

Repository: QuantumNous/new-api

Length of output: 230


🏁 Script executed:

# Check error handling patterns in the convertASRRequest function
cat -n relay/channel/volcengine/asr.go | sed -n '87,143p'

Repository: QuantumNous/new-api

Length of output: 2064


🏁 Script executed:

# Look for similar error handling patterns in relay/channel implementations
rg -B 2 -A 2 "return nil, fmt.Errorf" relay/channel/volcengine/asr.go | head -30

Repository: QuantumNous/new-api

Length of output: 788


Return error for unsupported audio formats instead of silently defaulting to MP3.

At line 107, detectAudioFormat() silently defaults unknown extensions to mp3, which sends incorrect metadata upstream. Doubao API only supports raw, wav, mp3, and ogg per official documentation. When users upload unsupported formats (.flac, .aiff, .wma, etc.), failing fast with an error is better than allowing the request to proceed with wrong format metadata.

The convertASRRequest() function already establishes error handling patterns. Update detectAudioFormat() to return an error for unsupported formats and check the error at line 107:

Suggested fix
-	audioFormat := detectAudioFormat(fileHeader.Filename)
+	audioFormat, err := detectAudioFormat(fileHeader.Filename)
+	if err != nil {
+		return nil, err
+	}
@@
-func detectAudioFormat(filename string) string {
+func detectAudioFormat(filename string) (string, error) {
 	ext := strings.TrimPrefix(strings.ToLower(filepath.Ext(filename)), ".")
 	switch ext {
 	case "mp3", "wav", "ogg", "raw":
-		return ext
+		return ext, nil
 	case "pcm":
-		return "raw"
+		return "raw", nil
 	default:
-		return "mp3"
+		return "", fmt.Errorf("unsupported audio format: %s", ext)
 	}
 }

Comment on lines +116 to +121
// Pass through OpenAI language parameter (ISO-639-1) to Doubao language format
if request.Language != nil {
var lang string
if err := common.Unmarshal(request.Language, &lang); err == nil && lang != "" {
audio.Language = lang
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -150 relay/channel/volcengine/asr.go | tail -50

Repository: QuantumNous/new-api

Length of output: 1372


🏁 Script executed:

# Check the full context around lines 116-121
cat -n relay/channel/volcengine/asr.go | sed -n '100,140p'

Repository: QuantumNous/new-api

Length of output: 1352


🏁 Script executed:

# Check if there's any language mapping utility already in the codebase
rg -A 5 -B 5 "language.*map|iso.*639|locale" relay/channel/volcengine/

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Check common utilities for any language mapping
rg -l "language.*map|iso.*639" --type go

Repository: QuantumNous/new-api

Length of output: 45


🌐 Web query:

OpenAI transcription API language parameter format ISO-639-1

💡 Result:

The language parameter in OpenAI's Audio Transcriptions API (used for Whisper and other models like gpt-4o-transcribe) is a string in ISO-639-1 two-letter code format. For example, "en" for English. To use it in a request (e.g., via curl): --form 'language=en' This improves transcription accuracy and latency by specifying the input audio language, avoiding auto-detection. Supported for models including whisper-1, gpt-4o-transcribe, etc. Some models also accept select ISO-639-3 codes.

Citations:


🌐 Web query:

Volcengine Doubao ASR API language parameter format supported languages

💡 Result:

The Volcengine Doubao ASR API uses a "language" string parameter in the audio configuration object of the request payload (JSON format). If omitted, it defaults to supporting Chinese (Mandarin zh-CN), English (en-US), and Chinese dialects like Shanghai dialect, Minnan, Sichuan, Shaanxi, and Cantonese. Supported languages (BCP-47 style codes): - zh-CN (Chinese Mandarin) - en-US (English) - ja-JP (Japanese) - id-ID (Indonesian) - es-MX (Spanish) - pt-BR (Portuguese) - de-DE (German) - fr-FR (French) - ko-KR (Korean) - fil-PH (Filipino) - ms-MY (Malay) - th-TH (Thai) - ar-SA (Arabic) - it-IT (Italian) - bn-BD (Bengali) - el-GR (Greek) - nl-NL (Dutch) - ru-RU (Russian) - tr-TR (Turkish) - vi-VN (Vietnamese) - pl-PL (Polish) - ro-RO (Romanian) - ne-NP (Nepali) - uk-UA (Ukrainian) - yue-CN (Cantonese) This applies to both streaming ASR (wss://openspeech.bytedance.com/api/v3/sauc/bigmodel) and file-based ASR (https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit). Example: "language": "de-DE" for German audio.

Citations:


🏁 Script executed:

# Search for language mapping in relay common utilities
rg -A 3 "ISO.*639|iso.*lang|langmap|LanguageMap" relay/

Repository: QuantumNous/new-api

Length of output: 401


🏁 Script executed:

# Check if there's a mapping in volcengine channel directory
fd . relay/channel/volcengine --type f

Repository: QuantumNous/new-api

Length of output: 238


🏁 Script executed:

# Check the struct definitions to ensure there's no hidden conversion
rg "DoubaoASRAudio|type.*Audio" relay/channel/volcengine/asr.go -A 5

Repository: QuantumNous/new-api

Length of output: 1037


🏁 Script executed:

# See if there's any conversion logic in the initialize or constructor
rg "func.*init|func.*Init" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 45


Add ISO-639-1 to BCP-47 language code mapping for Volcengine Doubao.

OpenAI transcription accepts ISO-639-1 codes (en, es, de, etc.), but Volcengine Doubao ASR requires BCP-47 locale codes (en-US, es-MX, de-DE, etc.). The current passthrough at lines 116-121 sends OpenAI's two-letter codes directly to Doubao without conversion, causing upstream rejections. Implement a mapping table for common languages or reject unmapped values.
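The suggested mapping table could look like the following Go sketch. The specific default locales (en→en-US, es→es-MX, …) follow the reviewer's suggestion and the web-query results above; they are assumptions that would need confirming against the Volcengine documentation, and `mapLanguage` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// isoToBCP47 maps OpenAI-style ISO-639-1 codes to the BCP-47 locales
// Doubao ASR expects. Only a handful of common languages are shown.
var isoToBCP47 = map[string]string{
	"zh": "zh-CN",
	"en": "en-US",
	"ja": "ja-JP",
	"es": "es-MX",
	"de": "de-DE",
	"fr": "fr-FR",
	"ko": "ko-KR",
	"ru": "ru-RU",
}

// mapLanguage returns the BCP-47 locale for a two-letter code, or an
// error for unmapped values so Doubao never receives an invalid code.
func mapLanguage(iso string) (string, error) {
	if locale, ok := isoToBCP47[strings.ToLower(iso)]; ok {
		return locale, nil
	}
	return "", fmt.Errorf("unsupported language code: %q", iso)
}

func main() {
	locale, err := mapLanguage("de")
	fmt.Println(locale, err)
}
```

Rejecting unmapped codes (rather than passing them through) keeps the failure on the relay side with a clear message instead of an opaque upstream error.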


Comment on lines +215 to +223
switch responseFormat {
case "text":
c.Data(http.StatusOK, "text/plain; charset=utf-8", []byte(resultText))
case "verbose_json":
verboseResp := convertToVerboseJSON(result)
c.JSON(http.StatusOK, verboseResp)
default: // "json", "srt", "vtt" fallback to json
c.JSON(http.StatusOK, dto.AudioResponse{Text: resultText})
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the volcengine/asr.go file to understand the full context
find . -path "*/relay/channel/volcengine/asr.go" -type f

Repository: QuantumNous/new-api

Length of output: 96


🏁 Script executed:

# Let's examine the file structure
ls -la relay/channel/volcengine/ 2>/dev/null || echo "Path not found, searching..."
find . -name "asr.go" -path "*/volcengine*"

Repository: QuantumNous/new-api

Length of output: 518


🏁 Script executed:

# Search for the asr.go file more broadly
fd -type f -name "*.go" | grep -i volcengine

Repository: QuantumNous/new-api

Length of output: 233


🏁 Script executed:

# Read the asr.go file to see the context around lines 215-223
cat -n relay/channel/volcengine/asr.go | head -250 | tail -80

Repository: QuantumNous/new-api

Length of output: 3030


🏁 Script executed:

# Let's see what the result struct contains
cat -n relay/channel/volcengine/asr.go | head -50

Repository: QuantumNous/new-api

Length of output: 1583


🏁 Script executed:

# Search for convertToVerboseJSON and see what data structure we're working with
rg -A 20 "func convertToVerboseJSON" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 650


🏁 Script executed:

# Look for the result type definition
rg -B 5 "func pollASRResult" relay/channel/volcengine/asr.go | head -30

Repository: QuantumNous/new-api

Length of output: 214


🏁 Script executed:

# Find the DoubaoASRQueryResponse struct definition
rg -A 30 "type DoubaoASRQueryResponse struct" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 1049


🏁 Script executed:

# Check the dto.AudioRequest to see if srt/vtt are valid formats
rg -A 10 "type AudioRequest struct" --type go

Repository: QuantumNous/new-api

Length of output: 771


🏁 Script executed:

# Search for any existing srt or vtt generation in the codebase
rg -i "\.srt|\.vtt|srt|vtt" --type go | head -30

Repository: QuantumNous/new-api

Length of output: 145


🏁 Script executed:

# Let's check if other audio/transcription channels handle srt/vtt properly
rg -l "response_format" relay/channel/**/*.go | head -10

Repository: QuantumNous/new-api

Length of output: 277


🏁 Script executed:

# Check what response_format values are documented or validated elsewhere
rg -B 3 -A 3 "response_format.*srt|response_format.*vtt" --type go

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Check the volcengine adaptor for any response_format handling or documentation
rg -B 5 -A 5 "response_format" relay/channel/volcengine/adaptor.go

Repository: QuantumNous/new-api

Length of output: 224


🏁 Script executed:

# Check if OpenAI or other channels handle srt/vtt response formats
rg -B 5 -A 5 "srt|vtt" relay/channel/openai/ 2>/dev/null | head -40

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Look for OpenAI audio transcription handling to see how they handle response formats
find relay/channel -name "*.go" -path "*/openai/*" | xargs grep -l "response_format" 2>/dev/null

Repository: QuantumNous/new-api

Length of output: 47


🏁 Script executed:

# Search for how other channels handle audio response formats
rg "response_format" relay/channel/ --type go -A 3 -B 1 | grep -A 3 -B 1 "srt\|vtt" | head -50

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Check OpenAI relay code to see how they handle audio/transcription
fd -path "*/relay/channel/openai/*" -name "*.go" | xargs grep -l "audio\|transcription" 2>/dev/null

Repository: QuantumNous/new-api

Length of output: 233


🏁 Script executed:

# Look at the actual OpenAI transcription implementation if it exists
find relay/channel/openai -name "*.go" -exec grep -l "Audio" {} \;

Repository: QuantumNous/new-api

Length of output: 204


🏁 Script executed:

# Check the OpenAI audio.go to see how they handle response_format
cat relay/channel/openai/audio.go

Repository: QuantumNous/new-api

Length of output: 5080


Implement proper handling for srt and vtt response formats instead of silently falling back to JSON.

The API accepts response_format values including "srt" and "vtt" (as shown in the comment at line 221), but the code silently returns JSON for these formats. Callers requesting subtitle output will receive the wrong content type and payload format.

The data needed to render these formats is already available—the result.Result.Utterances array contains Text, StartTime, and EndTime for each segment (matching the structure already being used in convertToVerboseJSON). Either generate the proper subtitle formats from this data or return a 4xx status for unsupported formats.
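A rendering helper along these lines would cover the SRT case. This is a sketch under the assumptions stated in the review: the `utterance` struct mirrors the Text/StartTime/EndTime fields the comment attributes to result.Result.Utterances (with times in milliseconds), and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// utterance mirrors the fields the review describes on
// result.Result.Utterances; times are assumed to be milliseconds.
type utterance struct {
	Text      string
	StartTime int64
	EndTime   int64
}

// srtTimestamp renders milliseconds in the SRT "HH:MM:SS,mmm" form.
// (VTT differs only in using "." before the millisecond part and a
// leading "WEBVTT" header.)
func srtTimestamp(ms int64) string {
	h := ms / 3600000
	m := ms % 3600000 / 60000
	s := ms % 60000 / 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

// renderSRT builds an SRT document: numeric cue index, timestamp range,
// cue text, blank-line separator.
func renderSRT(utts []utterance) string {
	var b strings.Builder
	for i, u := range utts {
		fmt.Fprintf(&b, "%d\n%s --> %s\n%s\n\n",
			i+1, srtTimestamp(u.StartTime), srtTimestamp(u.EndTime), u.Text)
	}
	return b.String()
}

func main() {
	fmt.Print(renderSRT([]utterance{
		{Text: "hello", StartTime: 0, EndTime: 1500},
		{Text: "world", StartTime: 1500, EndTime: 3000},
	}))
}
```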

