
Support Doubao (ASR) speech recognition 2.0 #4245

Open
feitianbubu wants to merge 1 commit into QuantumNous:main from
feitianbubu:pr/49fdb06b12b7df7a17d0facd5a83546a4d81034c

Conversation

@feitianbubu
Contributor

@feitianbubu feitianbubu commented Apr 14, 2026

⚠️ PR Notice

Official documentation: https://www.volcengine.com/docs/6561/1354868
Model ID: volc.seedasr.auc

📝 Description

  1. Convert the OpenAI ASR interface format to the Doubao ASR format
  2. Convert Doubao's asynchronous speech-recognition results into a synchronous interface, returned in OpenAI format

📸 Proof of Work

Request:

curl http://localhost:30001/v1/audio/transcriptions \
  --request POST \
  --header 'Content-Type: multipart/form-data' \
  --form 'file=@test.wav' \
  --form 'model=volc.seedasr.auc'

Summary by CodeRabbit

  • New Features
    • Added audio transcription support for the Volcengine channel, enabling conversion of audio files to text.
    • Added new model volc.seedasr.auc for audio transcription tasks.

@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

Walkthrough

This PR adds automatic speech recognition (ASR/transcription) support to the volcengine relay adapter by intercepting AudioTranscription requests, uploading audio files, submitting ASR jobs, polling for results, and returning transcription responses in text or verbose JSON format.

Changes

Cohort / File(s) Summary
ASR Implementation
relay/channel/volcengine/asr.go
New file implementing complete ASR flow: audio upload via multipart form, ASR job submission with polling, format detection, response serialization in text/verbose JSON modes, and token estimation based on audio duration.
Adapter Integration
relay/channel/volcengine/adaptor.go
Modified ConvertAudioRequest, GetRequestURL, SetupRequestHeader, DoRequest, and DoResponse methods to route AudioTranscription mode through new ASR handlers, including request body caching and header setup for ASR endpoints.
Model Configuration
relay/channel/volcengine/constants.go
Added new model identifier "volc.seedasr.auc" to ModelList for ASR capability registration.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Adapter as Volcengine Adapter
    participant Upload as Relay Upload Helper
    participant ASR as Volcengine ASR API
    
    Client->>Adapter: AudioTranscription Request + Audio File
    Adapter->>Upload: Upload Audio File
    Upload-->>Adapter: Audio URL + Format
    Adapter->>Adapter: Generate Request ID, Marshal Submit Payload
    Adapter->>ASR: POST /submit (Audio URL, Metadata)
    ASR-->>Adapter: Submit Response (Request ID, Status Header)
    Adapter->>Adapter: Store Request ID & Body in Context
    Adapter-->>Client: Return (nil, nil) - Defer Response
    
    loop Poll Until Terminal Status
        Adapter->>ASR: GET /query (Request ID)
        ASR-->>Adapter: Query Response (Status, Result/Message)
    end
    
    alt Success
        Adapter->>Adapter: Convert Result to Text or Verbose JSON
        Adapter-->>Client: Transcription Response + Usage
    else Timeout/Error
        Adapter-->>Client: Error Response
    end
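The polling loop in the diagram above can be sketched in Go. This is a minimal illustration, not the PR's actual code: `pollUntilDone`, the status strings, and the interval parameter are hypothetical stand-ins for the adapter's real /query call and terminal statuses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntilDone repeatedly invokes query until it reports a terminal
// status ("success" or "failed") or maxAttempts is exhausted. The query
// closure stands in for the real GET /query request to the ASR API.
func pollUntilDone(query func() (string, error), interval time.Duration, maxAttempts int) (string, error) {
	for i := 0; i < maxAttempts; i++ {
		status, err := query()
		if err != nil {
			return "", err
		}
		switch status {
		case "success":
			return status, nil
		case "failed":
			return "", errors.New("asr job failed")
		}
		time.Sleep(interval)
	}
	return "", errors.New("asr job timed out")
}

func main() {
	// Simulate a job that reports "processing" twice before succeeding.
	attempts := 0
	status, err := pollUntilDone(func() (string, error) {
		attempts++
		if attempts < 3 {
			return "processing", nil
		}
		return "success", nil
	}, time.Millisecond, 10)
	fmt.Println(status, err)
}
```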

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • creamlike1024

Poem

🐰 Hop hop, transcription's here,
Audio floats through the air so clear,
Poll and wait, the words take shape,
From speech to text, there's no escape!
Volcengine's ears now hear so well, 🎤✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '支持豆包(ASR)语音识别2.0' clearly and specifically describes the main change: adding support for Doubao ASR (speech recognition) 2.0 functionality, which aligns with the changeset implementing ASR request/response conversion for the volcengine channel.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@relay/channel/volcengine/asr.go`:
- Around line 116-121: The code currently passes OpenAI ISO-639-1 codes straight
through (request.Language → audio.Language) which Doubao rejects; modify the
block that unmarshals request.Language (the common.Unmarshal usage) to map
two-letter ISO codes to BCP-47 locales using a small lookup table for common
languages (e.g., "en"→"en-US", "es"→"es-MX", "de"→"de-DE"), set audio.Language
to the mapped value, and return an error or leave audio.Language unset when the
ISO code is not in the map (i.e., reject unmapped values) so Doubao receives
only valid BCP-47 codes.
- Around line 215-223: The switch handling responseFormat must properly support
"srt" and "vtt" instead of defaulting to JSON: detect when responseFormat ==
"srt" or "vtt", iterate over result.Result.Utterances (which contain Text,
StartTime, EndTime) and render subtitle output (SRT with numeric cue indices and
"HH:MM:SS,mmm" timestamps; VTT with "WEBVTT" header and "HH:MM:SS.mmm"
timestamps), set the appropriate Content-Type ("text/plain; charset=utf-8" for
both or "text/vtt" for vtt), and return the generated bytes via c.Data; if you
prefer not to implement rendering, explicitly return a 400-level error (e.g.,
c.JSON with http.StatusBadRequest) for unsupported formats. Keep
convertToVerboseJSON and the JSON/default branch intact for "text" and "json"
respectively, and reference result.Result.Utterances when building the subtitle
output.
- Line 107: detectAudioFormat currently defaults unknown extensions to "mp3",
causing incorrect metadata; change detectAudioFormat to return (string, error)
and return an error for any extension not in the allowed set
{"raw","wav","mp3","ogg"}, then update the call in convertASRRequest (where
audioFormat := detectAudioFormat(...)) to receive (audioFormat, err) and
propagate the error (return it) instead of proceeding; ensure error text clearly
states "unsupported audio format" and includes the original filename/extension.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: add10d9a-f8d5-4503-b6b6-4d0db12b6a8b

📥 Commits

Reviewing files that changed from the base of the PR and between 8c8661d and e4734eb.

📒 Files selected for processing (3)
  • relay/channel/volcengine/adaptor.go
  • relay/channel/volcengine/asr.go
  • relay/channel/volcengine/constants.go

return nil, fmt.Errorf("failed to upload audio file: %w", err)
}

audioFormat := detectAudioFormat(fileHeader.Filename)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n relay/channel/volcengine/asr.go | head -150

Repository: QuantumNous/new-api

Length of output: 5094


🏁 Script executed:

cat -n relay/channel/volcengine/asr.go | tail -n +300 | head -50

Repository: QuantumNous/new-api

Length of output: 1726


🏁 Script executed:

# Check for error handling patterns in relay/channel implementations
rg -A 5 "detectAudioFormat" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 420


🏁 Script executed:

# Check if detectAudioFormat is called elsewhere
rg "detectAudioFormat" relay/channel/volcengine/

Repository: QuantumNous/new-api

Length of output: 230


🏁 Script executed:

# Check error handling patterns in the convertASRRequest function
cat -n relay/channel/volcengine/asr.go | sed -n '87,143p'

Repository: QuantumNous/new-api

Length of output: 2064


🏁 Script executed:

# Look for similar error handling patterns in relay/channel implementations
rg -B 2 -A 2 "return nil, fmt.Errorf" relay/channel/volcengine/asr.go | head -30

Repository: QuantumNous/new-api

Length of output: 788


Return error for unsupported audio formats instead of silently defaulting to MP3.

At line 107, detectAudioFormat() silently defaults unknown extensions to mp3, which sends incorrect metadata upstream. Doubao API only supports raw, wav, mp3, and ogg per official documentation. When users upload unsupported formats (.flac, .aiff, .wma, etc.), failing fast with an error is better than allowing the request to proceed with wrong format metadata.

The convertASRRequest() function already establishes error handling patterns. Update detectAudioFormat() to return an error for unsupported formats and check the error at line 107:

Suggested fix
-	audioFormat := detectAudioFormat(fileHeader.Filename)
+	audioFormat, err := detectAudioFormat(fileHeader.Filename)
+	if err != nil {
+		return nil, err
+	}
@@
-func detectAudioFormat(filename string) string {
+func detectAudioFormat(filename string) (string, error) {
 	ext := strings.TrimPrefix(strings.ToLower(filepath.Ext(filename)), ".")
 	switch ext {
 	case "mp3", "wav", "ogg", "raw":
-		return ext
+		return ext, nil
 	case "pcm":
-		return "raw"
+		return "raw", nil
 	default:
-		return "mp3"
+		return "", fmt.Errorf("unsupported audio format: %s", ext)
 	}
 }

Comment on lines +116 to +121
// Pass through OpenAI language parameter (ISO-639-1) to Doubao language format
if request.Language != nil {
var lang string
if err := common.Unmarshal(request.Language, &lang); err == nil && lang != "" {
audio.Language = lang
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -150 relay/channel/volcengine/asr.go | tail -50

Repository: QuantumNous/new-api

Length of output: 1372


🏁 Script executed:

# Check the full context around lines 116-121
cat -n relay/channel/volcengine/asr.go | sed -n '100,140p'

Repository: QuantumNous/new-api

Length of output: 1352


🏁 Script executed:

# Check if there's any language mapping utility already in the codebase
rg -A 5 -B 5 "language.*map|iso.*639|locale" relay/channel/volcengine/

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Check common utilities for any language mapping
rg -l "language.*map|iso.*639" --type go

Repository: QuantumNous/new-api

Length of output: 45


🌐 Web query:

OpenAI transcription API language parameter format ISO-639-1

💡 Result:

The language parameter in OpenAI's Audio Transcriptions API (used for Whisper and other models like gpt-4o-transcribe) is a string in ISO-639-1 two-letter code format. For example, "en" for English. To use it in a request (e.g., via curl): --form 'language=en' This improves transcription accuracy and latency by specifying the input audio language, avoiding auto-detection. Supported for models including whisper-1, gpt-4o-transcribe, etc. Some models also accept select ISO-639-3 codes.

Citations:


🌐 Web query:

Volcengine Doubao ASR API language parameter format supported languages

💡 Result:

The Volcengine Doubao ASR API uses a "language" string parameter in the audio configuration object of the request payload (JSON format). If omitted, it defaults to supporting Chinese (Mandarin zh-CN), English (en-US), and Chinese dialects like Shanghai dialect, Minnan, Sichuan, Shaanxi, and Cantonese. Supported languages (BCP-47 style codes): - zh-CN (Chinese Mandarin) - en-US (English) - ja-JP (Japanese) - id-ID (Indonesian) - es-MX (Spanish) - pt-BR (Portuguese) - de-DE (German) - fr-FR (French) - ko-KR (Korean) - fil-PH (Filipino) - ms-MY (Malay) - th-TH (Thai) - ar-SA (Arabic) - it-IT (Italian) - bn-BD (Bengali) - el-GR (Greek) - nl-NL (Dutch) - ru-RU (Russian) - tr-TR (Turkish) - vi-VN (Vietnamese) - pl-PL (Polish) - ro-RO (Romanian) - ne-NP (Nepali) - uk-UA (Ukrainian) - yue-CN (Cantonese) This applies to both streaming ASR (wss://openspeech.bytedance.com/api/v3/sauc/bigmodel) and file-based ASR (https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit). Example: "language": "de-DE" for German audio.

Citations:


🏁 Script executed:

# Search for language mapping in relay common utilities
rg -A 3 "ISO.*639|iso.*lang|langmap|LanguageMap" relay/

Repository: QuantumNous/new-api

Length of output: 401


🏁 Script executed:

# Check if there's a mapping in volcengine channel directory
fd . relay/channel/volcengine --type f

Repository: QuantumNous/new-api

Length of output: 238


🏁 Script executed:

# Check the struct definitions to ensure there's no hidden conversion
rg "DoubaoASRAudio|type.*Audio" relay/channel/volcengine/asr.go -A 5

Repository: QuantumNous/new-api

Length of output: 1037


🏁 Script executed:

# See if there's any conversion logic in the initialize or constructor
rg "func.*init|func.*Init" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 45


Add ISO-639-1 to BCP-47 language code mapping for Volcengine Doubao.

OpenAI transcription accepts ISO-639-1 codes (en, es, de, etc.), but Volcengine Doubao ASR requires BCP-47 locale codes (en-US, es-MX, de-DE, etc.). The current passthrough at lines 116-121 sends OpenAI's two-letter codes directly to Doubao without conversion, causing upstream rejections. Implement a mapping table for common languages or reject unmapped values.
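The suggested mapping table could look like the following Go sketch. The specific default locales (en→en-US, es→es-MX, …) follow the reviewer's suggestion and the web-query results above; they are assumptions that would need confirming against the Volcengine documentation, and `mapLanguage` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// isoToBCP47 maps OpenAI-style ISO-639-1 codes to the BCP-47 locales
// Doubao ASR expects. Only a handful of common languages are shown.
var isoToBCP47 = map[string]string{
	"zh": "zh-CN",
	"en": "en-US",
	"ja": "ja-JP",
	"es": "es-MX",
	"de": "de-DE",
	"fr": "fr-FR",
	"ko": "ko-KR",
	"ru": "ru-RU",
}

// mapLanguage returns the BCP-47 locale for a two-letter code, or an
// error for unmapped values so Doubao never receives an invalid code.
func mapLanguage(iso string) (string, error) {
	if locale, ok := isoToBCP47[strings.ToLower(iso)]; ok {
		return locale, nil
	}
	return "", fmt.Errorf("unsupported language code: %q", iso)
}

func main() {
	locale, err := mapLanguage("de")
	fmt.Println(locale, err)
}
```

Rejecting unmapped codes (rather than passing them through) keeps the failure on the relay side with a clear message instead of an opaque upstream error.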


Comment on lines +215 to +223
switch responseFormat {
case "text":
c.Data(http.StatusOK, "text/plain; charset=utf-8", []byte(resultText))
case "verbose_json":
verboseResp := convertToVerboseJSON(result)
c.JSON(http.StatusOK, verboseResp)
default: // "json", "srt", "vtt" fallback to json
c.JSON(http.StatusOK, dto.AudioResponse{Text: resultText})
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the volcengine/asr.go file to understand the full context
find . -path "*/relay/channel/volcengine/asr.go" -type f

Repository: QuantumNous/new-api

Length of output: 96


🏁 Script executed:

# Let's examine the file structure
ls -la relay/channel/volcengine/ 2>/dev/null || echo "Path not found, searching..."
find . -name "asr.go" -path "*/volcengine*"

Repository: QuantumNous/new-api

Length of output: 518


🏁 Script executed:

# Search for the asr.go file more broadly
fd -type f -name "*.go" | grep -i volcengine

Repository: QuantumNous/new-api

Length of output: 233


🏁 Script executed:

# Read the asr.go file to see the context around lines 215-223
cat -n relay/channel/volcengine/asr.go | head -250 | tail -80

Repository: QuantumNous/new-api

Length of output: 3030


🏁 Script executed:

# Let's see what the result struct contains
cat -n relay/channel/volcengine/asr.go | head -50

Repository: QuantumNous/new-api

Length of output: 1583


🏁 Script executed:

# Search for convertToVerboseJSON and see what data structure we're working with
rg -A 20 "func convertToVerboseJSON" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 650


🏁 Script executed:

# Look for the result type definition
rg -B 5 "func pollASRResult" relay/channel/volcengine/asr.go | head -30

Repository: QuantumNous/new-api

Length of output: 214


🏁 Script executed:

# Find the DoubaoASRQueryResponse struct definition
rg -A 30 "type DoubaoASRQueryResponse struct" relay/channel/volcengine/asr.go

Repository: QuantumNous/new-api

Length of output: 1049


🏁 Script executed:

# Check the dto.AudioRequest to see if srt/vtt are valid formats
rg -A 10 "type AudioRequest struct" --type go

Repository: QuantumNous/new-api

Length of output: 771


🏁 Script executed:

# Search for any existing srt or vtt generation in the codebase
rg -i "\.srt|\.vtt|srt|vtt" --type go | head -30

Repository: QuantumNous/new-api

Length of output: 145


🏁 Script executed:

# Let's check if other audio/transcription channels handle srt/vtt properly
rg -l "response_format" relay/channel/**/*.go | head -10

Repository: QuantumNous/new-api

Length of output: 277


🏁 Script executed:

# Check what response_format values are documented or validated elsewhere
rg -B 3 -A 3 "response_format.*srt|response_format.*vtt" --type go

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Check the volcengine adaptor for any response_format handling or documentation
rg -B 5 -A 5 "response_format" relay/channel/volcengine/adaptor.go

Repository: QuantumNous/new-api

Length of output: 224


🏁 Script executed:

# Check if OpenAI or other channels handle srt/vtt response formats
rg -B 5 -A 5 "srt|vtt" relay/channel/openai/ 2>/dev/null | head -40

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Look for OpenAI audio transcription handling to see how they handle response formats
find relay/channel -name "*.go" -path "*/openai/*" | xargs grep -l "response_format" 2>/dev/null

Repository: QuantumNous/new-api

Length of output: 47


🏁 Script executed:

# Search for how other channels handle audio response formats
rg "response_format" relay/channel/ --type go -A 3 -B 1 | grep -A 3 -B 1 "srt\|vtt" | head -50

Repository: QuantumNous/new-api

Length of output: 45


🏁 Script executed:

# Check OpenAI relay code to see how they handle audio/transcription
fd -path "*/relay/channel/openai/*" -name "*.go" | xargs grep -l "audio\|transcription" 2>/dev/null

Repository: QuantumNous/new-api

Length of output: 233


🏁 Script executed:

# Look at the actual OpenAI transcription implementation if it exists
find relay/channel/openai -name "*.go" -exec grep -l "Audio" {} \;

Repository: QuantumNous/new-api

Length of output: 204


🏁 Script executed:

# Check the OpenAI audio.go to see how they handle response_format
cat relay/channel/openai/audio.go

Repository: QuantumNous/new-api

Length of output: 5080


Implement proper handling for srt and vtt response formats instead of silently falling back to JSON.

The API accepts response_format values including "srt" and "vtt" (as shown in the comment at line 221), but the code silently returns JSON for these formats. Callers requesting subtitle output will receive the wrong content type and payload format.

The data needed to render these formats is already available—the result.Result.Utterances array contains Text, StartTime, and EndTime for each segment (matching the structure already being used in convertToVerboseJSON). Either generate the proper subtitle formats from this data or return a 4xx status for unsupported formats.
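A rendering helper along these lines would cover the SRT case. This is a sketch under the assumptions stated in the review: the `utterance` struct mirrors the Text/StartTime/EndTime fields the comment attributes to result.Result.Utterances (with times in milliseconds), and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// utterance mirrors the fields the review describes on
// result.Result.Utterances; times are assumed to be milliseconds.
type utterance struct {
	Text      string
	StartTime int64
	EndTime   int64
}

// srtTimestamp renders milliseconds in the SRT "HH:MM:SS,mmm" form.
// (VTT differs only in using "." before the millisecond part and a
// leading "WEBVTT" header.)
func srtTimestamp(ms int64) string {
	h := ms / 3600000
	m := ms % 3600000 / 60000
	s := ms % 60000 / 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

// renderSRT builds an SRT document: numeric cue index, timestamp range,
// cue text, blank-line separator.
func renderSRT(utts []utterance) string {
	var b strings.Builder
	for i, u := range utts {
		fmt.Fprintf(&b, "%d\n%s --> %s\n%s\n\n",
			i+1, srtTimestamp(u.StartTime), srtTimestamp(u.EndTime), u.Text)
	}
	return b.String()
}

func main() {
	fmt.Print(renderSRT([]utterance{
		{Text: "hello", StartTime: 0, EndTime: 1500},
		{Text: "world", StartTime: 1500, EndTime: 3000},
	}))
}
```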

