Support Virtual-GenAI monitoring #13745

Open
peachisai wants to merge 6 commits into apache:master from peachisai:Support-GenAI-monitoring

@peachisai
Contributor

@peachisai peachisai commented Mar 16, 2026

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.
[Screenshots attached]

@wu-sheng wu-sheng added the backend OAP backend related. label Mar 16, 2026
@wu-sheng wu-sheng added this to the 10.4.0 milestone Mar 16, 2026
@wu-sheng
Member

Could you show a screenshot with at least 3-4 minutes of data? A single dot covering only one minute is not very clear.

Member

@wu-sheng wu-sheng left a comment

Review: Support Virtual-GenAI monitoring

Critical Issues

1. Config file exclude mismatch in server-starter/pom.xml
Exclude says gen-ai-settings.yml but actual file is gen-ai-config.yml.

2. requiredModules() returns empty array
GenAIAnalyzerModuleProvider uses CoreModule in start() but doesn't declare it in requiredModules(). Should return new String[] { CoreModule.NAME }.

3. Module naming convention violation
Existing analyzer modules use lowercase-hyphenated: agent-analyzer, log-analyzer, meter-analyzer. New module genAI-analyzer should be gen-ai-analyzer.

4. Package should be org.apache.skywalking.oap.server.analyzer.genai
Currently uses org.apache.skywalking.oap.meter.analyzer.* which collides with the existing meter-analyzer module's package. Since this is a trace span analyzer (shared across SkyWalking/OTEL/Zipkin receivers), the package should be org.apache.skywalking.oap.server.analyzer.genai.*.
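
For point 2, here is a minimal Java sketch of why the declaration matters. Everything in it is a hypothetical stand-in rather than SkyWalking's real classes: `Provider` mimics only the `requiredModules()` contract, `"core"` is assumed to be the value of `CoreModule.NAME`, and `missingModules` is an illustrative check, not the actual bootstrap logic. The idea is that a loader can only validate or order the dependencies a provider declares.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RequiredModulesSketch {
    // Assumed stand-in for CoreModule.NAME.
    static final String CORE = "core";

    // Hypothetical, reduced version of a module provider contract.
    interface Provider {
        String[] requiredModules();
    }

    // As submitted: the provider uses the core module but declares nothing.
    static final Provider UNDECLARED = () -> new String[0];

    // As requested in the review: declare the dependency explicitly.
    static final Provider DECLARED = () -> new String[] {CORE};

    // Illustrative check: which declared dependencies are not loaded?
    static List<String> missingModules(Provider provider, Set<String> loaded) {
        List<String> missing = new ArrayList<>();
        for (String name : provider.requiredModules()) {
            if (!loaded.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> loaded = Set.of("storage"); // core is absent on purpose
        System.out.println(missingModules(UNDECLARED, loaded)); // nothing to report
        System.out.println(missingModules(DECLARED, loaded));   // reports the missing core module
    }
}
```

With an empty declaration the check has nothing to report, so a missing core module would presumably surface only later, when `start()` first touches it.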

Design Issues

5. Duplicate OAL metric
gen_ai_provider_resp_time and gen_ai_provider_latency_avg both compute from(GenAIProviderAccess.latency).longAvg(). Remove one.

6. totalCost semantics are confusing
Stored value is tokens * costPerM, dashboard divides by 1,000,000. Better to store actual cost by dividing at computation time.

7. Missing NamingControl in VirtualGenAIProcessor
Other virtual processors all use NamingControl to normalize service names. GenAI processor skips this.

8. Tag key inconsistency: gen_ai.stream.ttfr vs timeToFirstToken
Tag says "ttfr", field says "timeToFirstToken", doc doesn't mention this tag at all.
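
For point 6, a minimal Java sketch of the two options. The class and method names are hypothetical and not taken from the PR; `costPerM` is assumed to mean the price per one million tokens, as the dashboard's division by 1,000,000 suggests, and the price used in `main` is made up.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical illustration; not the PR's code.
final class TotalCostSketch {
    private static final BigDecimal TOKENS_PER_MILLION = BigDecimal.valueOf(1_000_000L);

    // Current behavior as described: an unscaled product that the dashboard divides later.
    static BigDecimal storedToday(long tokens, BigDecimal costPerM) {
        return costPerM.multiply(BigDecimal.valueOf(tokens));
    }

    // Suggested alternative: divide at computation time so the stored value is the actual cost.
    static BigDecimal actualCost(long tokens, BigDecimal costPerM) {
        return storedToday(tokens, costPerM)
            .divide(TOKENS_PER_MILLION, 9, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        BigDecimal costPerM = new BigDecimal("2.50"); // assumed price per 1M tokens
        System.out.println(storedToday(1_500, costPerM)); // unscaled value
        System.out.println(actualCost(1_500, costPerM));  // actual cost
    }
}
```

For 1,500 tokens at the assumed price, the unscaled value is 3750 while the actual cost is 0.00375, which is the number a reader of the metric would expect to see.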

Code Quality Issues

9. GenAIConfigLoader constructor ignores Yaml parameter
Accepts Yaml but creates a new one in loadConfig().

10. fastjson dependency in e2e test
No new dependency version should be added directly in sub-module pom.xml.
Dependencies are managed by the BOM. We have decided not to include this library, as it has had many critical CVEs before. Each time, we would have to fix them by re-releasing a patch version, which is too painful.

11. E2E Dockerfile clones unpinned external repo
Dockerfile.provider clones spring-projects/spring-ai-examples without pinning a commit/tag. Any upstream change could break the e2e test.

12. Documentation typo
virtual-genai.md: "Virtual cache represent the Generative AI service nodes" - copy-paste from virtual-cache doc.

Minor Issues

13. Missing newline at end of file in multiple files: gen-ai-config.yml, menu.yaml, SPI files, e2e expected YAMLs, dashboard JSONs.

14. GenAIModelAccessDispatcher bypasses normal dispatch flow - directly calls MetricsStreamProcessor.getInstance().in(traffic).

15. VirtualGenAIProcessor.recordList should be final.

16. Blank line in import block in VirtualServiceAnalysisListener.java between java.util and lombok imports.

Member

@wu-sheng wu-sheng left a comment

Additional issue: should use percentile2 instead of percentile

All production OAL files use percentile2(10). The old percentile function only exists in e2e test OAL for backward-compatibility testing.

In virtual-gen-ai.oal, the following lines should use percentile2:

gen_ai_provider_latency_percentile = from(GenAIProviderAccess.latency).percentile2(10);
gen_ai_model_latency_percentile = from(GenAIModelAccess.latency).percentile2(10);
gen_ai_model_ttft_percentile = from(GenAIModelAccess.timeToFirstToken).filter(timeToFirstToken > 0).percentile2(10);

And your UI doesn't show the correct percentile labels.

@wu-sheng wu-sheng added the feature New feature label Mar 16, 2026
@peachisai
Contributor Author

peachisai commented Mar 16, 2026

@wu-sheng
Hi,
Regarding point 6 (totalCost semantics):
I previously found that if the cost of a single model call is extremely small, specifically less than 0.001, storing it as a direct decimal may result in the value being rounded down to 0 when it is finally stored in the database. This would lead to significant inaccuracies when calculating sums for aggregate reports.
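
The concern can be illustrated with a small Java sketch. It rests on an assumption not confirmed in this thread, namely that the stored metric value ends up as a whole number; the class, method names, and prices are hypothetical.

```java
// Hypothetical illustration of the rounding concern; not the PR's code.
final class SmallCostRoundingSketch {
    // Option A: compute each call's actual cost, then truncate it to a whole number
    // (assumption: the stored metric value is an integer type).
    static long sumOfTruncatedCosts(long tokensPerCall, long costPerM, int calls) {
        long total = 0;
        for (int i = 0; i < calls; i++) {
            double cost = tokensPerCall * costPerM / 1_000_000d;
            total += (long) cost; // a cost below 1 becomes 0 here
        }
        return total;
    }

    // Option B: store tokens * costPerM as an integer and divide once at read time.
    static double sumOfScaledCosts(long tokensPerCall, long costPerM, int calls) {
        long scaledTotal = 0;
        for (int i = 0; i < calls; i++) {
            scaledTotal += tokensPerCall * costPerM;
        }
        return scaledTotal / 1_000_000d;
    }

    public static void main(String[] args) {
        System.out.println(sumOfTruncatedCosts(300, 2, 1_000));
        System.out.println(sumOfScaledCosts(300, 2, 1_000));
    }
}
```

With 1,000 calls of 300 tokens at an assumed price of 2 per million tokens, option A sums to 0 while option B yields roughly 0.6.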

@wu-sheng
Member

UI side got merged. When you update this PR, please include the submodule update.

@peachisai
Contributor Author

peachisai commented Mar 18, 2026

Not finished yet. Some checks fail in my local environment; I am still fixing them.

config: test/e2e-v2/cases/zipkin/kafka/e2e.yaml
- name: Zipkin BanyanDB
config: test/e2e-v2/cases/zipkin/banyandb/e2e.yaml
- name: Virtual-genai
Member

Suggested change
- name: Virtual-genai
- name: Virtual GenAI

@@ -0,0 +1,16 @@
# Virtual GenAI
Member

You need to update the demo to point to here. I think from Marketplace/General Service?

Contributor Author

[image] Like this?

Member

Not just this. menu.yml in /docs/en has not been updated.

@wu-sheng
Member

e2e fails, please fix it.
