A102: xDS `GrpcService` Support
----
* Author(s): @markdroth, @sergiitk
* Approver: @ejona86, @dfawley
* Status: {Draft, In Review, Ready for Implementation, Implemented}
* Implemented in: <language, ...>
* Last updated: 2025-09-18
* Discussion at: https://groups.google.com/g/grpc-io/c/3hguVpr8maE

## Abstract

There are several features that require the xDS control plane to configure
gRPC to talk to a side-channel service, such as rate limiting ([A77]),
ExtAuthz ([A92]), and ExtProc ([A93]). This design specifies how the
control plane will configure the communication with these side-channel
services. It also addresses the relevant security implications.

## Background

The control plane configures communication with side-channel services
via the xDS [`GrpcService`
proto](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/grpc_service.proto#L29).
This message tells the data plane how to find the side-channel service
and what channel credentials and call credentials to use for that
communication.

### Related Proposals:
* [gRFC A27: xDS-Based Global Load Balancing][A27]
* [gRFC A29: xDS-Based mTLS Security for gRPC Clients and Servers][A29]
* [gRFC A77: xDS Server-Side Rate Limiting][A77] (WIP)
* [gRFC A81: xDS Authority Rewriting][A81]
* [gRFC A92: xDS ExtAuthz Support][A92] (WIP)
* [gRFC A93: xDS ExtProc Support][A93] (WIP)
* [A97: xDS JWT Call Credentials][A97]

[A27]: A27-xds-global-load-balancing.md
[A29]: A29-xds-tls-security.md
[A77]: https://github.com/grpc/proposal/pull/414
[A81]: A81-xds-authority-rewriting.md
[A92]: https://github.com/grpc/proposal/pull/481
[A93]: https://github.com/grpc/proposal/pull/484
[A97]: A97-xds-jwt-call-creds.md

## Proposal

gRPC will support the `GrpcService` message. Within that message, it
will support only the
[`GoogleGrpc`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/grpc_service.proto#L68)
target specifier, not
[`EnvoyGrpc`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/grpc_service.proto#L33)
(see "Rationale" section below).

### Security Considerations

The control plane specifying the side-channel target and credentials
introduces a number of potential privilege-escalation attacks from a
compromised control plane. Here are some examples of such attacks:

- Because the side-channel target name comes from the control plane
rather than being configured locally on the client, a compromised
control plane can tell the client to talk to an attacker-controlled
side-channel service. When used for functionality like ExtProc, this
would allow the control plane to get access to the contents of data
plane RPCs.

- Note: Even if the client could enforce the use of a channel credential
type like TLS that verifies that the server's identity matches the
target name, that would not mitigate this attack, because the
target name itself is coming from the control plane.

- Note: Even if the client locally configured the target name of
the side-channel service but trusted the control plane to specify
the credential type, the control plane could specify
`InsecureCredentials`, and then it would just need to control the
client's name resolution in order to send the client to an
attacker-controlled side-channel service.

- A compromised control plane could instruct gRPC to contact an
attacker-controlled side-channel service using a call credential that
sends an access token, which would leak that access token.

There will be cases where it is acceptable to trust the control plane
to have that kind of privilege-escalation capability, and there will be
other cases where it is not. To differentiate between these two cases, we
will rely on the `trusted_xds_server` server feature that was added to the
gRPC xDS bootstrap config in [A81].

When gRPC receives a `GrpcService` proto from an xDS server, it will
check to see if the `trusted_xds_server` server feature is present in
the bootstrap config for that xDS server. If so, then it will trust
the target name and credentials specified in the `GrpcService` proto.
If not, gRPC will instead obtain the side-channel configuration locally
from the gRPC xDS bootstrap config, as described below.

Specifically, we will add the following new top-level field to the
bootstrap config:

```json5
// The list of side-channel services allowed to be configured via xDS.
"allowed_grpc_services": {
  // The key is the fully-qualified target URI.
  "dns:///ratelimit.example.org:443": {
    // List of channel creds. Client will stop at the first type it
    // supports. This field is required and must contain at least one
    // channel creds type that the client supports.
    "channel_creds": [
      {
        "type": <string containing channel cred type>,
        // The "config" field is optional; it may be missing if the
        // credential type does not require config parameters.
        "config": <JSON object containing config for the type>
      }
    ],
    // List of call creds. Optional. Client will apply all call creds
    // types that it supports but will ignore any types that it does not
    // support.
    "call_creds": [
      {
        "type": <string containing call cred type>,
        // The "config" field is optional; it may be missing if the
        // credential type does not require config parameters.
        "config": <JSON object containing config for the type>
      }
    ]
  }
}
```

Note that the `channel_creds` and `call_creds` fields follow the same
format as the top-level `xds_servers` field. See [A27] and [A97] for
details.

When gRPC receives a `GrpcService` proto from an untrusted control
plane, it will look up the target URI from the `GrpcService` proto in
the `allowed_grpc_services` map. If the specified target URI is not
present in the map, then the `GrpcService` proto will be considered
invalid, resulting in gRPC NACKing the xDS resource. If the specified
target URI *is* present in the map, then the `GrpcService` proto will be
considered valid, but gRPC will ignore the credential information from
the `GrpcService` proto; instead, it will use the channel credentials
and call credentials specified in the map.
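
The trust decision described above can be sketched as follows. This is an
illustrative sketch only: `resolve_side_channel_config`, `NackError`, and
the dictionary shapes (including the `xds_servers_by_name` lookup) are
hypothetical stand-ins, not actual gRPC APIs or bootstrap fields.

```python
# Hypothetical sketch of the trust decision for a GrpcService proto.
# None of these names are real gRPC APIs.

class NackError(Exception):
    """Raised when the xDS resource containing the GrpcService is invalid."""

def resolve_side_channel_config(bootstrap, server_name, grpc_service):
    """Returns (target_uri, channel_creds, call_creds) for the side channel."""
    server = bootstrap["xds_servers_by_name"][server_name]  # hypothetical lookup
    if "trusted_xds_server" in server.get("server_features", []):
        # Trusted control plane: use the target and credentials from the proto.
        return (grpc_service["target_uri"],
                grpc_service["channel_credentials_plugin"],
                grpc_service["call_credentials_plugin"])
    # Untrusted control plane: the target must be allow-listed in the
    # bootstrap, and credentials always come from the bootstrap map,
    # never from the proto.
    allowed = bootstrap.get("allowed_grpc_services", {})
    entry = allowed.get(grpc_service["target_uri"])
    if entry is None:
        raise NackError("target URI not in allowed_grpc_services")
    return (grpc_service["target_uri"],
            entry["channel_creds"],
            entry.get("call_creds", []))
```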

### Credential Configuration in `GrpcService` Proto

In order to make channel and call credentials more pluggable, we are
introducing new extension points in `GrpcService`, as shown in
https://github.com/envoyproxy/envoy/pull/40823. Specifically, this
introduces the following new fields:

- `channel_credentials_plugin`: This provides an extension point to
specify channel credentials. Just like in the gRPC xDS bootstrap
format, in order to facilitate easier introduction of new credential
types, this field is structured as a list, and the client will iterate
over the list and stop at the first credential type that it supports.
If it does not find any supported credential type in the list, that is
a validation error, and the xDS resource will be NACKed. If it finds
a supported credential type but the config is invalid, then the xDS
resource will also be NACKed. The following extensions will be supported
in this field:

- `envoy.extensions.grpc_service.channel_credentials.google_default.v3.GoogleDefaultCredentials`
- `envoy.extensions.grpc_service.channel_credentials.insecure.v3.InsecureCredentials`
- `envoy.extensions.grpc_service.channel_credentials.local.v3.LocalCredentials`
- `envoy.extensions.grpc_service.channel_credentials.tls.v3.TlsCredentials`:
In this message:
- `root_certificate_provider`: Required. References certificate
provider instances configured at the top level of the bootstrap
config. Validated the same way as in `CommonTlsContext` (see [A29]).
- `identity_certificate_provider`: Optional. References certificate
provider instances configured at the top level of the bootstrap
config. Validated the same way as in `CommonTlsContext` (see [A29]).
- `envoy.extensions.grpc_service.channel_credentials.xds.v3.XdsCredentials`:
In this message:
- `fallback_credentials`: Required. Specifies a channel credential
plugin to be used as fallback credentials.

- `call_credentials_plugin`: This provides an extension point to specify
call credentials. Just like in the gRPC xDS bootstrap config format,
in order to facilitate easier introduction of new credential types,
this field is structured as a list, and the client will iterate over
the list adding all credential types that it supports, ignoring any
type that it does not support. Note that unlike channel credentials,
call credentials are optional, and there can be more than one, so the
client will need to iterate over the entire list, but it's valid if
none of the specified types are supported. If the client finds a
supported credential type but the config is invalid, then the xDS
resource will be NACKed. The following extensions will be supported
in this field:

- `envoy.extensions.grpc_service.call_credentials.access_token.v3.AccessTokenCredentials`:
In this message:
- `token`: Required. The access token. The token will be added as
an `authorization` header with header `Bearer ` (note trailing
space) followed by the value of this field. Note that the
token will not be sent on the wire unless the connection has
security level PRIVACY_AND_INTEGRITY.
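
The header construction described here can be illustrated with a small
sketch; `build_authorization_header` is a hypothetical helper, and the
security-level check is elided.

```python
def build_authorization_header(token: str) -> tuple[str, str]:
    # Per the AccessTokenCredentials description above: the token is sent
    # in the "authorization" header, prefixed with "Bearer " (note the
    # trailing space). The client must only send this header on connections
    # with security level PRIVACY_AND_INTEGRITY; that check is not shown.
    return ("authorization", "Bearer " + token)
```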

Note that this will require extending the channel credentials and call
credentials registries to support configuration via these protos, in
addition to the JSON formats that they already support for the gRPC
xDS bootstrap config.
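
The differing iteration semantics of the two plugin lists can be sketched
as follows; `select_channel_creds`, `select_call_creds`, and `NackError`
are hypothetical names, not actual gRPC APIs, and per-type config
validation is elided.

```python
class NackError(Exception):
    """Raised when the xDS resource must be NACKed."""

def select_channel_creds(plugins, supported_types):
    # Channel creds: stop at the FIRST supported type. Finding no
    # supported type at all is a validation error (NACK). Validation of
    # the chosen plugin's config (invalid config => NACK) is elided here.
    for plugin in plugins:
        if plugin["type"] in supported_types:
            return plugin
    raise NackError("no supported channel credential type in list")

def select_call_creds(plugins, supported_types):
    # Call creds: collect ALL supported types, silently skipping
    # unsupported ones. An empty result is valid, since call
    # credentials are optional.
    return [p for p in plugins if p["type"] in supported_types]
```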

### `GrpcService` Proto Validation

When validating a `GrpcService` proto, the following fields will be used:
- [`google_grpc`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/grpc_service.proto#L303):
This field must be set. Inside of it:
- [`target_uri`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/grpc_service.proto#L254):
Must be set to a valid gRPC target URI. The target URI must be
checked against the resolver registry during xDS resource
validation.
- `channel_credentials_plugin`: See above.
- `call_credentials_plugin`: See above.
- `channel_credentials`, `call_credentials`,
`credentials_factory_name`, and `config`: Ignored; we will use the
new credential plugin fields above instead.
- `stat_prefix`: Ignored; not relevant to gRPC.
- `per_stream_buffer_limit_bytes`: Ignored. We don't have a use-case
for this right now but could add it later if needed.
- `channel_args`: Ignored. Not supportable across languages in gRPC.
- [`timeout`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/grpc_service.proto#L308):
If set, this will be used to set the deadline on RPCs sent to the
side-channel service. The value must obey the restrictions specified in
the [`google.protobuf.Duration`
documentation](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#google.protobuf.Duration),
and it must have a positive value.
- [`initial_metadata`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/grpc_service.proto#L315):
If present, specifies headers to be added to RPCs sent to the side-channel
service. Inside of each entry:
- [`key`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/base.proto#L404):
The key's length must be in the range [1, 16384), and it must be a
valid HTTP/2 header name.
- [`value`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/base.proto#L415):
Specifies the header value. Must be shorter than 16384 bytes. Must
be a valid HTTP/2 header value. Not used if `key` ends in `-bin`
and `raw_value` is set.
- [`raw_value`](https://github.com/envoyproxy/envoy/blob/7ebdf6da0a49240778fd6fed42670157fde371db/api/envoy/config/core/v3/base.proto#L422):
Used only if `key` ends in `-bin`. Must be shorter than 16384 bytes.
Will be base64-encoded on the wire, unless the pure binary metadata
extension from [gRFC G1: True Binary
Metadata](G1-true-binary-metadata.md) is used.
- `envoy_grpc`: This field is not used. See "Rationale" section below
for details.
- `retry_policy`: This field is not used. If retries are needed, they
should be configured in the [service
config](https://github.com/grpc/grpc/blob/master/doc/service_config.md)
for the side-channel service, or by using xDS on the side channel itself.
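
The `initial_metadata` entry checks above can be sketched as follows.
This is an illustrative sketch: `validate_metadata_entry` is a
hypothetical name, and full HTTP/2 header-name/value character validation
is elided.

```python
def validate_metadata_entry(key, value="", raw_value=b""):
    """Returns an error string for an invalid HeaderValue entry, else None."""
    # Key length must be in [1, 16384); it must also be a valid HTTP/2
    # header name (character-level validation elided here).
    if not (1 <= len(key) < 16384):
        return "key length must be in the range [1, 16384)"
    if key.endswith("-bin") and raw_value:
        # raw_value is used only for "-bin" keys; it will be
        # base64-encoded on the wire (absent true-binary metadata).
        if len(raw_value) >= 16384:
            return "raw_value must be shorter than 16384 bytes"
    else:
        # Otherwise the textual value is used; it must also be a valid
        # HTTP/2 header value (character-level validation elided).
        if len(value) >= 16384:
            return "value must be shorter than 16384 bytes"
    return None
```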

### Temporary environment variable protection

This gRFC does not describe a discrete feature; the functionality it
describes will be used only in the context of other features, which
will each have their own environment variable protection. Therefore,
no additional environment variable protection is needed here.

Note that when implementing one of those other features, it will be
important for the appropriate environment variable guard to cover
reading the `allowed_grpc_services` field in the bootstrap config.

## Rationale

### `GoogleGrpc` vs. `EnvoyGrpc`

Envoy supports both `GoogleGrpc` and `EnvoyGrpc` target specifiers.
The latter uses Envoy's own gRPC implementation, which is essentially
a small wrapper on top of its existing HTTP/2 functionality. Rather
than configuring the side-channel using a gRPC target URI, it specifies
the side-channel using the name of an xDS cluster, which must already be
part of the data plane's configuration. That approach does not make
sense in gRPC, for two reasons.

First, even in Envoy, `EnvoyGrpc` makes sense only in cases where
the data plane's xDS configuration already includes a cluster for the
side-channel service; in any other case, `GoogleGrpc` would be used
instead. But unlike Envoy, gRPC is not a general-purpose proxy that
handles routing requests for multiple services in a single instance;
instead, each gRPC channel is created for one specific target (i.e.,
one particular service) and generally contains only the configuration
for that service, which means that in practice its xDS configuration
never includes the xDS cluster for the side-channel service.

Second, even if gRPC's xDS configuration did include the cluster for
the side-channel service, gRPC's architecture does not support sending
traffic to a specific cluster. Unlike Envoy, gRPC does not have a
distinct Cluster Manager that can be used to select a cluster to send
requests to, which means that it fundamentally doesn't make sense to
use an xDS cluster name to contact the side-channel service.

### Security Concerns

A more comprehensive approach to the security concerns would be to
provide a mechanism to cryptographically sign the xDS resources and have
the client verify the signature. This would ensure that a compromised
control plane would not be able to send arbitrary resources to clients;
instead, the attack would have to happen where the xDS resources are
constructed and signed, which could in principle be better protected.

We are in favor of this approach, but it will require a lot more work,
so we are leaving it as a future improvement.

## Implementation

Will be implemented in C-core, Java, Go, and Node as part of either RLQS
([A77]), ExtAuthz ([A92]), or ExtProc ([A93]), whichever happens to be
implemented first in any given language.