feat(db): add Azure managed identity authentication for PostgreSQL #12687

RogerHYang wants to merge 2 commits into main from
Conversation
if not get_env_postgres_use_azure_managed_identity():
    return
The early return here means that when PHOENIX_POSTGRES_USE_AWS_IAM_AUTH=true (and Azure MI is not enabled), this function returns immediately with no AWS-specific startup validation at all.
The old _validate_iam_auth_config() validated three things at startup for the AWS path:
- That `boto3` was importable (clear error message if missing)
- That `PHOENIX_POSTGRES_USER` was set
- That AWS credentials were valid via `sts.get_caller_identity()`
After this refactor, none of those checks remain for the AWS-only path. A misconfigured AWS setup (missing aioboto3, missing user, invalid credentials) will now surface as a cryptic connection error on first DB access rather than a clear startup failure.
Consider adding an AWS-specific validation branch analogous to the Azure one:

    if not get_env_postgres_use_azure_managed_identity():
        if get_env_postgres_use_iam_auth():
            _validate_aws_iam_config()  # new helper with aioboto3 import check + user check
        return

Or, at minimum, reuse the existing user check and add an aioboto3 import check for the AWS path before returning.
See old _validate_iam_auth_config in this diff for the checks that were removed.
Intentional. The three checks the old _validate_iam_auth_config() did are now either redundant or low-value:
- `aioboto3` importable — `aws_auth.py` does this at factory construction with an explicit install-instructions ImportError. The factory runs during app startup, before migrations, so re-checking it in `config.py` would duplicate the same message a few frames earlier.
- `PHOENIX_POSTGRES_USER` set — `aws_auth.py` raises `ValueError("Database user is required for AWS RDS IAM authentication")` at factory construction. Same check, same message, right next to the code that uses the value.
- `sts.get_caller_identity()` credential pre-flight — dropped deliberately. `aioboto3`'s `generate_db_auth_token` surfaces credential failures (`NoCredentialsError`, `ClientError`) at the first DB connection, which is the migration step running moments after startup. The startup STS call also adds a synchronous network round-trip and an STS dependency that isn't otherwise on the critical path.
Net: the checks that matter already live at the factory layer; the one that's genuinely gone was providing a slightly-earlier version of the same information the SDK already emits.
Force-pushed from 6959807 to c9ef31e
Adds first-class support for Microsoft Entra managed-identity authentication
when connecting Phoenix to Azure Database for PostgreSQL (Flexible Server),
and along the way modernizes the existing AWS RDS IAM auth path so the two
clouds share a consistent shape.
Azure managed identity (new):
- src/phoenix/db/azure_auth.py: new module providing
create_azure_token_connection_creator(), which wires one
DefaultAzureCredential per factory call and returns an async creator
that obtains a fresh JWT via credential.get_token(scope) on every new
database connection. Token caching and refresh are handled by
azure-identity's built-in MSAL TokenCache (sync and async share the
same machinery via ManagedIdentityClientBase, verified by reading the
installed 1.18.0 and 1.25.3 source under .venv).
- New env vars:
- PHOENIX_POSTGRES_USE_AZURE_MANAGED_IDENTITY (bool, default false)
- PHOENIX_POSTGRES_AZURE_SCOPE (string, default public-cloud scope
https://ossrdbms-aad.database.windows.net/.default; override only
needed for sovereign clouds, e.g. Azure US Government).
- Validation is enforced at the layer that actually uses each input:
mutual exclusion with AWS IAM (cannot enable both simultaneously) is
checked at the engines.py dispatch point; PASSWORD-must-not-be-set
and USER-must-be-set checks live in get_env_postgres_connection_str
alongside URL composition; the azure-identity import check lives in
azure_auth.py next to the code that uses it.
- pyproject.toml: new optional [azure] extra with
azure-identity >= 1.18.0 and aiohttp (required async transport for
azure.identity.aio). The 1.18.0 floor is the first stable release
where ManagedIdentityCredential uses MSAL's TokenCache, which is
the caching behavior Phoenix relies on. Also added to [dev] and
[container] extras so CI and the bundled container image get it.
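The factory shape described above can be sketched roughly as follows. This is a simplified illustration, not the Phoenix source: the parameter names, the validation message, and the asyncpg wiring are assumptions inferred from the description.

```python
# Rough sketch of the described factory: one DefaultAzureCredential per
# factory call, and a fresh token fetched for every new DB connection.
AZURE_PUBLIC_SCOPE = "https://ossrdbms-aad.database.windows.net/.default"

def create_azure_token_connection_creator(
    host: str, database: str, username: str, scope: str = AZURE_PUBLIC_SCOPE
):
    # Lazy imports keep azure-identity optional unless the feature is
    # enabled, mirroring the import-hoisting rule described for engines.py.
    import asyncpg
    from azure.identity.aio import DefaultAzureCredential

    if not (host and database and username):
        raise ValueError("host, database, and username are required")

    credential = DefaultAzureCredential()  # one credential per factory call

    async def connect():
        # A fresh JWT per new connection; azure-identity's internal MSAL
        # TokenCache makes repeated calls cheap while the token is valid.
        token = await credential.get_token(scope)
        return await asyncpg.connect(
            host=host, database=database, user=username, password=token.token
        )

    return connect
```

The closure owns the credential, so token caching lives with the factory rather than with any individual connection.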
AWS RDS IAM (simplified + modernized):
- Rename src/phoenix/db/iam_auth.py -> src/phoenix/db/aws_auth.py so
the module name reflects the cloud rather than the auth mechanism.
- Switch the token generator from blocking boto3 to aioboto3 so
generate_db_auth_token no longer blocks the event loop during
connection-open. One aioboto3.Session is reused for the factory's
lifetime; a fresh client context is opened per connection because
aioboto3 does not allow reusing the same async-context object.
- Introduce create_aws_iam_token_connection_creator() as a public
factory that mirrors the Azure one: URL validation, closed-over
creator, same signature. _generate_aws_rds_token is now private.
- Docstrings distinguish client-side SigV4 signing from RDS's
server-side signature verification (RDS does verify via AWS's
internal IAM machinery; that round-trip is paid by RDS, not the
client).
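The AWS flow above can be sketched in the same shape. Again an illustration under assumptions, not the Phoenix source: parameter names are invented, and the awaitable `generate_db_auth_token` call reflects the aioboto3 switch described above rather than a verified signature.

```python
# Rough sketch of the described AWS path: the Session outlives the factory,
# the client context is re-entered per connection, and the SigV4-signed
# token is used as the connection password.
def create_aws_iam_token_connection_creator(
    host: str, user: str, port: int = 5432, database: str = "postgres"
):
    import asyncpg
    import aioboto3  # optional dependency; imported only when this path runs

    if not (host and user):
        raise ValueError(
            "Database host and user are required for AWS RDS IAM authentication"
        )

    session = aioboto3.Session()  # reused for the factory's lifetime

    async def connect():
        # Fresh client context per connection: aioboto3 does not allow
        # re-entering the same async-context object.
        async with session.client("rds") as rds:
            token = await rds.generate_db_auth_token(
                DBHostname=host, Port=port, DBUsername=user
            )
        return await asyncpg.connect(
            host=host, port=port, database=database, user=user, password=token
        )

    return connect
```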
Pool configuration (src/phoenix/db/engines.py):
- Introduce _POOL_RECYCLE_SECONDS = 3300 as a module-level hygiene cap
with a comment explaining it is NOT a token-expiry mechanism.
PostgreSQL validates the credential only during the startup
handshake and never re-checks it for the life of the session (PG
protocol docs), and both Microsoft and AWS explicitly document that
existing connections are unaffected by later token expiry — so
coupling pool_recycle to any token lifetime is neither correct nor
useful.
- Apply pool_pre_ping=True and pool_recycle=_POOL_RECYCLE_SECONDS
consistently across the Azure, AWS, and plain-password main engines.
- NullPool migration engines intentionally omit both knobs:
pool_pre_ping is skipped on freshly-created connections (see
sqlalchemy.pool.base where _ConnectionRecord.fresh is set after
_invoke_creator and the checkout path tests connection_is_fresh),
and NullPool creates a fresh record on every checkout, so there is
no long-lived connection for pool_recycle to age out.
- Hoist the cloud-auth imports inside their respective branches so
importing the engines module does not require azure-identity or
aioboto3 to be installed unless the corresponding feature is used.
- Type the creator variables as
Callable[[], Awaitable[Connection]] | None and narrow with assert
before migration-engine use.
- Add a dispatch-time mutual exclusion check that raises ValueError
when both PHOENIX_POSTGRES_USE_AWS_IAM_AUTH and
PHOENIX_POSTGRES_USE_AZURE_MANAGED_IDENTITY are set to true, so
the misconfiguration fails loudly instead of silently picking one
branch.
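The dispatch-time shape described above can be sketched as follows. The env-var names and the 3300-second constant come from this PR; the helper function and its name are illustrative.

```python
# Sketch of the dispatch-time mutual-exclusion check and the shared pool knobs.
import os
from typing import Optional

_POOL_RECYCLE_SECONDS = 3300  # hygiene cap, NOT a token-expiry mechanism

def _env_flag(name: str) -> bool:
    return os.environ.get(name, "false").strip().lower() == "true"

def choose_cloud_auth() -> Optional[str]:
    use_aws = _env_flag("PHOENIX_POSTGRES_USE_AWS_IAM_AUTH")
    use_azure = _env_flag("PHOENIX_POSTGRES_USE_AZURE_MANAGED_IDENTITY")
    if use_aws and use_azure:
        # Fail loudly instead of silently picking one branch.
        raise ValueError(
            "PHOENIX_POSTGRES_USE_AWS_IAM_AUTH and "
            "PHOENIX_POSTGRES_USE_AZURE_MANAGED_IDENTITY are mutually exclusive"
        )
    return "azure" if use_azure else ("aws" if use_aws else None)

# Main engines get both knobs; NullPool migration engines get neither.
MAIN_ENGINE_POOL_KWARGS = {"pool_pre_ping": True, "pool_recycle": _POOL_RECYCLE_SECONDS}
```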
Config (src/phoenix/config.py):
- Add new env-var constants + getters for managed identity:
get_env_postgres_use_azure_managed_identity(),
get_env_postgres_azure_scope().
- Rename get_env_postgres_use_iam_auth() ->
get_env_postgres_use_aws_iam_auth() for symmetry with the env var
name. The old AWS getter is replaced in place (no call sites
reference the old name).
- verify_server_environment_variables(): emit a one-line startup
warning if PHOENIX_POSTGRES_AWS_IAM_TOKEN_LIFETIME_SECONDS is still
set in the environment. The variable is silently ignored — it
never correctly controlled token freshness anyway, because the PG
server does not re-validate the token mid-session — but surfacing
the fact that it is being ignored helps users notice the removal.
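A minimal sketch of that startup warning, assuming a module-level logger; the function name is hypothetical, while the env-var name is the one this PR deprecates.

```python
# Sketch of the one-line startup warning for the ignored variable.
import logging
import os

logger = logging.getLogger("phoenix.config")

def warn_if_stale_token_lifetime_set() -> None:
    # The variable is ignored, not an error: warn once so users notice.
    if "PHOENIX_POSTGRES_AWS_IAM_TOKEN_LIFETIME_SECONDS" in os.environ:
        logger.warning(
            "PHOENIX_POSTGRES_AWS_IAM_TOKEN_LIFETIME_SECONDS is set but ignored; "
            "Phoenix generates a fresh token per connection and uses a fixed "
            "pool_recycle hygiene cap."
        )
```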
Tests:
- tests/unit/db/test_azure_auth.py: parsimonious 6-test suite for the
Azure creator: missing host/database/username validation,
connect_args forwarding to asyncpg.connect, per-call token retrieval
with correct password wiring (uses side_effect to hand out two
different tokens and asserts each lands at the right connect call),
and an integration check that the scope from
get_env_postgres_azure_scope() reaches credential.get_token(). The
helper _make_token_response only sets .token because that is the
only field the production code reads.
- tests/unit/db/test_aws_auth.py: new 6-test suite mirroring the
Azure structure — missing host/user validation, connect_args
forwarding, URL defaults (port=5432, database='postgres'),
per-call token generation with correct password wiring, and URL
components (host/port/user) correctly flowing to
generate_db_auth_token kwargs. Uses aioboto3.Session mocks.
- tests/unit/test_config.py: TestGetEnvPostgresUseManagedIdentity and
TestGetEnvPostgresAzureScope cover the new getters. MI-specific
validation is exercised via TestPostgresConnectionStringManagedIdentity
(mutual exclusion, password-conflict, host/user handling), which
covers the layer that actually composes the connection string.
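The two-token `side_effect` pattern mentioned above can be demonstrated standalone with stdlib mocks; the names here are illustrative, not the actual test code.

```python
# Demonstrates the pattern: an AsyncMock hands out two different tokens,
# and each token lands at the matching connect call, in order.
import asyncio
from unittest.mock import AsyncMock

async def demo():
    get_token = AsyncMock(side_effect=["token-1", "token-2"])
    connect = AsyncMock()

    async def creator():
        # Each new "connection" fetches its own token, like the real creator.
        await connect(password=await get_token())

    await creator()
    await creator()
    return connect

connect = asyncio.run(demo())
assert connect.call_args_list[0].kwargs["password"] == "token-1"
assert connect.call_args_list[1].kwargs["password"] == "token-2"
```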
Documentation:
- internal_docs/specs/postgres-cloud-auth-pooling.md: new 384-line
reference covering PostgreSQL protocol semantics, Azure managed
identity flow, AWS RDS IAM flow, how the two clouds compare,
SQLAlchemy pool mechanics, common assumptions (with what actually
happens), Phoenix implementation notes, and the security model.
Every non-trivial claim is cited to a primary source; inferences
and unverified claims are explicitly marked.
- docs/phoenix/self-hosting/configuration/using-azure-managed-identity.mdx:
new user-facing Mintlify page with env-var setup, a note on
non-public-cloud scope overrides, and an explicit note that no
token-lifetime tuning variable is needed.
- docs/phoenix/self-hosting/configuration/using-amazon-aurora.mdx:
remove the stale PHOENIX_POSTGRES_AWS_IAM_TOKEN_LIFETIME_SECONDS
example and replace with a note that Phoenix generates a fresh
token per connection.
- internal_docs/specs/pg-read-replica-routing.md: reference the
renamed get_env_postgres_use_aws_iam_auth() and drop the stale
token-lifetime reference.
- docs.json / docs/phoenix/llms.txt: wire the new Azure guide into
Mintlify navigation.
Helm (helm/values.yaml, helm/README.md, helm/templates/phoenix/configmap.yaml):
- Remove database.postgres.awsIamTokenLifetimeSeconds from values.yaml,
the README table, and the configmap template. Custom values files
that still set this key will not cause helm upgrade to fail; the key
is simply unreferenced by any template.
- scripts/ci/test_helm.py: remove the custom-token-lifetime test case
and drop the token_lifetime parameter from
DatabaseValidators.aws_iam_auth.
Deprecations (backward compatible — no user action required):
- PHOENIX_POSTGRES_AWS_IAM_TOKEN_LIFETIME_SECONDS is silently ignored
with a one-line startup warning. It previously controlled
pool_recycle on the IAM branch; the new code uses a hygiene cap
instead. Users who had pinned this to a short value (e.g. 840s)
will see fewer connection rotations per hour, which is an
improvement (warm pool, fewer TCP+TLS+PG startup handshakes, fewer
SigV4 signing calls).
- helm database.postgres.awsIamTokenLifetimeSeconds is removed from
the chart. Custom values files that still carry the old key will
have `helm upgrade` succeed without error; the key becomes
unreferenced.
Co-Authored-By: Mitch Crites <156001633+mitch-crites-pm@users.noreply.github.com>
Force-pushed from c9ef31e to 57439d8
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
@RogerHYang I hit a production issue testing the original PR that may affect this one as well.
@mitch-crites-pm Thanks for the heads-up. I'll take a look at that issue. |
Root Cause Analysis: Event-loop affinity of `DefaultAzureCredential`
`DefaultAzureCredential` from `azure.identity.aio` lazily constructs an `aiohttp.ClientSession` on first `get_token`, and that session is bound to whichever event loop was running at the time. Before this change, the `use_azure` branch of `create_engine` built one credential and reused its creator closure for both the migration engine (driven by a throwaway `asyncio.run(...)` loop inside `migrate_in_thread`) and the primary engine (driven by uvicorn's loop). The migration loop used the credential first, binding its aiohttp session to that loop, and then closed. When the server later opened its first pooled connection, the credential reused its cached session from the dead loop and raised `RuntimeError: Event loop is closed` from inside the pool-open path.

`azure_auth.py` is restructured around a single public entry point, `create_azure_engine(url, connect_args, **engine_kwargs) -> AsyncEngine`, which builds its own `DefaultAzureCredential`, wires the async creator into `create_async_engine`, and monkey-patches the returned engine's instance-level `dispose` so that `await engine.dispose()` also closes the credential. The credential, the creator, and the dispose hook are all private to the closure. `aws_auth.py` is rewritten in parallel to expose `create_aws_engine` with the same shape (minus the dispose patch - aioboto3 opens and tears down its aiohttp client stack per call, so nothing loop-affine survives across calls).

`engines.py` now calls `create_azure_engine` twice in the `use_azure` branch - once for the primary engine, once for the migration engine - so each engine owns its own credential. Each credential binds to the loop that ends up using it and is closed on that same loop via `engine.dispose()`, which both the server shutdown path and the migration `asyncio.run` block already call. The AWS branch mirrors this shape for symmetry.
The lazy-loop-keyed-credential alternative was rejected because it has no clean way to close the migration-loop credential (the loop is already dead by the time the creator notices the change), producing a one-shot `Unclosed client session` warning per process start.

Tests are rewritten to mock `create_async_engine` inside each auth module, capture the wired `async_creator`, and exercise it directly. New lifecycle tests assert that `dispose` closes the credential on the success and raise paths and that two `create_azure_engine` calls produce two independent credentials.

`internal_docs/specs/postgres-cloud-auth-pooling.md` gains a section documenting the event-loop-affinity mechanism, the Phoenix-specific symptom, why AWS is unaffected, the chosen fix, and the alternative considered.
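The `create_azure_engine` shape described in this analysis can be sketched as below. The names mirror the text; the body is a simplified illustration under assumptions (asyncpg wiring, scope constant), not the Phoenix source.

```python
# Sketch: credential, creator, and dispose hook are all private to one
# closure, so each engine owns (and closes) its own credential on the
# event loop that actually used it.
AZURE_SCOPE = "https://ossrdbms-aad.database.windows.net/.default"

def create_azure_engine(url, connect_args=None, **engine_kwargs):
    # Lazy imports: the [azure] extra is only required when this branch runs.
    from azure.identity.aio import DefaultAzureCredential
    from sqlalchemy.ext.asyncio import create_async_engine

    credential = DefaultAzureCredential()  # one per engine, never shared

    async def async_creator():
        import asyncpg
        token = await credential.get_token(AZURE_SCOPE)
        return await asyncpg.connect(password=token.token, **(connect_args or {}))

    engine = create_async_engine(url, async_creator=async_creator, **engine_kwargs)

    # Instance-level monkey-patch: disposing the engine also closes the
    # credential, on the same loop that used both.
    original_dispose = engine.dispose

    async def dispose(close: bool = True) -> None:
        try:
            await original_dispose(close=close)
        finally:
            await credential.close()

    engine.dispose = dispose
    return engine
```

Calling this factory twice, as the `use_azure` branch now does, yields two engines with two independent credentials, each cleaned up by its own `engine.dispose()`.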
Card links check: No broken Card links found. Checked external links in 14.5s.
@mitch-crites-pm I fixed the event loop issue on this branch. |
Resolves #9069. Supersedes #12667 (will be closed after this lands).
Summary
Adds first-class support for Microsoft Entra managed-identity authentication when connecting Phoenix to Azure Database for PostgreSQL (Flexible Server). While wiring that in, the existing AWS RDS IAM auth path is modernized so both clouds share a consistent shape — the on-disk module is renamed to `aws_auth.py`, token generation is switched from blocking `boto3` to `aioboto3`, and the misconceived "align `pool_recycle` to the token lifetime" coupling is replaced with a plain hygiene cap on every engine.

What's new
Azure managed identity (feature)
- `src/phoenix/db/azure_auth.py`: `create_azure_token_connection_creator()` returns an async creator that obtains a fresh JWT via `credential.get_token(scope)` on every new DB connection. Token caching and refresh are handled by `azure-identity`'s built-in MSAL `TokenCache` (sync and async share the same machinery via `ManagedIdentityClientBase`, verified by reading the installed 1.18.0 and 1.25.3 source).
- New env vars: `PHOENIX_POSTGRES_USE_AZURE_MANAGED_IDENTITY` (bool, default `false`) and `PHOENIX_POSTGRES_AZURE_SCOPE` (string, default `https://ossrdbms-aad.database.windows.net/.default`). The scope override is only needed for sovereign clouds, e.g. Azure US Government.
- Validation lives at the layer that uses each input: mutual exclusion with AWS IAM at the `engines.py` dispatch point; `PASSWORD`-must-not-be-set and `USER`-must-be-set checks in `get_env_postgres_connection_str` alongside URL composition; the `azure-identity` import check in `azure_auth.py` next to the code that uses it.
- New optional `[azure]` extra pinning `azure-identity >= 1.18.0` and `aiohttp`. `1.18.0` is the first stable release where `ManagedIdentityCredential` is implemented on top of MSAL's `TokenCache` (source-verified).

AWS RDS IAM (modernized)
- `src/phoenix/db/iam_auth.py` → `src/phoenix/db/aws_auth.py` so the module name reflects the cloud rather than the auth mechanism. Tracked as a rename by git.
- Token generation switched from blocking `boto3` to `aioboto3` so `generate_db_auth_token` no longer blocks the event loop during connection-open. One `aioboto3.Session` is reused for the factory's lifetime; a fresh client context is opened per connection because `aioboto3` does not allow reusing the same async-context object.
- New public factory `create_aws_iam_token_connection_creator()` mirroring the Azure one (validation + closed-over async creator). `_generate_aws_rds_token` is now private.

Pool configuration
- `_POOL_RECYCLE_SECONDS = 3300` (module-level constant in `engines.py`) applied consistently as a hygiene cap across the Azure, AWS, and plain-password main engines. This is not a token-expiry mechanism — PostgreSQL validates the credential only during the startup handshake and never re-checks it mid-session, and both Microsoft and AWS explicitly document that existing connections are unaffected by later token expiry. See `internal_docs/specs/postgres-cloud-auth-pooling.md` for the full analysis with primary-source citations.
- `pool_pre_ping=True` on every main engine to catch stale connections (idle-timeout drops, failovers, TCP resets) at checkout rather than surfacing them as errors to callers.
- NullPool migration engines intentionally omit both knobs: `pool_pre_ping` is skipped on freshly-created connections (see `sqlalchemy/pool/base.py`), and NullPool creates a fresh record per checkout so `pool_recycle` has nothing to age out.
- Cloud-auth imports are hoisted inside their branches, so importing `engines.py` does not require `azure-identity` or `aioboto3` unless the corresponding feature is actually used.
- A dispatch-time mutual exclusion check raises `ValueError` when both `PHOENIX_POSTGRES_USE_AWS_IAM_AUTH` and `PHOENIX_POSTGRES_USE_AZURE_MANAGED_IDENTITY` are set to `true`, so the misconfiguration fails loudly instead of silently picking one branch.

Tests
- `tests/unit/db/test_azure_auth.py` — 6 parsimonious tests covering validation, `connect_args` forwarding, per-call token retrieval with password wiring, and scope-from-env integration.
- `tests/unit/db/test_aws_auth.py` — new 6-test suite mirroring the Azure structure, including an AWS-specific test that verifies URL defaults (`port=5432`, `database='postgres'`) kick in when the URL omits them.
- `tests/unit/test_config.py` — new `TestGetEnvPostgresAzureScope` + `TestGetEnvPostgresUseManagedIdentity`; MI-specific validation is exercised via the existing `TestPostgresConnectionStringManagedIdentity` class (mutual exclusion, password-conflict, host/user handling), which covers the layer that actually composes the connection string.

Documentation
- `internal_docs/specs/postgres-cloud-auth-pooling.md`: a 384-line reference covering PostgreSQL protocol semantics, Azure managed-identity flow, AWS RDS IAM flow, a side-by-side comparison of how the two clouds actually differ (bearer tokens vs HMAC signing keys; both have a metadata-service call at credential bootstrap), SQLAlchemy pool mechanics, common assumptions that don't hold, and the full security model. Every non-trivial claim cites a primary source; inferences are explicitly marked.
- New `docs/phoenix/self-hosting/configuration/using-azure-managed-identity.mdx` with env-var setup, a note on non-public-cloud scope overrides, and an explicit note that no token-lifetime tuning variable is needed.
- `docs/phoenix/self-hosting/configuration/using-amazon-aurora.mdx` updated to remove the stale `PHOENIX_POSTGRES_AWS_IAM_TOKEN_LIFETIME_SECONDS` example.
- `docs.json` + `docs/phoenix/llms.txt` wired for the new Azure guide.

Helm chart
- `database.postgres.awsIamTokenLifetimeSeconds` removed from `values.yaml`, the README table, and the configmap template. Custom values files that still carry the old key will have `helm upgrade` succeed without error; the key simply becomes unreferenced.
- `scripts/ci/test_helm.py` drops the custom-token-lifetime test case and the `token_lifetime` parameter from `DatabaseValidators.aws_iam_auth`.

Deprecations (backward compatible — no user action required)
- `PHOENIX_POSTGRES_AWS_IAM_TOKEN_LIFETIME_SECONDS` is silently ignored, with a one-line startup warning so users notice. It previously controlled `pool_recycle` on the IAM branch; the new code uses a hygiene cap instead. Users who had pinned this to a short value (e.g. `840s`) will see fewer connection rotations per hour — warmer pool, fewer TCP+TLS+PG startup handshakes, fewer SigV4 signing calls.
- Helm `database.postgres.awsIamTokenLifetimeSeconds` is removed. `helm upgrade` still succeeds if a custom values file still sets it.