Skip to content

fix(ext/node): dns resolveAny with real ANY query, retry/maxTimeout support#33577

Open
bartlomieju wants to merge 12 commits intomainfrom
fix/dns-resolveany-and-max-timeout
Open

fix(ext/node): dns resolveAny with real ANY query, retry/maxTimeout support#33577
bartlomieju wants to merge 12 commits intomainfrom
fix/dns-resolveany-and-max-timeout

Conversation

@bartlomieju
Copy link
Copy Markdown
Member

Depends on #33571.

Summary

  • Replaces queryAny's 11 separate DNS queries with a single ANY query by adding RecordType::ANY support to op_dns_resolve. For ANY queries, each record's actual type is used for formatting, and a record_type field is included to let the JS layer distinguish untagged string variants (A vs NS vs PTR vs CNAME).
  • Implements retry logic in the DNS resolver: timeout doubles on each retry attempt, capped by the new maxTimeout option.
  • Adds maxTimeout option validation to the Resolver constructor (valid range: 0 to 2^31-1).
  • Enables 3 more Node.js compat tests:
    • parallel/test-dns-resolveany.js
    • parallel/test-dns-resolveany-bad-ancount.js
    • parallel/test-dns-resolver-max-timeout.js

Test plan

…rameter

Implements `ChannelWrap.prototype.getHostByAddr` which backs `dns.reverse()`
and `dnsPromises.reverse()`. The method converts IP addresses to reverse DNS
format (`.in-addr.arpa` for IPv4, `.ip6.arpa` for IPv6) and queries PTR
records.

Also fixes `dns.lookup()` with `family: 6` returning no results by replacing
the `GaiResolver` (which always used `AF_UNSPEC`) with a direct `getaddrinfo`
call that passes through the address family hint.

Enables `internet/test-dns-ipv4.js` and `internet/test-dns-ipv6.js` Node.js
compat tests.
Implements `ChannelWrap.prototype.cancel()` which aborts all pending DNS
queries with `ECANCELLED` error code. Also wires up the `timeout` option
in the Resolver constructor to race DNS queries against a timer, returning
`ETIMEOUT` on expiry.

Enables 4 more Node.js compat tests:
- parallel/test-dns-cancel-reverse-lookup.js
- parallel/test-dns-channel-cancel.js
- parallel/test-dns-channel-cancel-promise.js
- parallel/test-dns-channel-timeout.js
- Wrap Windows `SockAddr::try_init` and its raw pointer ops in an
  `unsafe` block to fix E0133 compile errors on Windows.
- Fix IPv6 `::` expansion in `getHostByAddr` for addresses like `::1`
  or `1::` where `split(':')` produces multiple empty strings. Only
  expand once on the first empty entry.
The `target` field from `Deno.resolveDns("...", "SRV")` returns FQDNs
with trailing dots. Apply `fqdnToHostname()` to strip them, matching
Node.js behavior.

Enables 2 more Node.js compat tests:
- parallel/test-dns-resolvesrv.js
- parallel/test-dns-resolvesrv-econnrefused.js
Requires --expose-internals and internalBinding('cares_wrap') which
is not supported in Deno.
…upport

- Replace queryAny's 11 separate DNS queries with a single ANY query by
  adding ANY record type support to op_dns_resolve. For ANY queries, use
  each record's actual type for formatting and include a record_type field
  to distinguish untagged string variants (A vs NS vs PTR etc).
- Implement retry logic in DNS resolver: timeout doubles on each retry,
  capped by the new maxTimeout option.
- Add maxTimeout option validation to the Resolver constructor.

Enables 3 more Node.js compat tests:
- parallel/test-dns-resolveany.js
- parallel/test-dns-resolveany-bad-ancount.js
- parallel/test-dns-resolver-max-timeout.js
Copy link
Copy Markdown
Contributor

@fibibot fibibot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Substance is mostly right — single ANY query replacing the 11-parallel-query workaround is a real improvement, and the per-record recordType field gives JS the type info it needs to dispatch on string-typed variants (A vs NS vs PTR vs CNAME). validateMaxTimeout correctly clamps to [0, 2^31-1] and the retry loop's timeout * 2^attempt capped by maxTimeout mirrors c-ares.

A few things to clean up before this lands:

  1. Dependency stale: body says "Depends on #33571" but #33571 was closed in favor of #33579 (already merged). Rebasing onto current main will pull in #33579's dns.reverse() + dns.lookup({ family }) work, which already touches cares_wrap.ts. Conflicts likely.

  2. #pendingQueries field overlap with #33580: that PR also adds #pendingQueries: Set<QueryReqWrap> to ChannelWrap. One of the two will need to rebase.

  3. Same Promise.race + setTimeout issues I flagged on #33580:

    • Timer leak: when DNS resolves first, the setTimeout fires for nothing later but keeps the event loop alive currentTimeout ms.
    • Orphan op: ETIMEOUT just rejects the race — the underlying op_dns_resolve keeps running. With tries=4 and timeout=1000, a slow resolver can spawn 4 concurrent op invocations all racing to complete while JS thinks they timed out.
    • #33580's latest commit switches to core.createCancelHandle() + core.tryClose(cancelRid) to actually abort the op — that approach should be lifted into here too once #33580 merges.
  4. Field-name serialization (verify locally — flagging the ambiguity): the new pub record_type: Option<String> field in DnsRecordWithTtl is read in JS as entry?.recordType (camelCase). The existing data/ttl fields are accessed as-is because they're already lowercase. If #[derive(ToV8)] doesn't auto-camelCase fields, JS would need to read entry?.record_type (snake_case) or the dispatch switch silently falls through and records ends up empty. The PR body says "all 11 tests pass locally" so either ToV8 does camelCase, or the test isn't exercising dispatch — worth a quick console.log(entry) to confirm.

CI not yet complete (56/n passing, no failures yet). I'll re-confirm once it lands and the dependencies above are sorted out.

@bartlomieju
Copy link
Copy Markdown
Member Author

This PR depends on #33580 (cancel handle support for DNS ops) being merged first. The tests registered here (test-dns-resolver-max-timeout.js, test-dns-resolveany-bad-ancount.js, test-dns-channel-cancel.js) time out on CI because cancel() only fires JS callbacks without aborting the underlying op_dns_resolve -- the cancel handle implementation in #33580 is needed to properly abort in-flight DNS operations so the process can exit.

Will rebase after #33580 lands.

- Revert queryAny to the Promise.allSettled approach (individual per-type
  queries) because hickory-resolver doesn't reliably handle ANY queries
  against simple UDP mock DNS servers (TCP fallback hangs)
- Remove timeout_ms/short hickory timeout override -- let the JS-level
  cancel handle be the sole timeout mechanism to avoid race conditions
  between hickory's internal timeout and the cancel handle
- Handle Deno.errors.TimedOut in addition to Interrupted for timeout
  detection
- Remove test-dns-resolveany, test-dns-resolveany-bad-ancount, and
  test-dns-resolver-max-timeout from config.jsonc (all require real ANY
  query support or resolveAny with mock servers that only respond to
  specific record types)
The earlier revert to Promise.allSettled was wrong -- hickory handles
ANY queries fine against mock UDP servers. The actual issue was the
forced short hickory timeout (opts.timeout=1s, opts.attempts=1) which
has been removed.

Restore the single ANY query approach which:
- Sends one DNS query instead of 11 individual typed queries
- Correctly handles TTL, record ordering, and all record types
- Works with mock servers that expect exactly N messages

Re-add test entries: test-dns-resolveany, test-dns-resolveany-bad-ancount,
test-dns-resolver-max-timeout. All three pass locally.
Copy link
Copy Markdown
Contributor

@fibibot fibibot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walking back my prior approval — the latest two commits regressed parallel/test-dns-channel-cancel.js. CI shows 9 consistent debug-shard failures (linux x86_64 + aarch64 and macos-aarch64) with the test taking ~20s and timing out:

test node_compat::parallel::test-dns-channel-cancel.js ... fail (20023ms)
1 failed of 913

Release builds pass (faster, finish before the test runner's deadline) and windows-x86_64 debug happens to squeak through, but the substance is the same regression on every platform.

Root cause

Commit c2e391d5 "remove short hickory timeout" deletes this block from op_dns_resolve:

// REMOVED in this PR:
if cancel_rid.is_some() {
  opts.timeout = std::time::Duration::from_secs(1);
  opts.attempts = 1;
}

That block was added by #33580 (now merged on main, which is why this PR's merge-from-main brought in the cancel test) specifically because core.tryClose(cancelRid) aborts the awaited op but does not actually abort hickory's background resolver task — hickory keeps retrying for timeout × attempts after cancel. With the default config (5s × 4 = 20s), resolver.cancel() returns control to JS quickly but the resolver burns CPU/sockets until natural timeout, which is why this test runs for 20s and then trips the per-test deadline.

The PR description says "single ANY query replacing the 11-parallel-query workaround" — that's the right substance. But removing the cancel-time timeout fallback is unrelated to that change and is what regressed cancel.

Secondary smell

Independent of the above: the new retry loop in #query catches Interrupted and continues to the next attempt:

} catch (e) {
  if (e instanceof Deno.errors.Interrupted || e instanceof Deno.errors.TimedOut) {
    if (attempt < tries - 1) continue;
    code = "ETIMEOUT";
  }

Each loop iteration creates a new cancelRid, so resolver.cancel() only closes the rid that was registered when cancel() ran — the next iteration's fresh rid isn't in #cancelRids yet. Result: cancel aborts the current attempt but the loop happily starts a new query with a new rid that nothing has signaled. Pragmatically this gets swallowed by the timeout fallback fix above, but it's worth distinguishing Interrupted (don't retry — user cancelled) from TimedOut (do retry — c-ares semantics).

Fix direction: restore the cancel-handle timeout shrinking in op_dns_resolve (or split Interrupted / TimedOut handling so Interrupted exits the loop). Either alone should turn the cancel test green; doing both is cleanest.

Substance of the resolveAny rewrite is still right (single ANY query, per-record recordType dispatch, validateMaxTimeout clamp). The regression is scoped to the cancel-timeout deletion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants