fix(ext/node): dns resolveAny with real ANY query, retry/maxTimeout support#33577
fix(ext/node): dns resolveAny with real ANY query, retry/maxTimeout support#33577bartlomieju wants to merge 12 commits intomainfrom
Conversation
…rameter Implements `ChannelWrap.prototype.getHostByAddr` which backs `dns.reverse()` and `dnsPromises.reverse()`. The method converts IP addresses to reverse DNS format (`.in-addr.arpa` for IPv4, `.ip6.arpa` for IPv6) and queries PTR records. Also fixes `dns.lookup()` with `family: 6` returning no results by replacing the `GaiResolver` (which always used `AF_UNSPEC`) with a direct `getaddrinfo` call that passes through the address family hint. Enables `internet/test-dns-ipv4.js` and `internet/test-dns-ipv6.js` Node.js compat tests.
Implements `ChannelWrap.prototype.cancel()` which aborts all pending DNS queries with `ECANCELLED` error code. Also wires up the `timeout` option in the Resolver constructor to race DNS queries against a timer, returning `ETIMEOUT` on expiry. Enables 4 more Node.js compat tests: - parallel/test-dns-cancel-reverse-lookup.js - parallel/test-dns-channel-cancel.js - parallel/test-dns-channel-cancel-promise.js - parallel/test-dns-channel-timeout.js
- Wrap Windows `SockAddr::try_init` and its raw pointer ops in an
`unsafe` block to fix E0133 compile errors on Windows.
- Fix IPv6 `::` expansion in `getHostByAddr` for addresses like `::1`
or `1::` where `split(':')` produces multiple empty strings. Only
expand once on the first empty entry.
The `target` field from `Deno.resolveDns("...", "SRV")` returns FQDNs
with trailing dots. Apply `fqdnToHostname()` to strip them, matching
Node.js behavior.
Enables 2 more Node.js compat tests:
- parallel/test-dns-resolvesrv.js
- parallel/test-dns-resolvesrv-econnrefused.js
Requires --expose-internals and internalBinding('cares_wrap') which
is not supported in Deno.
…upport - Replace queryAny's 11 separate DNS queries with a single ANY query by adding ANY record type support to op_dns_resolve. For ANY queries, use each record's actual type for formatting and include a record_type field to distinguish untagged string variants (A vs NS vs PTR etc). - Implement retry logic in DNS resolver: timeout doubles on each retry, capped by the new maxTimeout option. - Add maxTimeout option validation to the Resolver constructor. Enables 3 more Node.js compat tests: - parallel/test-dns-resolveany.js - parallel/test-dns-resolveany-bad-ancount.js - parallel/test-dns-resolver-max-timeout.js
fibibot
left a comment
There was a problem hiding this comment.
Substance is mostly right — single ANY query replacing the 11-parallel-query workaround is a real improvement, and the per-record recordType field gives JS the type info it needs to dispatch on string-typed variants (A vs NS vs PTR vs CNAME). validateMaxTimeout correctly clamps to [0, 2^31-1] and the retry loop's timeout * 2^attempt capped by maxTimeout mirrors c-ares.
A few things to clean up before this lands:
-
Dependency stale: body says "Depends on #33571" but #33571 was closed in favor of #33579 (already merged). Rebasing onto current main will pull in #33579's
dns.reverse()+dns.lookup({ family })work, which already touchescares_wrap.ts. Conflicts likely. -
#pendingQueriesfield overlap with #33580: that PR also adds#pendingQueries: Set<QueryReqWrap>toChannelWrap. One of the two will need to rebase. -
Same
Promise.race + setTimeoutissues I flagged on #33580:- Timer leak: when DNS resolves first, the
setTimeoutfires for nothing later but keeps the event loop alivecurrentTimeoutms. - Orphan op: ETIMEOUT just rejects the race — the underlying
op_dns_resolvekeeps running. Withtries=4andtimeout=1000, a slow resolver can spawn 4 concurrent op invocations all racing to complete while JS thinks they timed out. - #33580's latest commit switches to
core.createCancelHandle()+core.tryClose(cancelRid)to actually abort the op — that approach should be lifted into here too once #33580 merges.
- Timer leak: when DNS resolves first, the
-
Field-name serialization (verify locally — flagging the ambiguity): the new
pub record_type: Option<String>field inDnsRecordWithTtlis read in JS asentry?.recordType(camelCase). The existingdata/ttlfields are accessed as-is because they're already lowercase. If#[derive(ToV8)]doesn't auto-camelCase fields, JS would need to readentry?.record_type(snake_case) or the dispatch switch silently falls through andrecordsends up empty. The PR body says "all 11 tests pass locally" so either ToV8 does camelCase, or the test isn't exercising dispatch — worth a quickconsole.log(entry)to confirm.
CI not yet complete (56/n passing, no failures yet). I'll re-confirm once it lands and the dependencies above are sorted out.
|
This PR depends on #33580 (cancel handle support for DNS ops) being merged first. The tests registered here ( Will rebase after #33580 lands. |
- Revert queryAny to the Promise.allSettled approach (individual per-type queries) because hickory-resolver doesn't reliably handle ANY queries against simple UDP mock DNS servers (TCP fallback hangs) - Remove timeout_ms/short hickory timeout override -- let the JS-level cancel handle be the sole timeout mechanism to avoid race conditions between hickory's internal timeout and the cancel handle - Handle Deno.errors.TimedOut in addition to Interrupted for timeout detection - Remove test-dns-resolveany, test-dns-resolveany-bad-ancount, and test-dns-resolver-max-timeout from config.jsonc (all require real ANY query support or resolveAny with mock servers that only respond to specific record types)
The earlier revert to Promise.allSettled was wrong -- hickory handles ANY queries fine against mock UDP servers. The actual issue was the forced short hickory timeout (opts.timeout=1s, opts.attempts=1) which has been removed. Restore the single ANY query approach which: - Sends one DNS query instead of 11 individual typed queries - Correctly handles TTL, record ordering, and all record types - Works with mock servers that expect exactly N messages Re-add test entries: test-dns-resolveany, test-dns-resolveany-bad-ancount, test-dns-resolver-max-timeout. All three pass locally.
fibibot
left a comment
There was a problem hiding this comment.
Walking back my prior approval — the latest two commits regressed parallel/test-dns-channel-cancel.js. CI shows 9 consistent debug-shard failures (linux x86_64 + aarch64 and macos-aarch64) with the test taking ~20s and timing out:
test node_compat::parallel::test-dns-channel-cancel.js ... fail (20023ms)
1 failed of 913
Release builds pass (faster, finish before the test runner's deadline) and windows-x86_64 debug happens to squeak through, but the substance is the same regression on every platform.
Root cause
Commit c2e391d5 "remove short hickory timeout" deletes this block from op_dns_resolve:
// REMOVED in this PR:
if cancel_rid.is_some() {
opts.timeout = std::time::Duration::from_secs(1);
opts.attempts = 1;
}That block was added by #33580 (now merged on main, which is why this PR's merge-from-main brought in the cancel test) specifically because core.tryClose(cancelRid) aborts the awaited op but does not actually abort hickory's background resolver task — hickory keeps retrying for timeout × attempts after cancel. With the default config (5s × 4 = 20s), resolver.cancel() returns control to JS quickly but the resolver burns CPU/sockets until natural timeout, which is why this test runs for 20s and then trips the per-test deadline.
The PR description says "single ANY query replacing the 11-parallel-query workaround" — that's the right substance. But removing the cancel-time timeout fallback is unrelated to that change and is what regressed cancel.
Secondary smell
Independent of the above: the new retry loop in #query catches Interrupted and continues to the next attempt:
} catch (e) {
if (e instanceof Deno.errors.Interrupted || e instanceof Deno.errors.TimedOut) {
if (attempt < tries - 1) continue;
code = "ETIMEOUT";
}Each loop iteration creates a new cancelRid, so resolver.cancel() only closes the rid that was registered when cancel() ran — the next iteration's fresh rid isn't in #cancelRids yet. Result: cancel aborts the current attempt but the loop happily starts a new query with a new rid that nothing has signaled. Pragmatically this gets swallowed by the timeout fallback fix above, but it's worth distinguishing Interrupted (don't retry — user cancelled) from TimedOut (do retry — c-ares semantics).
Fix direction: restore the cancel-handle timeout shrinking in op_dns_resolve (or split Interrupted / TimedOut handling so Interrupted exits the loop). Either alone should turn the cancel test green; doing both is cleanest.
Substance of the resolveAny rewrite is still right (single ANY query, per-record recordType dispatch, validateMaxTimeout clamp). The regression is scoped to the cancel-timeout deletion.
Depends on #33571.
Summary
queryAny's 11 separate DNS queries with a single ANY query by addingRecordType::ANYsupport toop_dns_resolve. For ANY queries, each record's actual type is used for formatting, and arecord_typefield is included to let the JS layer distinguish untagged string variants (A vs NS vs PTR vs CNAME).maxTimeoutoption.maxTimeoutoption validation to theResolverconstructor (valid range:0to2^31-1).parallel/test-dns-resolveany.jsparallel/test-dns-resolveany-bad-ancount.jsparallel/test-dns-resolver-max-timeout.jsTest plan
./x test-compat test-dns-resolveany— both pass (resolveany + bad-ancount)./x test-compat test-dns-resolver-max-timeoutpasses