fix(ext/node): implement dns Resolver cancel() and timeout support#33580
fix(ext/node): implement dns Resolver cancel() and timeout support#33580bartlomieju wants to merge 8 commits intomainfrom
Conversation
Implements `ChannelWrap.prototype.cancel()` which aborts all pending DNS queries with `ECANCELLED` error code. Also wires up the `timeout` option in the Resolver constructor to race DNS queries against a timer, returning `ETIMEOUT` on expiry. Passes timeout to the underlying hickory resolver so the DNS query completes promptly. Enables 4 more Node.js compat tests: - parallel/test-dns-cancel-reverse-lookup.js - parallel/test-dns-channel-cancel.js - parallel/test-dns-channel-cancel-promise.js - parallel/test-dns-channel-timeout.js
- Handle IPv4-mapped IPv6 addresses (e.g. ::ffff:1.2.3.4) in getHostByAddr by converting the dotted-quad tail to two 16-bit hex groups before building the ip6.arpa reverse name. - Replace dead code guard in assert_success_err(0) with debug_assert.
These tests involve DNS queries to non-responsive local UDP servers. After the JS-level timeout fires, the underlying hickory DNS resolver keeps the process alive until its internal timeout expires (~15s total). The Linux CI test runner has a 10s timeout, causing flaky failures. Disabled on Linux until the underlying op can be properly cancelled.
fibibot
left a comment
There was a problem hiding this comment.
The substantive logic looks right — cancel() walks #pendingQueries and dispatches ECANCELLED to each pending req.oncomplete, the per-query add/has-and-delete pattern around every queryX correctly suppresses late-arriving completions after cancel, and the Promise.race-against-setTimeout pattern surfaces ETIMEOUT when the resolver runs long. The Rust-side opts.timeout = Duration::from_millis(ms.max(1)) plumbing into hickory is the right place to actually bound the underlying query. Glad to see the assert_success_err debug_assert! cleanup made it in here too — exactly the simplification I suggested on #33579.
CI is red on lint though (3 platforms, format-only — the opts.timeout = line in ext/net/ops.rs needs to wrap onto a single line). Per my own gate I shouldn't APPROVE while CI is red — going COMMENT until that lands.
Three substantive notes inline (none blocking the format fix; the format itself is the only CI blocker).
| opts.edns0 = true; | ||
| } | ||
| if let Some(ms) = timeout_ms { | ||
| opts.timeout = |
There was a problem hiding this comment.
Format blocker — rustfmt wants this on one line (current diff has opts.timeout = and Duration::from_millis(...) on separate lines):
opts.timeout = std::time::Duration::from_millis(ms.max(1));Fails on all three lint platforms: https://github.com/denoland/deno/actions/runs/25009734436/job/73242045807. ./tools/format.js will auto-fix.
Separate substantive note while we're here: forcing opts.attempts = 1 whenever timeout_ms is set diverges from Node's Resolver({ timeout, tries }) contract — Node treats tries and timeout as independent (timeout per attempt, multiplied by tries). With this PR, any caller passing both gets a single attempt regardless of tries. The four target tests don't exercise the tries > 1 + timeout combo so they pass, but a faithful Node port would respect both. Worth a TODO comment here at minimum.
| dnsPromise = Promise.race([ | ||
| dnsPromise, | ||
| new Promise<never>((_resolve, reject) => | ||
| setTimeout(() => reject("ETIMEOUT"), this.#timeout) |
There was a problem hiding this comment.
Same setTimeout cleanup concern from #33571 — the timer isn't clearTimeout'd when the DNS query resolves first. The race resolves with the resolve-side, but the timer keeps the event loop alive until it fires. For a long-lived process (test runner, cluster worker, server) this just spins extra timers; for short-lived CLI scripts it can extend process exit by this.#timeout ms after work is done.
Fix: capture the timer id and clear it from a .finally on dnsPromise:
let timerId: number | undefined;
if (this.#timeout >= 0) {
dnsPromise = Promise.race([
dnsPromise,
new Promise<never>((_resolve, reject) => {
timerId = setTimeout(() => reject("ETIMEOUT"), this.#timeout);
}),
]);
}
// ... after await:
if (timerId !== undefined) clearTimeout(timerId);Not blocking — and rejecting with the bare string "ETIMEOUT" (rather than new Error("ETIMEOUT")) is fine since the catch checks e === "ETIMEOUT" by identity, just unusual. A Symbol-based sentinel would be slightly cleaner but the string works.
| #[derive(FromV8)] | ||
| pub struct ResolveDnsOption { | ||
| name_server: Option<NameServer>, | ||
| timeout_ms: Option<u64>, |
There was a problem hiding this comment.
Adding timeout_ms: Option<u64> to the ResolveDnsOption FromV8 struct also extends the option set that Deno.resolveDns(query, type, options) accepts — the same struct is used for the public API path. The TypeScript declaration for Deno.ResolveDnsOptions in cli/tsc/dts/lib.deno.ns.d.ts:4978 is not updated, so the field is callable but invisible to TS users (Deno.resolveDns("x", "A", { timeout_ms: 1000 } as any) would now silently work).
Not blocking — and arguably useful — but worth deciding deliberately: either add the field (in camelCase: timeout) to the public Deno.ResolveDnsOptions interface and document it, or rename the Rust field to something non-public-looking and route the polyfill through a separate dedicated op. As written, it's a soft public-API extension that future-you might forget about.
Replace Promise.race timeout approach with deno_core cancel handles. When the timeout fires, the cancel handle is closed which aborts the underlying hickory DNS resolver future immediately, allowing the process to exit promptly instead of waiting ~15s for hickory's internal timeout. Remove the timeout_ms pass-through to hickory resolver opts since cancellation handles this properly now. Re-enable all 4 DNS cancel/timeout tests on Linux.
fibibot
left a comment
There was a problem hiding this comment.
The cancel-handle approach in e4a548fd is much cleaner than my clearTimeout suggestion — using core.createCancelHandle() + core.tryClose(cancelRid) actually aborts the underlying op rather than just dropping the result, and the Deno.errors.Interrupted → "ETIMEOUT" mapping in the catch is the right shape. Also nice: dropped the timeout_ms field from the public ResolveDnsOption struct entirely (addressed my soft-API concern), and the EXPECTED_VIOLATIONS bump to 3 accounts for the new core.tryClose calls.
But CI went red — test node_compat debug linux-aarch64 is failing on test-dns-cancel-reverse-lookup.js (flaky on runs 0+1, then 10002ms timeout, then fail). Detail inline.
Also worth checking: the linux: false skip was removed for all four enrolled tests, but the cancel tests don't exercise the timeout cancel-handle path — they call resolver.cancel() (no timeout set), which means cancelRid is undefined and the underlying op runs to hickory's natural timeout. The user-facing callback fires ECANCELLED via the cancel() loop, but the still-running op keeps the test's event loop alive. That's likely what's producing the 10s hang.
| cancel() { | ||
| notImplemented("cares.ChannelWrap.prototype.cancel"); | ||
| for (const req of this.#pendingQueries) { | ||
| req.oncomplete("ECANCELLED", []); |
There was a problem hiding this comment.
cancel() only walks #pendingQueries and fires req.oncomplete("ECANCELLED", []) synchronously — it doesn't actually abort the underlying op_dns_resolve. For the timeout path the cancel handle does abort the op, but here (no timeout set) the op stays alive until hickory's own timeout (~5s default × attempts) fires.
Failure pattern in CI on linux-aarch64 (https://github.com/denoland/deno/actions/runs/25011070179/job/73247390033):
[Warning] node_compat::parallel::test-dns-cancel-reverse-lookup.js was flaky on run 0
[Warning] ... was flaky on run 1
test test-dns-cancel-reverse-lookup.js has been running for more than 60 seconds
test test-dns-cancel-reverse-lookup.js ... fail (10002ms)
The user-facing callback fires ECANCELLED immediately, but the orphaned op_dns_resolve keeps the test's event loop alive until hickory times out. The 10002ms pretty clearly aligns with hickory's default 5s × 2 attempts.
Fix: extend the cancel-handle pattern to all queries, not just the timeout path. Roughly — give every #query call a cancelRid up front, store it on the req (or in a parallel map), and have cancel() core.tryClose each one in addition to (or instead of) firing oncomplete. Something like:
#cancelRids: Map<QueryReqWrap, number> = new Map();
// in #query, always create a cancelRid:
const cancelRid = core.createCancelHandle();
this.#cancelRids.set(req, cancelRid); // populated by caller
// cancel():
for (const [req, rid] of this.#cancelRids) {
core.tryClose(rid);
req.oncomplete("ECANCELLED", []);
}
this.#cancelRids.clear();
this.#pendingQueries.clear();The op should then return Deno.errors.Interrupted and the existing catch in #query will short-circuit, so the test's event loop can drain.
(There may be additional issues if req.oncomplete("ECANCELLED", []) doesn't get mapped through dnsException to set .code/.syscall/.hostname correctly in the test's assertion — worth verifying that string code passes through the existing dispatch, since dnsException is typically built around integer error codes.)
| "parallel/test-dns-cancel-reverse-lookup.js": {}, | ||
| "parallel/test-dns-channel-cancel.js": {}, | ||
| "parallel/test-dns-channel-cancel-promise.js": {}, | ||
| "parallel/test-dns-channel-timeout.js": {}, |
There was a problem hiding this comment.
Removing linux: false from these four entries is the right end-state, but at least test-dns-cancel-reverse-lookup.js is still failing on linux-aarch64 in this commit (10s hang → fail per the inline above). Either restore the platform skip for that one entry temporarily, or block this PR on the underlying fix. The other three (channel-cancel, channel-cancel-promise, channel-timeout) appear to pass — channel-timeout shows up green in the test log (test-dns-channel-timeout.js ... ok (141ms)).
lunadogbot
left a comment
There was a problem hiding this comment.
LGTM -- read the cancel/timeout flow end-to-end:
#pendingQueries: Set<QueryReqWrap>is added/checked-then-deleted at everyquery*entrypoint and inside every.then(...). Thecancel()iterates and callsreq.oncomplete('ECANCELLED', []), thenclear()s. The double-fire race is handled correctly: a late-resolving promise hitsif (!this.#pendingQueries.has(req)) returnafter cancel cleared the set, so the original oncomplete is skipped.- The timeout path sets up a
core.createCancelHandle()and asetTimeoutthat callscore.tryClose(cancelRid). The op throwsDeno.errors.Interrupted, the catch maps it tocode = 'ETIMEOUT'. Thefinallyclears the timer and closes the rid in either order (timer-fired-first vs op-completed-first), and thecancelRid !== undefinedguard prevents a double-close. - The new
codetype isstring | number. Verified againstdnsExceptionininternal/errors.ts:376-- it explicitly handles both: iftypeof code === 'number'it doesgetSystemErrorName, otherwise the string code passes through toex.code. So'ECANCELLED'and'ETIMEOUT'produce the right shape on the user-facing error. - The IPv4-mapped IPv6 expansion at line 678+ for reverse-lookup correctly turns
::ffff:1.2.3.4into the two 16-bit hex groups.
One tiny note: in ext/node/ops/dns.rs, swapping if code == 0 { return DnsError::Io(...) } for debug_assert!(code != 0, ...) is a strict improvement in debug builds, but in release builds the unreachable case becomes a silent fall-through to whatever match code returns next (probably an arm that doesn't match -> generic UNKNOWN). Worth either a comment confirming the caller really can't pass 0 (the function name assert_success_err suggests that's the contract), or unreachable!() instead of debug_assert!() so release builds at least panic with a clear message rather than producing a misleading mapping. Not blocking.
- Track cancel handles (cancelRids) for each DNS query so cancel() can abort in-flight op_dns_resolve operations via core.tryClose - Use a short hickory resolver timeout (1s, 1 attempt) when a cancel handle is present, so background connection tasks clean up quickly after cancellation instead of hanging for the default 5s×2 attempts
Summary
ChannelWrap.prototype.cancel()which aborts all pending DNS queries withECANCELLEDerror code.timeoutoption in the Resolver constructor to race DNS queries against a timer, returningETIMEOUTon expiry.parallel/test-dns-cancel-reverse-lookup.js,parallel/test-dns-channel-cancel.js,parallel/test-dns-channel-cancel-promise.js,parallel/test-dns-channel-timeout.jsTest plan
./x test-compat test-dns-channel— all 3 pass./x test-compat test-dns-cancel-reverse-lookuppasses