Add support for esp32 network driver scanning for access points #1165

Open
UncleGrumpy wants to merge 5 commits into atomvm:release-0.7 from UncleGrumpy:esp32_wifi_scan


@UncleGrumpy
Collaborator

@UncleGrumpy UncleGrumpy commented May 24, 2024

Add network:wifi_scan/0,1 to esp32 network driver, giving the ability for devices configured for "station mode" (or with "station + access point mode") to scan for available access points.

note: these changes depend on PR #1181

Closes #2024

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

@UncleGrumpy UncleGrumpy changed the base branch from main to release-0.6 May 24, 2024 00:58
@UncleGrumpy UncleGrumpy force-pushed the esp32_wifi_scan branch 2 times, most recently from 58cf1f8 to cdadc33 Compare June 6, 2024 01:14
@petermm
Contributor

petermm commented Jun 11, 2024

I struggled here:

I started out with an empty example and tried:

    scan = :network.wifi_scan()
    IO.inspect(scan)

This crashes with: {noproc,{gen_server,call,[network,get_config]}} - looks like there is a code path for handling the error but it's never reached

Then I read docs, about having to start the network, and tried:

    config = [
      {:sta,
       [
         {:ssid, "Wokwi-GUEST"},
         {:psk, ""},
         {:connected,
          fn ->
            IO.inspect("network CONNECTED")
            :ok
          end}
       ]}
    ]

    case :network.start(config) do
      {:ok, _pid} ->
        IO.inspect("network started")
      error ->
        :io.put_chars("\nAn error occurred starting network:")
        :erlang.display(error)
    end

    scan = :network.wifi_scan()
    IO.inspect(scan)

which crashed with: wifi:sta_scan: STA is connecting, scan are not allowed!

delaying things and scanning after connection is done made it further to a crash:

CRASH 
======
pid: <0.1.0>

Stacktrace:
undefined

cp: #CP<module: 11, label: 31, offset: 26>

x[0]: error
x[1]: badarg
x[2]: error

Stack 
------

#CP<module: 11, label: 32, offset: 26>
<<"0">>
#CP<module: 11, label: 24, offset: 20>
#CP<module: 10, label: 11, offset: 11>
{0,[{none,[{rssi,undefined},{authmode,undefined},{channel,undefined}]}]}

It also seems that trying to connect to a network that is not available disallows scanning, with: wifi:sta_scan: STA is connecting, scan are not allowed!

Would be great seeing an example that could be used in CI eg:

  1. scan with no network configured
  2. connect known good network - scan
  3. config known unavailable network - scan
  4. probably some ap stuff
    etc.

I'll work on getting wokwi CI going so this can be covered by CI.

@UncleGrumpy
Collaborator Author

UncleGrumpy commented Jun 11, 2024

I did not add an example yet, my plan was to write a demonstration of “roaming” for atomvm_examples. It is a little difficult to use until #1181 is merged, that will allow starting WiFi without immediately starting a connection. Scans will always fail when a connection is in progress, so your application will need to start a scan after obtaining an IP address, or before any connection is started (which is not possible until after #1181).
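The rule above (scan before any connection is started, or after an IP address is obtained, never while connecting) can be sketched as a small gate. This is only an illustration: the `sta_state_t` enum and `scan_allowed()` are hypothetical names, not part of the driver or ESP-IDF.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical connection states; the real driver tracks this inside ESP-IDF. */
typedef enum {
    STA_IDLE,       /* driver started, no connection attempted (needs #1181) */
    STA_CONNECTING, /* association or DHCP still in progress */
    STA_GOT_IP      /* connected and an IP address has been obtained */
} sta_state_t;

/* A scan is only worth attempting when no connection is in progress. */
static bool scan_allowed(sta_state_t state)
{
    return state != STA_CONNECTING;
}

/* Usage: the three cases described in the comment above. */
static void scan_allowed_demo(void)
{
    assert(scan_allowed(STA_IDLE));
    assert(!scan_allowed(STA_CONNECTING));
    assert(scan_allowed(STA_GOT_IP));
}
```

An application could check such a gate before calling the scan function instead of waiting for the "STA is connecting" error.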

@UncleGrumpy
Collaborator Author

I wish your second example had a stacktrace! You can see the scan result did come back with no networks found… this can happen with default scans (the dwell time is very short by default - and can easily miss networks, in my experience). I think it’s just a bad match in your test that is crashing there.

@petermm
Contributor

petermm commented Jun 11, 2024

I wish your second example had a stacktrace! You can see the scan result did come back with no networks found… this can happen with default scans (the dwell time is very short by default - and can easily miss networks, in my experience). I think it’s just a bad match in your test that is crashing there.

ohh the embarrassment lol - it's the IO.inspect call that crashes, will look into it - used to throwing anything at IO.inspect..

Makes sense with the other stuff, suppose some kind of WifiManager GenServer will materialize, while this PR is on driver primitives level..

I'll test #1181 and help land that first..

@UncleGrumpy
Collaborator Author

My plan precisely! The PRs I have already submitted will provide all the necessary low level functionality, but I would like to create a higher level network_manager module that can simplify configuration and orchestration.

@UncleGrumpy
Collaborator Author

As far as “dwell” time for the scan, I have found between 300-500ms will find all of the available networks, with the default (120ms) I do notice networks being frequently missed.

@UncleGrumpy
Collaborator Author

it's the IO.inspect call that crashes, will look into it - used to throwing anything at IO.inspect..

That does sound like a bug. From my limited understanding of Elixir you should be able to give it just about anything, much like erlang:display/1.

@UncleGrumpy
Collaborator Author

{0,[{none,[{rssi,undefined},{authmode,undefined},{channel,undefined}]}]}

You did point me to a problem here. I started testing more boards and realized that even with maximum dwell time no networks are being found. I cherry-picked these commits from a different local branch and must have missed something.

@UncleGrumpy UncleGrumpy marked this pull request as draft June 13, 2024 15:10
@UncleGrumpy UncleGrumpy force-pushed the esp32_wifi_scan branch 2 times, most recently from 480b056 to 50a78c7 Compare August 19, 2024 19:18
@UncleGrumpy
Collaborator Author

When I split up the working branch I was testing new network features on I mistakenly submitted this PR first, but in order to use the scan function PR #1181 needs to be merged first, and this will need to be rebased.

@UncleGrumpy
Collaborator Author

UncleGrumpy commented Aug 19, 2024

{0,[{none,[{rssi,undefined},{authmode,undefined},{channel,undefined}]}]}

I did find a bug that would cause the results to be empty if only a single network was found. During most of my testing I had multiple networks, so this didn't get caught. @petermm, thanks for testing and helping me correct this!

@UncleGrumpy UncleGrumpy marked this pull request as ready for review August 19, 2024 19:43
UncleGrumpy added a commit to UncleGrumpy/AtomVM that referenced this pull request Nov 20, 2024
For more fine grained connection management in applications the driver can now
be started, without performing an initial connection, by the use of the key
`managed` in the STA configuration.

Adds network:sta_connect/0,1 to allow connecting to an access point after the
driver has been started in STA or STA+AP mode. If the function is used without
parameters a connection to the last configured access point will be started.

Adds network:sta_disconnect/0 to disconnect a station from an access point.

The station mode disconnected callback now maintains the default behavior of
reconnecting to the last access point if the connection is lost, but if the
user defines a custom callback the automatic re-connection will not happen,
allowing for users to take advantage of scan results or some other means to
determine when and which access point to associate with.

The combination of the use of a disconnected callback and `managed` mode allow
for the use of `network:wifi_scan/0,1` (PR atomvm#1165), since the wifi must not be
connected to a station when performing a scan and the current implementation
always starts a connection immediately and always reconnects when disconnected.

Signed-off-by: Winford <winford@object.stream>
@UncleGrumpy UncleGrumpy changed the base branch from release-0.6 to main November 20, 2024 07:05
UncleGrumpy added a commit to UncleGrumpy/AtomVM that referenced this pull request Nov 21, 2024
@UncleGrumpy UncleGrumpy marked this pull request as draft November 21, 2024 21:38
UncleGrumpy added a commit to UncleGrumpy/AtomVM that referenced this pull request Dec 22, 2024
@UncleGrumpy UncleGrumpy requested review from bettio and removed request for bettio March 2, 2026 09:05
@petermm
Contributor

petermm commented Mar 5, 2026

Looking good! - did a full sim test, and we fail on all esp32s2, no crash just error - so will investigate that later on real hw - probably just low mem, which code then seemed to handle, eg. it didn't crash (think solution is to lower results or something, from distant memory)

let's add a check and network:stop() to simtest test_wifi_scan.erl

-module(test_wifi_scan).

-export([start/0]).

start() ->
    {ok, _Pid} = network:start([{sta, [managed]}]),
    {ok, {Num, Networks}} = network:wifi_scan(),
    io:format("network:wifi_scan found ~p networks.~n", [Num]),
    lists:foreach(
        fun(
            _Network =
                #{
                    ssid := SSID,
                    rssi := DBm,
                    authmode := Mode,
                    bssid := BSSID,
                    channel := Number
                }
        ) ->
            io:format(
                "Network: ~p, BSSID: ~p, signal ~p dBm, Security: ~p, channel "
                "~p~n",
                [SSID, binary:encode_hex(BSSID), DBm, Mode, Number]
            )
        end,
        Networks
    ),
    true = lists:any(fun(#{ssid := SSID}) -> SSID =:= <<"Wokwi-GUEST">> end, Networks),
    ok = network:stop(),
    ok.

LLM review: feel free to pick and choose! think most importantly is the ssid data type charlist vs binary?

PR #1165 Review — Add network:wifi_scan/0,1 to ESP32 network driver

Branch: pr/1165
Commits reviewed: feef99b08 (Add network:wifi_scan/0,1) and 44e29c15a (Document new wifi functions)


Must Fix

1. SSIDs should be returned as binaries, not charlists (network_driver.c)

Severity: critical
SSIDs are currently built with term_from_string which creates Erlang charlists. This causes three problems:

  • Elixir compatibility: Elixir uses binaries as default strings. Consuming SSIDs as charlists from Elixir is awkward (~c"MyNetwork" instead of "MyNetwork"). BSSIDs are already returned as binaries, so this is inconsistent.
  • Heap sizing bug: The per-AP heap estimate uses SSID_MAX_SIZE (33) which is the raw byte count. Charlists require TERM_STRING_SIZE(32) (~64 terms) — roughly double. This under-allocation can cause stack-heap overwrite / memory corruption with longer SSIDs.
  • Memory efficiency: Binaries use significantly less heap than charlists (header + word-aligned data vs 2 terms per byte), which matters on memory-constrained chips like esp32s2.

Note: the test file test_wifi_scan.erl already assumes SSIDs are binaries (SSID =:= <<"Wokwi-GUEST">>) — this match will fail against the current charlist return.

Fix: Change term_from_string to term_from_literal_binary for SSIDs in wifi_ap_records_to_list_maybe_gc, and update the heap sizing to use TERM_BINARY_HEAP_SIZE(SSID_MAX_SIZE - 1):

// SSIDs — change from charlist to binary:
ssid = term_from_literal_binary(ap_records[i].ssid, ssid_size, heap, global);

// Heap sizing — update ap_data_size:
size_t ap_data_size = TUPLE_SIZE(2) + term_map_size_in_terms(5)
    + TERM_BINARY_HEAP_SIZE(SSID_MAX_SIZE - 1) + TERM_BINARY_HEAP_SIZE(BSSID_SIZE);

If charlists are preferred for backward compatibility, the heap sizing must still be fixed:

size_t ap_data_size = TUPLE_SIZE(2) + term_map_size_in_terms(5)
    + TERM_STRING_SIZE(SSID_MAX_SIZE - 1) + TERM_BINARY_HEAP_SIZE(BSSID_SIZE);
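To make the size difference concrete, here is a rough sketch of the two cost models the review describes: 2 terms per byte for a charlist, and a header plus word-aligned data for a binary. The helper names and the `header_terms` parameter are assumptions for illustration; the real numbers come from AtomVM's TERM_STRING_SIZE and TERM_BINARY_HEAP_SIZE macros.

```c
#include <assert.h>
#include <stddef.h>

/* A charlist costs 2 terms (one cons cell) per byte, as the review states. */
static size_t charlist_terms(size_t len)
{
    return 2 * len;
}

/*
 * Assumption: a heap binary costs header_terms of bookkeeping plus the data
 * rounded up to whole machine words of word_bytes each.
 */
static size_t binary_terms(size_t len, size_t word_bytes, size_t header_terms)
{
    return header_terms + (len + word_bytes - 1) / word_bytes;
}

/* Usage: a 32-byte SSID on a 32-bit chip, assuming a 2-term binary header. */
static void heap_cost_demo(void)
{
    assert(charlist_terms(32) == 64);
    assert(binary_terms(32, 4, 2) == 10);
}
```

Under these assumptions a maximum-length SSID takes 64 terms as a charlist but only 10 as a binary, which is the gap the heap estimate has to account for.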

2. Double-wrapping {error, …} in wifi_scan (network_driver.c)

Severity: high
Several call sites in wifi_scan create an error tuple with port_create_error_tuple(ctx, BADARG_ATOM) and then pass the result to send_scan_error_reply, which internally calls port_create_error_tuple again. This produces {error, {error, badarg}} instead of the intended {error, badarg}.

Affected lines (in wifi_scan):

term error = port_create_error_tuple(ctx, BADARG_ATOM);
send_scan_error_reply(ctx, pid, ref, error);  // double-wraps!

Fix: Pass the bare reason atom directly:

send_scan_error_reply(ctx, pid, ref, BADARG_ATOM);
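The double-wrapping can be reproduced outside the driver with a small model in which terms are plain strings. Both helpers are stand-ins invented for this sketch; they only mimic the wrapping behavior attributed above to port_create_error_tuple and send_scan_error_reply.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for port_create_error_tuple: wraps a reason as {error,Reason}. */
static void make_error_tuple(const char *reason, char *out, size_t out_len)
{
    snprintf(out, out_len, "{error,%s}", reason);
}

/* Stand-in for send_scan_error_reply: it wraps the reason itself. */
static void build_scan_error_reply(const char *reason, char *out, size_t out_len)
{
    make_error_tuple(reason, out, out_len);
}

static void double_wrap_demo(void)
{
    char pre_wrapped[32];
    char reply[64];

    /* Buggy call pattern: wrap first, then hand the tuple to the sender. */
    make_error_tuple("badarg", pre_wrapped, sizeof(pre_wrapped));
    build_scan_error_reply(pre_wrapped, reply, sizeof(reply));
    assert(strcmp(reply, "{error,{error,badarg}}") == 0);

    /* Fixed call pattern: pass the bare reason. */
    build_scan_error_reply("badarg", reply, sizeof(reply));
    assert(strcmp(reply, "{error,badarg}") == 0);
}
```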

3. Use-after-free if esp_event_handler_unregister fails (network_driver.c)

Severity: high
In send_scan_results, if esp_event_handler_unregister fails, data (ScanClientData) is still freed. The still-registered handler can then fire again and access freed memory. This affects multiple paths:

  • Success path at end of send_scan_results
  • calloc failure for ap_records (line ~443)
  • esp_wifi_scan_get_ap_records failure (line ~465–466)
  • esp_wifi_scan_start failure in wifi_scan (line ~1488)

Fix: Only free data when esp_event_handler_unregister succeeds. If it fails, intentionally leak data (the lesser evil) and log the error.
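A minimal sketch of that cleanup rule, assuming a hypothetical `release_scan_data()` helper and a function pointer standing in for esp_event_handler_unregister (the struct is likewise a placeholder for ScanClientData):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for ScanClientData. */
struct scan_data {
    int owner_pid;
};

/* Hypothetical stand-in for esp_event_handler_unregister; 0 means success. */
typedef int (*unregister_fn)(void *handler_arg);

static int unregister_ok(void *arg) { (void) arg; return 0; }
static int unregister_fail(void *arg) { (void) arg; return -1; }

/*
 * Frees data only if the handler was unregistered. Returns true when the
 * memory was released, false when it was deliberately leaked.
 */
static bool release_scan_data(struct scan_data *data, unregister_fn unregister)
{
    if (unregister(data) != 0) {
        /* The handler may still fire and touch data, so keep it alive. */
        fprintf(stderr, "scan handler unregister failed; leaking scan data\n");
        return false;
    }
    free(data);
    return true;
}

static void release_demo(void)
{
    struct scan_data *a = calloc(1, sizeof(*a));
    struct scan_data *b = calloc(1, sizeof(*b));
    assert(a != NULL && b != NULL);
    assert(release_scan_data(a, unregister_ok));
    assert(!release_scan_data(b, unregister_fail));
    b->owner_pid = 1; /* still valid: b was not freed */
    free(b);          /* the demo cleans up; the driver would leak instead */
}
```

Routing every error path through one helper like this would keep the "free only after a successful unregister" rule in a single place.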

4. Missing defensive clamp on num_results after esp_wifi_scan_get_ap_records (network_driver.c)

Severity: medium
esp_wifi_scan_get_ap_records(&num_results, ap_records) updates num_results by reference. If it were to return a value larger than what was allocated for ap_records, subsequent iteration in wifi_ap_records_to_list_maybe_gc would read out of bounds.

Fix: Save the allocated count before the call and clamp afterward:

uint16_t max_results = num_results;
esp_err_t err = esp_wifi_scan_get_ap_records(&num_results, ap_records);
if (num_results > max_results) {
    num_results = max_results;
}

Should Fix

5. Scan result misattribution race (network.erl / network_driver.c)

Severity: medium
All scan requests share the same port Ref. If a scan is cancelled and a new one is started, late results from the old scan arrive with the same Ref and can be delivered to the new scan_receiver.

Fix: Generate a per-scan ref in handle_call({scan,...}), send it to the port, and match on it in handle_info to ignore stale results.
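The suggested fix lives on the Erlang side, but the matching logic is simple enough to model in C. This sketch uses a hypothetical `scan_state` struct and integer refs; it only shows how a per-scan ref lets late results from a cancelled scan be dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receiver state: the ref of the scan currently awaited. */
struct scan_state {
    uint64_t next_ref;
    uint64_t active_ref; /* 0 means no scan in flight */
};

/* Each scan gets a fresh ref, so refs from older scans never match. */
static uint64_t start_scan(struct scan_state *s)
{
    s->active_ref = ++s->next_ref;
    return s->active_ref;
}

static void cancel_scan(struct scan_state *s)
{
    s->active_ref = 0;
}

/* Accept a result only if it carries the ref of the scan in flight. */
static bool accept_result(struct scan_state *s, uint64_t ref)
{
    if (ref == 0 || ref != s->active_ref) {
        return false; /* stale or unexpected result: ignore it */
    }
    s->active_ref = 0;
    return true;
}

static void stale_result_demo(void)
{
    struct scan_state s = { 0, 0 };
    uint64_t old_ref = start_scan(&s);
    cancel_scan(&s);
    uint64_t new_ref = start_scan(&s);
    assert(!accept_result(&s, old_ref)); /* late result from cancelled scan */
    assert(accept_result(&s, new_ref));  /* result for the current scan */
    assert(!accept_result(&s, new_ref)); /* duplicates are ignored too */
}
```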

6. Inconsistent cancel message atom (network.erl)

Severity: low
On cancellation, Pid receivers get {scan_done, {error, canceled}} but on success they get {scan_results, Results}. The mismatched wrapper atoms (scan_done vs scan_results) make it harder for receivers to handle both cases uniformly.

Fix: Use {scan_results, {error, canceled}} for Pid receivers on cancellation.


Low Priority

7. Typo in docs: "per-chanel" (network.erl)

Severity: low
Line 478: per-chanel → per-channel

8. Typo in docs: "an not" (network.erl)

Severity: low
Line 489: an not in the process → and not in the process


Positive Notes

  • Dynamic gen_server:call timeout calculation based on channel count and dwell time is well designed — prevents timeouts on 5 GHz capable chips.
  • Defensive calloc/malloc failure handling throughout the C driver is solid for embedded targets.
  • Immediate freeing of scan_config after esp_wifi_scan_start is correct per ESP-IDF semantics.
  • Good documentation and examples covering both blocking and callback scan modes.

@petermm
Contributor

petermm commented Mar 16, 2026

Made a potential fix if you want it:

https://github.com/petermm/AtomVM/tree/fix/scan-cleanup-lifetime

PR Review: Fix ESP32 wifi scan handling

Commit reviewed: 9d13bb8d
Files reviewed:

  • libs/avm_network/src/network.erl
  • src/platforms/esp32/components/avm_builtins/network_driver.c

Verdict

Merge-ready.

The patch is focused, the behavior change is clear, and the final scope matches the bug being fixed.

Key Changes

1. Correct blocking reply behavior in network.erl

scan_reply_or_callback/2 now handles the blocking caller case explicitly as {Pid, Ref} and replies via gen_server:reply/2.

That is the correct behavior for network:wifi_scan/0, which waits on a synchronous gen_server:call. The updated handle_info/2 logic still clears scan_receiver before the reply is processed, preserving the intended race-avoidance behavior for subsequent scans.

2. Fetch AP records in the scan-done callback

network_driver.c now performs esp_wifi_scan_get_ap_num() and esp_wifi_scan_get_ap_records() in scan_done_handler(), while scan_results_task() is used only for Erlang term construction and mailbox delivery.

This is a good split of responsibilities:

  • the callback interacts with the ESP-IDF scan result APIs
  • the worker task handles the stack-heavy AtomVM term packaging path

3. Stronger scan ownership and teardown handling

The new scan lifecycle code improves cleanup and shutdown safety:

  • unregister_scan_if_owner() keeps ownership cleanup tied to successful handler unregistration
  • wait_for_scan_task() prevents shutdown or scan replacement from racing an in-flight worker
  • ScanClientData now owns the prefetched AP records and releases them centrally in scan_data_release()

Together, these changes make the scan path easier to reason about and reduce the chance of stale ownership, teardown races, or leaked scan result buffers.

Scope

The final patch only changes the two files that implement the fix:

  • network.erl for scan reply semantics
  • network_driver.c for scan result retrieval and cleanup ordering

That is the right scope for this commit.

Validation

  • test_wifi_scan passed on the target chips used for verification
  • libs/avm_network/src/network.erl compiles cleanly with erlc
  • git diff --check HEAD~1..HEAD is clean

Follow-Up Worth Doing

Not a blocker for this commit, but useful follow-up coverage would be:

  • a regression test for callback-based scan receivers (scan_done pid/fun path)
  • a cancellation test for back-to-back scans

Summary

This is a solid fix. It addresses reply delivery on the Erlang side, moves ESP-IDF scan result retrieval into the safer callback context, and tightens scan cleanup semantics without broadening the patch beyond the code directly involved.

@UncleGrumpy
Collaborator Author

Made a potential fix if you want it:

Thanks, I will take a look. I unfortunately let myself get gaslit into making some AI-suggested changes, which I stupidly squashed with a fixup too soon, and now my branch is in a bad state. This was compounded by vscode not following along when I changed branches in the CLI, so my changes were made to what should have been my good branch, not the working branch I changed to. I need to back up to where things started to go off track, and make sure this "fix" isn't still based on some point after things went terribly wrong; otherwise it's a band-aid on top of a bandage, covering up a self-inflicted wound.

@petermm
Contributor

petermm commented Mar 27, 2026

Code Review Findings: network:wifi_scan/0,1 (commits ea7bb4e, 848845d)

Contributor note: "The callback works now, but the direct call still shows a similar kind of corruption, but it is surfacing in a different place... I seem to be corrupting something..."


1. 🔴 Map key ordering is wrong (most likely corruption cause)

File: network_driver.c → wifi_ap_records_to_list_maybe_gc() (lines 285–290)

AtomVM requires map keys to be in sorted Erlang term order (lexical for atoms). The current insertion order is wrong:

// CURRENT (wrong order):
term_set_map_assoc(ap_data, 0, ssid_atom_term, ssid_term);      // ssid
term_set_map_assoc(ap_data, 1, channel_atom_term, channel);      // channel
term_set_map_assoc(ap_data, 2, bssid_atom_term, bssid_term);     // bssid
term_set_map_assoc(ap_data, 3, authmode_atom_term, authmode);    // authmode
term_set_map_assoc(ap_data, 4, rssi_atom_term, rssi);            // rssi

Required order: authmode < bssid < channel < rssi < ssid

// FIX (correct sorted order):
term_set_map_assoc(ap_data, 0, authmode_atom_term, authmode);
term_set_map_assoc(ap_data, 1, bssid_atom_term, bssid_term);
term_set_map_assoc(ap_data, 2, channel_atom_term, channel);
term_set_map_assoc(ap_data, 3, rssi_atom_term, rssi);
term_set_map_assoc(ap_data, 4, ssid_atom_term, ssid_term);

This violates AtomVM's map invariant and will corrupt map operations on the receiving side. It explains why the callback path (which may not pattern-match the map the same way) appears to work while the direct/blocking path shows corruption.
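One way to avoid getting the order wrong by hand is to sort the key names before assigning map slots. The sketch below does that with plain strings and qsort. It assumes that byte-wise strcmp order matches the atom ordering for these five keys, and it does not touch the real term_set_map_assoc API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of C strings. */
static int compare_keys(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Sorts key names in place using byte-wise (lexical) comparison. */
static void sort_map_keys(const char **keys, size_t count)
{
    qsort(keys, count, sizeof(keys[0]), compare_keys);
}

/* Usage: the insertion order from the buggy code, sorted into slot order. */
static void key_order_demo(void)
{
    const char *keys[] = { "ssid", "channel", "bssid", "authmode", "rssi" };
    sort_map_keys(keys, 5);
    assert(strcmp(keys[0], "authmode") == 0);
    assert(strcmp(keys[1], "bssid") == 0);
    assert(strcmp(keys[2], "channel") == 0);
    assert(strcmp(keys[3], "rssi") == 0);
    assert(strcmp(keys[4], "ssid") == 0);
}
```

The sorted result matches the required order listed above (authmode < bssid < channel < rssi < ssid).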


2. 🟡 Test typo: erlang:resister → erlang:register

File: test_wifi_scan.erl (line 113)

%% CURRENT (typo):
erlang:resister(stop_test, self()),

%% FIX:
erlang:register(stop_test, self()),

This will cause a runtime crash (undef) when the test runs.


3. 🟡 Heap size over-allocation (wasteful, not dangerous)

File: network_driver.c → send_scan_results() (line 533)

size_t ap_data_size = (TERM_MAP_SIZE(5)
    + TERM_BINARY_HEAP_SIZE(SSID_MAX_SIZE) + SSID_MAX_SIZE    // double-counts
    + TERM_BINARY_HEAP_SIZE(BSSID_SIZE) + BSSID_SIZE           // double-counts
    + BOXED_INT_SIZE * 2);

TERM_BINARY_HEAP_SIZE(X) already includes storage for the binary data. The extra + SSID_MAX_SIZE and + BSSID_SIZE are redundant. Not a corruption source but wastes memory on a constrained device.


4. 🟡 esp_wifi_scan_get_ap_records buffer size mismatch

File: network_driver.c (line 503)

// CURRENT: num_results may be larger than the allocated buffer
err = esp_wifi_scan_get_ap_records(&num_results, ap_records);

The buffer is allocated with return_results entries, but num_results (from data->num_results) is passed. If num_results > return_results, the IDF could write past the buffer. Should pass &return_results instead.


5. 🟠 Pre-existing bugs in the same file (unrelated to scan)

5a. Wrong pointer cast in AP event handlers

// CURRENT (casts event_base, which is a string like "WIFI_EVENT"):
wifi_event_ap_staconnected_t *event = (wifi_event_ap_staconnected_t *) event_base;
wifi_event_ap_stadisconnected_t *event = (wifi_event_ap_stadisconnected_t *) event_base;

// FIX (should cast event_data):
wifi_event_ap_staconnected_t *event = (wifi_event_ap_staconnected_t *) event_data;
wifi_event_ap_stadisconnected_t *event = (wifi_event_ap_stadisconnected_t *) event_data;

5b. Potential NULL deref on strlen(psk) for open networks

When psk is NULL (open network), the bounds check strlen(psk) > ... will dereference NULL.
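A guarded version of that bounds check could look like the following sketch. `psk_length_ok()` and `PSK_MAX_LEN` are hypothetical names; 63 is the maximum WPA passphrase length, and the limit the driver actually enforces may be different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: WPA passphrases are at most 63 characters long. */
#define PSK_MAX_LEN 63

/* Returns true when psk is absent (open network) or short enough. */
static bool psk_length_ok(const char *psk)
{
    if (psk == NULL) {
        return true; /* open network: nothing to measure */
    }
    return strlen(psk) <= PSK_MAX_LEN;
}

static void psk_guard_demo(void)
{
    char long_psk[PSK_MAX_LEN + 2];
    memset(long_psk, 'a', PSK_MAX_LEN + 1);
    long_psk[PSK_MAX_LEN + 1] = '\0';

    assert(psk_length_ok(NULL));       /* no NULL dereference */
    assert(psk_length_ok("secret12")); /* ordinary passphrase */
    assert(!psk_length_ok(long_psk));  /* 64 characters: too long */
}
```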


6. ℹ️ Items reviewed and found safe

  • send_scan_error_from_task() passing Heap by value: Safe — the stack heap remains valid for the duration of the call, and port_send_message_from_task() copies terms into mailbox storage before returning.
  • Thread safety of send_scan_results(): Uses memory_init_heap + port_send_message_from_task + memory_destroy_heap_from_task, which is the standard safe pattern in AtomVM.
  • Direct vs callback path logic in Erlang: No native-memory issue; the same {scan_results, Results} term arrives in both cases.

Priority

| # | Severity | Fix |
|---|----------|-----|
| 1 | 🔴 High | Reorder map keys to sorted atom order |
| 2 | 🟡 Medium | Fix test typo resister → register |
| 4 | 🟡 Medium | Pass &return_results to esp_wifi_scan_get_ap_records |
| 3 | 🟡 Low | Remove redundant heap size terms |
| 5a | 🟠 Pre-existing | Fix event_base → event_data cast |
| 5b | 🟠 Pre-existing | Guard strlen(psk) against NULL |

@petermm
Contributor

petermm commented Mar 27, 2026

New bug introduced: Every if check uses = (assignment) instead of == (comparison):

if (UNLIKELY(atom = term_invalid_term())) // assigns, always "succeeds"

Should be:

if (UNLIKELY(atom == term_invalid_term()))

This appears on every check in the function (lines 266, 270, 274, etc.). As written, each check assigns term_invalid_term() to atom, and since term_invalid_term() is likely 0, the condition is always false, so the function always returns true and never catches failures. The guard is effectively a no-op, and the assignment also clobbers the valid atom value.
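The effect is easy to reproduce in isolation. In this sketch `INVALID_TERM` is defined as 0 purely as an assumption standing in for term_invalid_term(), and both check functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: stands in for term_invalid_term(), taken to be 0 here. */
#define INVALID_TERM 0u

/* Buggy: the assignment overwrites *atom and the condition tests the new value. */
static bool check_buggy(unsigned *atom)
{
    if ((*atom = INVALID_TERM)) {
        return false;
    }
    return true;
}

/* Fixed: compares without modifying the value. */
static bool check_fixed(const unsigned *atom)
{
    return *atom != INVALID_TERM;
}

static void assignment_bug_demo(void)
{
    unsigned valid = 42u;
    unsigned invalid = INVALID_TERM;

    assert(check_buggy(&invalid)); /* the failure is not caught */
    assert(check_buggy(&valid));   /* reports success... */
    assert(valid == INVALID_TERM); /* ...but the valid value is gone */

    valid = 42u;
    assert(check_fixed(&valid));
    assert(!check_fixed(&invalid));
    assert(valid == 42u);
}
```

Building with -Wall (which enables -Wparentheses in GCC and Clang) normally flags an assignment used directly as a condition; the extra parentheses in the sketch are there only to silence that warning.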

Summary of fixes applied:
✅ Map key ordering fixed (authmode < bssid < channel < rssi < ssid)
✅ resister → register typo fixed
✅ esp_wifi_scan_get_ap_records(&return_results, ...) fixed
✅ Heap size double-count removed
✅ event_base → event_data cast fixed
✅ NULL guard on strlen(psk) added
✅ Test improvements (timeouts, child process tracking)
🔴 New bug: = vs == in all ensure_scan_atoms_exist checks
⚠️ Stray ; on line 348 (cosmetic)

@UncleGrumpy
Copy link
Copy Markdown
Collaborator Author

New bug introduced: Every if check uses = (assignment) instead of == (comparison):

if (UNLIKELY(atom = term_invalid_term())) // assigns, always "succeeds"

Well that's embarrassing! Definitely a symptom of losing too much sleep over my incorrect map ordering for the last couple weeks ;-)

@petermm
Contributor

petermm commented Mar 27, 2026

Code Review Round 2: network:wifi_scan/0,1 (commits 5248f7c, 8e4d322)

Previous round findings (map key ordering, = vs ==, event_base/event_data cast, NULL deref on psk, heap overcount, test typo) have all been fixed.


1. 🔴 Uncommon auth modes still create atoms from task context

File: network_driver.c → authmode_to_atom_term() (line ~158) and ensure_scan_atoms_exist() (line ~260)

ensure_scan_atoms_exist() pre-registers only 7 common auth atoms:
open, wep, wpa_psk, wpa2_psk, wpa_wpa2_psk, wpa3_psk, wpa2_wpa3_psk

But authmode_to_atom_term() can also return these (not pre-registered):

  • eap, wapi, wpa3_enterprise_192, owe
  • wpa3_ext_psk, wpa3_ext_psk_mixed (IDF ≥ 5.2)
  • dpp (IDF ≥ 5.3)
  • wpa3_enterprise, wpa2_wpa3_enterprise (IDF ≥ 5.4)
  • wpa_enterprise
  • dummy1 to dummy5 (older IDF versions)

If any of these auth modes is encountered during a scan, make_atom() is called from the ESP-IDF event loop task context. This is:

  1. Unsafe on AVM_NO_SMP builds — atom table locks compile out, so concurrent insertion from the event task and the scheduler is a data race.
  2. Fragile on SMP builds — the insertion can allocate, rehash, or fail with OOM in callback context.
  3. If make_atom() fails, authmode_to_atom_term() returns term_invalid_term(), which gets inserted into the map — corrupting it.

Fix

Pre-register all auth atoms that authmode_to_atom_term() can return inside ensure_scan_atoms_exist(), then switch authmode_to_atom_term() to use globalcontext_existing_term_from_atom_string() instead of make_atom(). Add a safe fallback (e.g., ERROR_ATOM or a pre-registered unknown_authmode atom) in the default: / WIFI_AUTH_MAX case so it never returns term_invalid_term().
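The lookup-with-fallback idea can be sketched with a static table. The table type, the function name, and the numeric codes are illustrative assumptions; real code would switch on wifi_auth_mode_t and resolve atoms through the existing-atom lookup named above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative codes; the real values come from wifi_auth_mode_t. */
struct authmode_name {
    int code;
    const char *name;
};

/* Names registered up front, before any scan callback can run. */
static const struct authmode_name known_authmodes[] = {
    { 0, "open" },
    { 1, "wep" },
    { 2, "wpa_psk" },
    { 3, "wpa2_psk" },
    { 4, "wpa_wpa2_psk" },
};

/* Lookup only: never creates a new entry, always returns a usable name. */
static const char *authmode_name_lookup(int code)
{
    size_t count = sizeof(known_authmodes) / sizeof(known_authmodes[0]);
    for (size_t i = 0; i < count; i++) {
        if (known_authmodes[i].code == code) {
            return known_authmodes[i].name;
        }
    }
    return "unknown_authmode"; /* safe fallback instead of an invalid term */
}

static void authmode_lookup_demo(void)
{
    assert(strcmp(authmode_name_lookup(3), "wpa2_psk") == 0);
    assert(strcmp(authmode_name_lookup(99), "unknown_authmode") == 0);
}
```

Because the callback path only reads the table, nothing is allocated or inserted from the event task, and an unexpected auth mode degrades to the fallback name rather than corrupting the map.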


2. 🟡 Test uses [active] which is not a recognized scan option

File: test_wifi_scan.erl — lines 44, 96, 115

network:wifi_scan([active])

The atom active is not a recognized option in either the Erlang API (wifi_scan/1 looks for passive, dwell, results, show_hidden) or the C driver. It is silently ignored — the scan runs with defaults, which happen to be active scan anyway.

Fix

Replace [active] with [] or remove the argument to use wifi_scan/0, or use [{passive, false}] if explicit active mode is the intent.


3. 🟡 cancel_scan_test is flaky (timing-dependent)

File: test_wifi_scan.erl — lines 78–122

The test relies on a timer:sleep(50) to ensure the child process's gen_server:call is queued before the parent's. This is not deterministic — under load or on slower chips, the child may not have called wifi_scan yet.

Fix (deterministic approach)

Start network with a scan_done callback, then:

ok = network:wifi_scan(),                %% starts scan (non-blocking with callback)
{error, busy} = network:wifi_scan([]),   %% immediately busy
%% wait for callback result

This tests busy-detection without cross-process timing.


4. 🟡 network_stop_while_scanning_test assumes unguaranteed semantics

File: test_wifi_scan.erl — lines 124–151

The test expects to receive either {error, canceled} or {Num, Networks} after network:stop(). But:

  • network:stop/0 calls cancel_scan then gen_server:stop — it does not explicitly deliver {error, canceled} to the scan_done callback.
  • The scan may have already completed (callback already fired), or the gen_server may terminate before the spawned callback process delivers results.
  • {error, canceled} is not produced anywhere in the current cancel path.

Fix

Either:

  • Weaken the test: assert only that network:stop() does not hang or crash (the simple contract).
  • Strengthen the code: explicitly send {error, canceled} to the scan receiver during the cancel/stop path (the stronger contract), then the test is valid.

5. ℹ️ Items reviewed and found acceptable

| Item | Status |
|------|--------|
| BOXED_INT_SIZE * 2 for rssi/channel | Over-allocation only; small ints are immediate, no boxing needed. Not a corruption risk, minor memory waste. |
| send_scan_error_from_task takes Heap by value | Safe: port_send_message_from_task deep-copies terms into mailbox before return; stack heap remains valid throughout. Awkward API but not a bug. |
| Heavy work in scan_done_handler (event loop task) | Acceptable at current MAX_SCAN_RESULTS caps (10–64). Worth refactoring later if event loop latency becomes visible. |
| All previous round-1 fixes | Confirmed applied correctly. |


Update: verification of commit 601c62f

Fixed in this push

  • Memory leak: Added missing free(ap_records) in the error path after esp_wifi_scan_get_ap_records fails (line 568–569). Good catch.
  • Test cleanup: network_stop_while_scanning_test now uses try/after for erlang:unregister, and callback uses whereis guard.

Still open from Round 2

| # | Issue | Status |
|---|-------|--------|
| 1 | 🔴 Auth atoms from task context | ❌ Not addressed |
| 2 | 🟡 [active] not a valid option | ❌ Still used (stop test line still has [active]) |
| 3 | 🟡 cancel_scan_test flaky | ❌ Not changed |
| 4 | 🟡 Stop test semantics | ⚠️ Partially improved cleanup, but still expects {error, canceled} which the code doesn't produce |

New issue introduced

6. 🔴 erlang:register(stop_test, self()) accidentally removed

File: test_wifi_scan.erl, network_stop_while_scanning_test/0

The try/after refactor removed the erlang:register(stop_test, self()) call, but the callback still references the registered name:

scan_callback_handler(Results) ->
    case erlang:whereis(stop_test) of
        undefined ->
            erlang:error({lost_parent, stop_test});   %% will always hit this
        Pid ->
            Pid ! Results
    end.

Without the register call, stop_test is never registered, so the callback will always crash with {lost_parent, stop_test}. The register line needs to be restored inside the try block before network:start/1.


Priority

| # | Severity | Fix | Scope |
| --- | --- | --- | --- |
| 1 | 🔴 High | Pre-register all auth atoms; use existing-atom lookup in task context; add safe fallback | S–M |
| 6 | 🔴 High | Restore erlang:register(stop_test, self()) in stop test | S |
| 2 | 🟡 Medium | Replace [active] with [] or [{passive, false}] in tests | S |
| 3 | 🟡 Medium | Make cancel_scan_test deterministic (no sleep-based ordering) | S |
| 4 | 🟡 Medium | Align stop-while-scanning test with actual code contract | S |
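
As an illustration of item 1 in the priority table above (pre-register names, look them up without allocating in task context, and fall back safely), here is a minimal sketch in plain C. It is not the driver's code: `authmode_name`, `known_authmodes`, `fallback_authmode`, and the numeric indices are hypothetical stand-ins for the ESP-IDF auth-mode enum and for AtomVM's atom table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table of names registered once at init time. The indices
 * stand in for auth-mode enum values and are not guaranteed to match the
 * real ESP-IDF numbering. */
static const char *const known_authmodes[] = {
    "open", "wep", "wpa_psk", "wpa2_psk"
};

#define KNOWN_AUTHMODE_COUNT (sizeof(known_authmodes) / sizeof(known_authmodes[0]))

/* Single pre-registered fallback for anything the table does not know. */
static const char *const fallback_authmode = "unknown";

/* Lookup that never allocates, so it is safe to call from a context where
 * creating new entries is not allowed. Unknown or out-of-range modes map
 * to the fallback instead of creating a new name on the fly. */
static const char *authmode_name(int mode)
{
    if (mode < 0 || (size_t) mode >= KNOWN_AUTHMODE_COUNT) {
        return fallback_authmode;
    }
    assert(known_authmodes[mode] != NULL);
    return known_authmodes[mode];
}
```

With this shape, placeholder modes that never appear in the wild cost nothing up front: they simply resolve to the fallback.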

@UncleGrumpy
Collaborator Author

  1. 🔴 Uncommon auth modes still create atoms from task context

I'm not 100% sold on this. For example, dummy1 - dummy5 will never be observed in the wild; those are really just placeholders that are replaced in later IDF versions with real authmodes. Adding all of the atoms up front will consume more memory permanently and could pollute the atoms table with atoms that are likely never used. I went ahead and added the real authmodes, but the dummy# atoms are not pre-populated.

),
case network:wifi_scan() of
{error, Reason} ->
io:format("wifi_scan failed for reason ~p", [Reason]);
Collaborator

There is a missing ~n here. Also (unrelated), I personally favor \n as it avoids a replacement.

Collaborator Author

Thanks, fixed. I also replaced the use of ~n with \n in the other example and tests.

esp32 -> ok;
_ -> error(unsupported_platform)
end,
Passive = proplists:get_value(passive, Options, false),
Collaborator

Do you want to use get_bool/2 here (and elsewhere in the file?)

Collaborator Author

Yes, that looks cleaner. I believe I caught all of the places where that makes sense.

@petermm
Contributor

petermm commented Mar 29, 2026

Usual caveats - so pick and choose

PR Review: Add network:wifi_scan/0,1 to ESP32 Network Driver

Commits reviewed:

  • 7f46fc7dc — Add network:wifi_scan/0,1 to esp32 network driver
  • 8bb589aed — Document new wifi functions

Files changed: 10 files, +1469 / −33 lines


Summary

These commits add WiFi scanning capability (wifi_scan/0,1) to the AtomVM ESP32 network driver, supporting both blocking and callback modes, with scan cancellation and configurable options (active/passive, dwell time, max results, hidden networks). The implementation spans the Erlang gen_server (network.erl), the C native driver (network_driver.c), tests, examples, and documentation.


Findings

1. 🔴 HIGH — Cancel/stop path races and leaks ScanData

Files: network_driver.c, network.erl

wifi_scan() heap-allocates a struct ScanData (line ~1635) which is only freed inside send_scan_results() when the handler runs successfully and unregister succeeds. Neither cancel_scan() nor stop_network() free it:

  • cancel_scan() calls esp_wifi_scan_stop() but has no access to the ScanData pointer.
  • stop_network() unregisters scan_done_handler but also cannot free the arg data.

If the handler is removed before it fires, the ScanData is leaked. ESP-IDF documents that esp_wifi_scan_stop() also triggers WIFI_EVENT_SCAN_DONE, making this race non-hypothetical.

On the Erlang side, handle_info has two competing terminal messages (scan_results and scan_canceled), and whichever arrives first wins. The test at test_wifi_scan.erl:146-155 explicitly accepts either outcome, codifying the nondeterminism.

Suggested fix: Track the active scan context in driver-owned state. On cancel, mark it canceled. In WIFI_EVENT_SCAN_DONE, check the canceled flag and emit exactly one terminal outcome, freeing the context in one place.
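
A minimal sketch of that idea in plain C, assuming hypothetical names (`struct driver_state`, `struct scan_ctx`, `scan_begin`, `scan_cancel`, `scan_done`) rather than the driver's real types. It ignores locking between the caller and the event-loop task, which a real driver would need to handle, and it returns an enum where the driver would send an Erlang term.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical outcome codes; the real driver would send Erlang terms. */
enum scan_outcome { SCAN_NONE, SCAN_RESULTS, SCAN_CANCELED };

/* Hypothetical driver-owned state: at most one scan context at a time. */
struct scan_ctx { bool canceled; };
struct driver_state { struct scan_ctx *active_scan; };

/* Start a scan; fails if one is already active or allocation fails. */
static bool scan_begin(struct driver_state *drv)
{
    assert(drv != NULL);
    if (drv->active_scan != NULL) {
        return false;
    }
    drv->active_scan = calloc(1, sizeof(struct scan_ctx));
    return drv->active_scan != NULL;
}

/* Cancel only marks the context; it neither frees nor reports anything. */
static void scan_cancel(struct driver_state *drv)
{
    assert(drv != NULL);
    if (drv->active_scan != NULL) {
        drv->active_scan->canceled = true;
    }
}

/* Called for the scan-done event: the one place that picks the terminal
 * outcome and frees the context. A duplicate event yields SCAN_NONE. */
static enum scan_outcome scan_done(struct driver_state *drv)
{
    assert(drv != NULL);
    struct scan_ctx *ctx = drv->active_scan;
    if (ctx == NULL) {
        return SCAN_NONE;
    }
    enum scan_outcome outcome = ctx->canceled ? SCAN_CANCELED : SCAN_RESULTS;
    drv->active_scan = NULL;
    free(ctx);
    return outcome;
}
```

Because only `scan_done` frees the context and clears the pointer, a cancel followed by the done event produces exactly one terminal outcome, and a second done event is a harmless no-op.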


2. 🔴 HIGH — scan_done_handler ignores scan event status

File: network_driver.c:844-852

static void scan_done_handler(void *arg, esp_event_base_t event_base,
                              int32_t event_id, void *event_data)
{
    UNUSED(event_data)   // <-- status is ignored!
    struct ScanData *data = (struct ScanData *) arg;
    if (event_base == WIFI_EVENT && event_id == WIFI_EVENT_SCAN_DONE) {
        send_scan_results(data);
    }
}

event_data contains wifi_event_sta_scan_done_t with a status field that distinguishes success from cancellation/failure. Ignoring it means canceled or failed scans can still produce "successful" results.

Suggested fix:

 static void scan_done_handler(void *arg, esp_event_base_t event_base,
                               int32_t event_id, void *event_data)
 {
-    UNUSED(event_data)
     struct ScanData *data = (struct ScanData *) arg;
 
     if (event_base == WIFI_EVENT && event_id == WIFI_EVENT_SCAN_DONE) {
-        ESP_LOGD(TAG, "Scan complete.");
-        send_scan_results(data);
+        wifi_event_sta_scan_done_t *scan_done = (wifi_event_sta_scan_done_t *) event_data;
+        if (scan_done->status == 0) {
+            ESP_LOGD(TAG, "Scan complete.");
+            send_scan_results(data);
+        } else {
+            ESP_LOGW(TAG, "Scan ended with status: %u", scan_done->status);
+            esp_wifi_clear_ap_list();
+            esp_err_t err = esp_event_handler_unregister(
+                WIFI_EVENT, WIFI_EVENT_SCAN_DONE, &scan_done_handler);
+            BEGIN_WITH_STACK_HEAP(PORT_REPLY_SIZE + TUPLE_SIZE(2) + TUPLE_SIZE(2), heap);
+            send_scan_error_from_task(
+                data->global, data->owner_process_id,
+                globalcontext_make_atom(data->global,
+                    ATOM_STR("\x8", "canceled")),
+                data->ref_ticks, heap);
+            END_WITH_STACK_HEAP(heap, data->global);
+            if (LIKELY(err == ESP_OK)) {
+                free(data);
+            }
+        }
     }
 }

3. 🟡 MEDIUM — stop/0 is not atomic and races with concurrent scan starts

File: network.erl:444-456

stop() ->
    case gen_server:call(?SERVER, get_scan_state) of
        inactive ->
            gen_server:stop(?SERVER);
        active ->
            case gen_server:call(?SERVER, cancel_scan) of
                ok ->
                    gen_server:stop(?SERVER);
                {error, Reason} ->
                    {error, {scan_not_stopped, Reason}}
            end
    end.

This is a multi-call protocol outside the server. Another process can start a scan between the get_scan_state and gen_server:stop calls.

Suggested fix: Replace with a single gen_server:call(?SERVER, stop_network) that atomically cancels any active scan and terminates:

 stop() ->
-    case gen_server:call(?SERVER, get_scan_state) of
-        inactive ->
-            gen_server:stop(?SERVER);
-        active ->
-            case gen_server:call(?SERVER, cancel_scan) of
-                ok ->
-                    gen_server:stop(?SERVER);
-                {error, Reason} ->
-                    {error, {scan_not_stopped, Reason}}
-            end
-    end.
+    gen_server:call(?SERVER, stop_network).

With a corresponding handle_call(stop_network, ...) clause that cancels scan if active, then triggers gen_server:stop.


4. 🟡 MEDIUM — Callback-mode wifi_scan returns ok before scan is validated by C driver

Files: network.erl:683-689, network_driver.c:1520-1540

When scan_done callback is configured:

handle_call({scan, ScanOpts}, From, #state{...} = State) ->
    network_port ! {self(), Ref, {scan, ScanOpts}},
    case proplists:get_value(scan_done, ...) of
        undefined -> {noreply, State#state{scan_receiver = From}};
        FunOrPid  -> {reply, ok, State#state{scan_receiver = FunOrPid}}
    end;

The caller gets ok before the C driver validates mode, options, allocations, or esp_wifi_scan_start(). Invalid inputs or OOM silently return ok and errors arrive asynchronously.

Suggested fix: Use a two-phase protocol where C sends a {scan_started, ok | {error, Reason}} acknowledgement before returning ok to the caller.


5. 🟡 MEDIUM — Hidden SSIDs are rewritten to <<"(hidden)">>, losing fidelity

File: network_driver.c:376-383

if (ssid_size > 0) {
    ssid_term = term_from_literal_binary(ap_records[i].ssid, ssid_size, heap, global);
} else {
    static const char *hidden_ap = "(hidden)";
    ssid_term = term_from_const_binary((const uint8_t *) hidden_ap, strlen(hidden_ap), heap, global);
}

This forges data — a hidden AP gets the literal SSID "(hidden)" which is indistinguishable from a real AP named "(hidden)".

Suggested fix:

     if (ssid_size > 0) {
         ssid_term = term_from_literal_binary(ap_records[i].ssid, ssid_size, heap, global);
     } else {
-        static const char *hidden_ap = "(hidden)";
-        ssid_term = term_from_const_binary((const uint8_t *) hidden_ap, strlen(hidden_ap), heap, global);
+        ssid_term = term_from_literal_binary((const uint8_t *) "", 0, heap, global);
     }

Returning <<>> (empty binary) for hidden SSIDs preserves fidelity. Optionally add a hidden => true key to the map.
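
The "empty SSID plus explicit flag" shape can be sketched in plain C as follows. This is only an illustration: `struct ssid_view`, `ssid_view_from_record`, and `SSID_BUF_LEN` are hypothetical names, and the 33-byte buffer size is an assumption about the AP record layout, not something taken from the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed buffer size: up to 32 SSID bytes plus a terminating NUL. */
#define SSID_BUF_LEN 33

/* Hypothetical view of one scan record's SSID. The bytes are reported
 * as-is; hidden networks are flagged instead of being renamed. */
struct ssid_view {
    const uint8_t *bytes;
    size_t len;
    bool hidden;
};

static struct ssid_view ssid_view_from_record(const uint8_t ssid[SSID_BUF_LEN])
{
    assert(ssid != NULL);
    /* Find the terminator without reading past the buffer. */
    const uint8_t *nul = memchr(ssid, 0, SSID_BUF_LEN);
    struct ssid_view view;
    view.bytes = ssid;
    view.len = (nul != NULL) ? (size_t) (nul - ssid) : (size_t) (SSID_BUF_LEN - 1);
    view.hidden = (view.len == 0);
    return view;
}
```

An access point that is literally named "(hidden)" then stays distinguishable from a hidden one: the first has a length of 8 and `hidden == false`, the second a length of 0 and `hidden == true`.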


6. 🟡 MEDIUM — Rare atom-allocation failure paths send unhandled message shapes

Files: network_driver.c:1510-1517, 1679-1686; network.erl:748-763

On rare failures (e.g. OOM allocating the scan_results atom), C sends a plain {error, out_of_memory} via port_send_reply:

term ret = port_create_error_tuple(ctx, OUT_OF_MEMORY_ATOM);
port_send_reply(ctx, pid, ref, ret);
return;

But Erlang only handles {Ref, {scan_results, ...}} and {Ref, {scan_canceled, ...}}. The blocking caller would hang until gen_server timeout, and scan_receiver remains occupied — future scans return {error, busy} forever.

Suggested fix: Always wrap scan failures in the {scan_results, {error, Reason}} envelope, or pre-create those atoms at driver init time so this branch is unreachable.


7. 🔵 LOW — scan_receiver type dispatch is brittle; malformed tuples treated as gen_server From

File: network.erl:815-830

case Receiver of
    From when is_tuple(From) ->
        gen_server:reply(From, ...);
    Pid when is_pid(Pid) ->
        Pid ! Msg;
    Fun when is_function(Fun, 1) ->
        Fun(Results)
end

A malformed config like {scan_done, {foo, bar}} passes is_tuple, so it is treated as a gen_server From value, causing a crash in the spawned dispatcher.

Suggested fix: Use tagged wrappers:

-    {noreply, State#state{scan_receiver = From}};
+    {noreply, State#state{scan_receiver = {reply, From}}};
 ...
-    {reply, ok, State#state{scan_receiver = FunOrPid}}
+    {reply, ok, State#state{scan_receiver = {callback, FunOrPid}}}

And validate scan_done config at start/connect time.


8. 🔵 LOW — wifi_scan/0,1 crashes with noproc if server not started

File: network.erl:577-627

If the network server is not started, gen_server:call(?SERVER, ...) raises an exit, but the spec says {error, Reason}.

Suggested fix:

 wifi_scan(Options) ->
+    case erlang:whereis(?SERVER) of
+        undefined -> {error, not_started};
+        _ ->
     ...
+    end.

9. 🔵 LOW — Documentation inaccuracies

File: doc/src/network-programming-guide.md

  • States the default scan is always "active, 120ms, 6 APs" but wifi_scan/0 actually reads startup-configured defaults from sta_config() (default_scan_results, scan_dwell_ms, scan_passive, scan_show_hidden).
  • Callback-mode docs say ok is returned after successful initiation, which is not guaranteed (finding #4).
  • The example code in the guide is missing a comma after the {scan_done, fun got_scan_results/1} entry (line ~105).

10. 🔵 LOW — Tests codify the cancel race rather than catching it

File: test_wifi_scan.erl

  • cancel_scan_test/0 tests concurrent scans / {error, busy}, not actual cancellation.
  • network_stop_while_scanning_test/0 accepts both {error, canceled} and real results, codifying the nondeterminism rather than asserting deterministic behavior.

Recommended additional test coverage:

  • Stop during scan should produce one deterministic terminal outcome
  • No second scan callback/result should arrive after cancel
  • Callback mode with invalid options should not return ok

11. 🔵 LOW — Copyright year is 2023 in new 2024 files

Files: test_wifi_scan.erl, wifi_scan.erl, wifi_scan_callback.erl

All new files have Copyright (c) 2023 but the commit dates are from 2024.


Overall Assessment

The happy-path implementation is solid and well-structured — the C driver carefully handles heap allocation, atom pre-creation, and per-chip memory limits. The Erlang API design with blocking/callback modes is flexible. Documentation and examples are thorough.

The main concern is the scan lifecycle state machine: scan completion is not single-owner/single-terminal, which leads to races between scan_results and scan_canceled messages, potential ScanData leaks on cancel/stop, and nondeterministic behavior that the tests themselves acknowledge. The recommended fix is to treat scan as a small state machine with one terminal event driven by the WIFI_EVENT_SCAN_DONE status field.

@UncleGrumpy
Collaborator Author

UncleGrumpy commented Mar 30, 2026

  1. The stop handling has been updated so that when the driver is stopped, any running scans must cancel successfully before the stop command is actually sent, or an error returned. When a scan is canceled a scan done event is triggered, this will send whatever result (including no networks found) the in-flight scan had obtained through the return results path, freeing the allocated ScanData and removing the handler. The stop procedure no longer removes the handler, leaving it for the callback routine to free the ScanData and remove the handler.
  2. fixed
  3. fixed
  4. fixed
  5. fixed
  6. A special handle_info/2 has been added to catch any OOM errors before or during the creation of the scan_results atom. After scan_results has been added to the atom table (which only happens the first time a scan is performed), any errors in the callback will be wrapped in a scan_results tuple.
  7. fixed
  8. fixed
  9. fixed
  10. The test did have a few extra matches (which would never occur and have been removed). Different atoms are now used to clarify the intention of the test; lost_race was a deceptive holdover from when I was trying to follow IDF behavior and cancel any already-active scan when a new one was requested. That approach had multiple race issues, which have been eliminated by allowing the first scan to finish and denying any other concurrent scan requests with an {error, busy} from the gen_server.
  11. Some files were started in late 2023, but not committed until 2024, not sure how big of a deal that really is, or which is the correct choice. The header for wifi_scan_callback.erl was just copied over, it has been updated to reflect the actual creation year.

@petermm
Contributor

petermm commented Mar 31, 2026

I believe these are the last nitpicks - great work!

https://ampcode.com/threads/T-019d42c8-9813-7760-b270-1e5b7cf2af44

wifi_scan PR Review Issues

1. Race condition: blocking scan caller may not receive reply on shutdown

Files: network.erl:775-784

When stop/0 is called during a blocking scan (scan_receiver = {reply, From}), the
cancel_scan command triggers esp_wifi_scan_stop(), which asynchronously fires a
WIFI_EVENT_SCAN_DONE event. The scan_done_handler will eventually send scan_results
back to the gen_server, which replies to the scan caller (line 775-776).

However, on receiving {scan_canceled, {ReplyTo, shutdown}, ok} (line 778), the gen_server
replies ok to the stop caller and immediately returns {stop, normal, ...}. If the
gen_server terminates before the async scan_results message arrives, the blocked wifi_scan
caller gets a noproc exit instead of a clean {ok, Results} or {error, scan_canceled}.

Suggestion: Before stopping, reply to the scan caller with an error so it isn't stranded:

 handle_info({Ref, {scan_canceled, {ReplyTo, Next}, ok}}, #state{ref = Ref} = State) ->
     gen_server:reply(ReplyTo, ok),
     case Next of
         shutdown ->
-            {stop, normal, State#state{scan_receiver = undefined}};
+            case State#state.scan_receiver of
+                {reply, ScanFrom} ->
+                    gen_server:reply(ScanFrom, {error, stopped});
+                _ ->
+                    ok
+            end,
+            {stop, normal, State#state{scan_receiver = undefined}};
         _ ->
             {noreply, State#state{scan_receiver = undefined}}
     end;

2. stop/0 fallback path references wrong registered name

File: network.erl:461

The fallback checks erlang:whereis(network_driver) but the port is registered as
network_port (see open_port/0). The C driver also does not reply to stop, so even with
the correct name this path would hang until the 5000ms timeout.

         _ ->
-            case erlang:whereis(network_driver) of
-                Driver when is_pid(Driver) ->
-                    Ref = make_ref(),
-                    Driver ! {?SERVER, Ref, stop},
-                    receive
-                        {Ref, Result} -> Result
-                    after 5000 ->
-                        {error, timeout}
-                    end;
-                undefined ->
-                    ok
-            end
+            ok
     end.

3. wait_scan_start_reply/3 has no timeout — can wedge gen_server

File: network.erl:842-850

In callback mode, handle_call({scan,...}) blocks inside wait_scan_start_reply/3.
If the port reply is lost or delayed, the gen_server is stuck in that receive forever.
Subsequent calls (stop, sta_status, etc.) will all queue behind it.

 wait_scan_start_reply(Ref, Dispatch, State) ->
     receive
         {Ref, ok} ->
             {reply, ok, State#state{scan_receiver = Dispatch}};
         {Ref, {error, _} = Error} ->
             {reply, Error, State#state{scan_receiver = undefined}};
         {Ref, {scan_results, {error, _} = Error}} ->
             {reply, Error, State#state{scan_receiver = undefined}}
+    after 10000 ->
+            {reply, {error, timeout}, State#state{scan_receiver = undefined}}
     end.

4. network_properties() type is missing the hidden key

File: network.erl:245-253

The runtime map and all examples/tests include hidden, but the typespec omits it.

 -type network_properties() ::
     #{
         authmode := auth_type(),
         bssid := bssid_t(),
         channel := wifi_channel(),
+        hidden := boolean(),
         rssi := dbm(),
         ssid := ssid()
     }.
-%% A map of network properties with the keys: `ssid', `rssi', `authmode', `bssid', and `channel'
+%% A map of network properties with the keys: `ssid', `rssi', `authmode', `bssid', `channel', and `hidden'

5. UNUSED(event_data) macro contradicts actual usage in scan_done_handler

File: network_driver.c:851-857

event_data is marked UNUSED but is dereferenced two lines later. Some compilers may
optimize it away or emit warnings.

 static void scan_done_handler(void *arg, esp_event_base_t event_base, int32_t event_id, void *event_data)
 {
-    UNUSED(event_data)
     struct ScanData *data = (struct ScanData *) arg;

Collaborator

@bettio bettio left a comment

Everything good, except a couple of minor changes. I think we are close to merge.

// (scan_done_handler) is registered and unregistered per request. We catch this here so that
// we can subscribe to all wifi events in network_start, otherwise each event needs to be
// subscribed and unsubscribed individually.
asm("nop");
Collaborator

Are you sure this is really needed? It looks super suspicious: an empty

case WIFI_EVENT_SCAN_DONE: {
    break;

should not be removed by the optimizer, since that would alter program semantics.

Collaborator Author

should not be removed by the optimizer, since that would alter program semantics.

Indeed, this is not needed. I added this while testing and debugging: I was receiving spurious "Unhandled wifi event: 1" messages (event 1 is WIFI_EVENT_SCAN_DONE) in the gen_server, and I thought that perhaps that was due to this being optimized away, but as you point out that shouldn't happen, since it would break the control flow. I am no longer seeing those spurious messages, so that must have been a separate bug that I already resolved.

static void scan_done_handler(void *arg, esp_event_base_t event_base, int32_t event_id, void *event_data)
{
UNUSED(event_data)
struct ScanData *data = (struct ScanData *) arg;
Collaborator

In C, conversion from void * to any object pointer type is implicit, so no additional cast is needed.
This should work perfectly: struct ScanData *data = arg;

Collaborator Author

Thanks for the reminder, not sure why I did that. fixed.

return globalcontext_make_atom(global, atom) != term_invalid_term();
}

static bool ensure_scan_atoms_exist(GlobalContext *global)
Collaborator

Why do we do this instead of adding the necessary atoms to the platform atoms?

Collaborator Author

It’s a lot of atoms to have in the table if this one function is never used. There is a penalty the first time a scan is performed, when the atoms are created, which seemed like a fair trade-off.

{stop, normal, State};
handle_info({Ref, {scan_canceled, {_, ReplyTo}, Error}}, #state{ref = Ref} = State) ->
gen_server:reply(ReplyTo, Error),
{noreply, State};
Collaborator

We probably want to clear scan_receiver and maybe run callback?

Collaborator Author

In theory there should never be an error when a scan is canceled internally; the only error returns are ESP_ERR_WIFI_NOT_INIT, ESP_ERR_WIFI_NOT_STARTED, and ESP_ERR_WIFI_STATE. The first two are prevented by checks before a cancel scan is used. ESP_ERR_WIFI_STATE indicates that the driver is negotiating a connection to an AP. This would only be emitted by a scan that was triggered by starting a connection to an access point by ssid, where the driver internally initiates the scan to find the specified access point. Any scans initiated by the user would return the results to the callback (or caller) and clear the scan_receiver before they would initiate a connection; otherwise the scan would fail with an error if an association was already being initiated.

In this case it seems better to let any scan results or errors propagate through the internal scan_done_handler and make their way to the callback or caller, instead of having the waiting caller of the scan (for direct calls without a callback) hang forever, or sending an erroneous canceled message if we failed to cancel the scan. Either way, this is only used at shutdown, so sending any error about the cancel failure to the callback seems unnecessary; it is the process that initiated the shutdown that is concerned with the cancel results. The callback (or caller) that is concerned about the actual scan results will get an {error, canceled} if its scan was terminated.

// to access the free'd data and cause a hard crash.
free(data);
}
return;
Collaborator

We don't send {error, unregister_handler} here, so scan_receiver is set to undefined and not to blocked, maybe allowing to register a second handler?

Collaborator Author

Good catch. I only added the block at the last minute; otherwise there could be a memory leak if other handlers were registered. I forgot to go back and update all of the early return errors. I addressed this with a goto for readability and size.

esp32 -> ok;
_ -> error(unsupported_platform)
end,
case erlang:whereis(?SERVER) of
Collaborator

This is not atomic. Instead, you should try / catch:

try
    gen_server:call(?SERVER, get_config)
catch
    exit:{noproc, _} -> {error, not_started}
end

Collaborator Author

fixed

-type sta_status() ::
associated | connected | connecting | degraded | disconnected | disconnecting | inactive.

-type scan_options() ::
Collaborator

This is a single option:

Suggested change
- -type scan_options() ::
+ -type scan_option() ::

Collaborator Author

fixed.

Use new version 2 logging on WiFi6 capable devices (ESP32-C5 and
ESP32-C61) to keep binary size within the allotted partition size.

Signed-off-by: Winford <winford@object.stream>
Avoid a possible crash when connecting to an open network, by not
dereferencing a NULL pointer. Fix incorrect cast that may lead to
crashes or unexpected behaviors during client connection events when AP
mode is enabled.

Signed-off-by: Winford <winford@object.stream>
Corrects the type name db() to the more correct `dbm()`, and adds a brief
edoc explanation for the value.

Signed-off-by: Winford <winford@object.stream>
Signed-off-by: Winford <winford@object.stream>
Adds documentation for `network:wifi_scan/0,1` and updates details for
`network:sta_rssi/0`.

Signed-off-by: Winford <winford@object.stream>
@petermm
Contributor

petermm commented Apr 5, 2026

PR Review: Add network:wifi_scan/0,1 to ESP32 Network Driver

Commits reviewed:

  • 44fba79c7 — Add network:wifi_scan/0,1 to esp32 network driver
  • d725788e1 — Document new wifi functions

Files changed: 10 files, +1615 / -34 lines

Verdict: Good feature addition. Fix the stop/cancel handler lifecycle and the bad-dwell crash before shipping.


Findings

1. 🟡 MEDIUM (revised) — stop_network() during active scan may leak ScanData and stale handler

Files: network_driver.c (~L1343-1376, L1684-1690)

Originally flagged as HIGH for both cancel_scan() and stop_network(). On closer analysis, cancel_scan() is fine: esp_wifi_scan_stop() triggers WIFI_EVENT_SCAN_DONE (per ESP-IDF docs), which scan_done_handler catches and cleans up (unregister + free). The generic event_handler's no-op WIFI_EVENT_SCAN_DONE case does not block the specific handler — both are registered subscribers.

The remaining concern is the cancel→shutdown race: When stop/0 is called during an active scan, the Erlang side sends cancel_scan to the port. cancel_scan() calls esp_wifi_scan_stop() which posts WIFI_EVENT_SCAN_DONE asynchronously to the ESP-IDF event loop task, then immediately replies {scan_canceled, ...}. Erlang receives this, does {stop, normal, ...} → terminate/2 → sends stop to port → C-side stop_network() calls esp_wifi_stop()/esp_wifi_deinit(). This can all happen before the async scan_done_handler fires, leaving ScanData leaked and the handler registered.

Note: the scan_receiver = undefined direct-stop branch (L751) is safe — gen_server state only updates when messages are handled, so if scan results are queued but unprocessed, scan_receiver is still active and the cancel path is taken.

Suggested fix: Don't let Erlang proceed to shutdown until scan cleanup is confirmed. Either:

  1. Move the final {scan_canceled, ...} reply into scan_done_handler itself (so Erlang only stops after cleanup is done), or
  2. In stop_network(), explicitly unregister scan_done_handler and free ScanData if a scan is active (requires tracking scan state in driver-owned state, ideally using esp_event_handler_instance_register/unregister).

2. 🟡 MEDIUM — network:stop/0 mishandles the blocked state

File: network.erl (~L751-756, L798-802)

If scan handler unregistration fails, handle_info/2 sets scan_receiver = blocked. But handle_call(stop_network, ...) only stops immediately when scan_receiver =:= undefined. For blocked, it goes through the cancel path even though no scan is active, and the server may stay alive.

Suggested fix:

-handle_call(stop_network, From, #state{scan_receiver = undefined} = State) ->
+handle_call(stop_network, From, #state{scan_receiver = ScanReceiver} = State)
+    when ScanReceiver =:= undefined; ScanReceiver =:= blocked ->
     gen_server:reply(From, ok),
     {stop, normal, State};

3. 🟡 MEDIUM — Unregister failure in failed-scan path not surfaced to Erlang

File: network_driver.c (~L860-884)

In scan_done_handler(), when scan_done->status != 0, the code sends {error, scan_failed}. If esp_event_handler_unregister(...) then fails, it only logs but does not send {error, {unregister_handler, Reason}}. This makes the behavior inconsistent with other unregister-failure paths, and Erlang never moves to blocked state.

Suggested fix: Reuse the same unregister-failure signaling used in send_scan_results() — send {scan_results, {error, {unregister_handler, Reason}}} so Erlang moves to blocked.


4. 🟡 MEDIUM — wifi_scan/1 crashes on invalid dwell before driver validation

File: network.erl (~L614-623)

The Erlang wrapper computes ComputedTimeout = (Dwell * NumChannels) before validating that Dwell is an integer. So network:wifi_scan([{dwell, foo}]) crashes with badarith instead of returning an error tuple.

Suggested fix:

     Dwell =
         case {proplists:get_value(dwell, Options), Passive} of
             {undefined, false} -> ?DEFAULT_ACTIVE_DWELL;
             {undefined, _} -> ?DEFAULT_PASSIVE_DWELL;
             {Value, _} -> Value
         end,
-    {NumChannels, DefaultTimeout} = get_num_channels_timeout(),
-    ComputedTimeout = (Dwell * NumChannels),
-    Timeout = erlang:max(DefaultTimeout, ComputedTimeout) + ?GEN_RESPONSE_MS,
-    case erlang:whereis(?SERVER) of
-        undefined -> {error, not_started};
-        _ -> gen_server:call(?SERVER, {scan, Options}, Timeout)
-    end.
+    case is_integer(Dwell) andalso Dwell >= 1 andalso Dwell =< 1500 of
+        false ->
+            {error, badarg};
+        true ->
+            {NumChannels, DefaultTimeout} = get_num_channels_timeout(),
+            ComputedTimeout = Dwell * NumChannels,
+            Timeout = erlang:max(DefaultTimeout, ComputedTimeout) + ?GEN_RESPONSE_MS,
+            case erlang:whereis(?SERVER) of
+                undefined -> {error, not_started};
+                _ -> gen_server:call(?SERVER, {scan, Options}, Timeout)
+            end
+    end.
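
The validation step in the fix above can also be pulled into a small pure function, which makes it easy to test without a running driver. This is a minimal sketch, not code from the PR: the module name `dwell_check` and the function `validate_dwell/1` are hypothetical, and the 1..1500 bounds are copied from the suggested fix rather than confirmed against the driver.

```erlang
-module(dwell_check).
-export([validate_dwell/1]).

%% Bounds copied from the suggested fix; treat them as the reviewer's
%% proposal, not as confirmed driver limits.
-define(MIN_DWELL, 1).
-define(MAX_DWELL, 1500).

%% Returns {ok, Dwell} for an in-range integer and {error, badarg} for
%% anything else, so the caller never reaches the timeout arithmetic
%% with a non-integer value.
-spec validate_dwell(term()) -> {ok, pos_integer()} | {error, badarg}.
validate_dwell(Dwell) when is_integer(Dwell), Dwell >= ?MIN_DWELL, Dwell =< ?MAX_DWELL ->
    {ok, Dwell};
validate_dwell(_Other) ->
    {error, badarg}.
```

With a helper like this, `wifi_scan/1` would only compute `Dwell * NumChannels` after matching `{ok, Dwell}`.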

5. 🟡 MEDIUM — Docs claim stop/0 is unimplemented

File: network-programming-guide.md (~L462-472)

The guide still says "Stop is currently unimplemented." — this is false and misleading given the new scan-aware shutdown behavior.

Suggested fix:

-```{caution}
-Stop is currently unimplemented.
-```
+`network:stop/0` stops the network server and underlying driver.
+If called while a WiFi scan is in progress, the scan caller or callback may
+receive either the final scan result or `{error, canceled}`.

6. 🔵 LOW — API/spec inconsistency: wifi_scan throws on non-ESP32

File: network.erl (~L605-613, L631-639)

wifi_scan/0,1 raises error(unsupported_platform) but the spec only advertises ok | {ok, scan_results()} | {error, term()}.

Suggested fix: Return {error, unsupported_platform} for consistency with the rest of the API surface.

-        _ -> error(unsupported_platform)
+        _ -> {error, unsupported_platform}

7. 🔵 LOW — get_scan_state reports blocked as active

File: network.erl (~L744-749)

Scanning =
    case Active of
        undefined -> inactive;
        _ -> active
    end,

The blocked state (meaning future scans are permanently denied) is reported as active, which is misleading.

Suggested fix:

 Scanning =
     case Active of
         undefined -> inactive;
+        blocked -> blocked;
         _ -> active
     end,

8. 🟡 MEDIUM — Test coverage gaps

File: test_wifi_scan.erl

Current tests cover: happy path, concurrent scan denial, stop-while-scanning. Missing:

  • Invalid options ({results, 0}, {results, 100}, {dwell, foo}, {passive, foo})
  • wifi_scan/0 default-option path
  • Callback delivery to a pid() (not just function callback)
  • not_started / no_sta_mode error paths
  • AP+STA mode behavior

Also, the happy-path test asserts <<"Wokwi-GUEST">> is present — brittle outside the Wokwi simulator.
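
One way to cover the invalid-option cases listed above is a helper that feeds each bad option set to the scan function and reports any that crash or fail to return an error tuple. This is a rough sketch under stated assumptions: the module `scan_option_checks` and its functions are hypothetical, the scan function is passed in as an argument so the helper can run against a stub, and the only thing assumed about the real API is that invalid options should yield `{error, _}`.

```erlang
-module(scan_option_checks).
-export([invalid_options/0, check_invalid_options/1]).

%% Option sets named in the review as missing from the current tests.
invalid_options() ->
    [[{results, 0}], [{results, 100}], [{dwell, foo}], [{passive, foo}]].

%% ScanFun is expected to behave like network:wifi_scan/1.
%% Returns the option sets that did NOT produce an {error, _} tuple,
%% counting a raised exception as a failure too. An empty list means
%% every invalid option set was rejected cleanly.
check_invalid_options(ScanFun) when is_function(ScanFun, 1) ->
    [Opts || Opts <- invalid_options(), not returns_error(ScanFun, Opts)].

returns_error(ScanFun, Opts) ->
    try ScanFun(Opts) of
        {error, _Reason} -> true;
        _Other -> false
    catch
        _Class:_Exception -> false
    end.
```

On a device, a test could assert `[] = scan_option_checks:check_invalid_options(fun network:wifi_scan/1)`. With the current wrapper, the `{dwell, foo}` case would be expected to show up in the returned list because of the badarith crash described in item 4.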


9. 🔵 LOW — Scan error shapes are inconsistent

File: network_driver.c (various)

Some scan failures return atoms (badarg, unsupported_mode, scan_failed, canceled), while others return binary strings from esp_err_to_name(err). This makes caller-side matching awkward and leaks ESP-IDF naming into the Erlang API. Consider normalizing to atoms for common cases and {esp_err, <<"...">>} for raw driver errors.
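
The proposed shape can be illustrated on the Erlang side, even though the actual change would belong in network_driver.c. The sketch below is hypothetical: the module `scan_error`, the function `normalize/1`, and the `{unknown, _}` fallback are not part of the PR, and the binary in the usage note is only an example of an ESP-IDF error name.

```erlang
-module(scan_error).
-export([normalize/1]).

%% Atom reasons (badarg, unsupported_mode, scan_failed, canceled, ...)
%% pass through unchanged; binaries produced by esp_err_to_name() are
%% wrapped in an {esp_err, Name} tuple so callers can match on one
%% tagged shape instead of guessing between atoms and strings.
-spec normalize(term()) -> {error, term()}.
normalize(Reason) when is_atom(Reason) ->
    {error, Reason};
normalize(Name) when is_binary(Name) ->
    {error, {esp_err, Name}};
normalize(Other) ->
    %% Fallback for shapes not described in the review.
    {error, {unknown, Other}}.
```

A caller could then match `{error, canceled}` and `{error, {esp_err, <<"ESP_ERR_WIFI_NOT_STARTED">>}}` in the same `case` without special-casing binaries.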


Summary

| # | Severity | Issue |
|---|----------|-------|
| 1 | 🟡 MEDIUM | stop_network() during active scan may leak ScanData (cancel path is fine) |
| 2 | 🟡 MEDIUM | stop/0 mishandles blocked state |
| 3 | 🟡 MEDIUM | Unregister failure in failed-scan path not surfaced |
| 4 | 🟡 MEDIUM | wifi_scan/1 crashes on invalid dwell |
| 5 | 🟡 MEDIUM | Docs claim stop/0 is unimplemented |
| 6 | 🔵 LOW | wifi_scan throws instead of returning error on non-ESP32 |
| 7 | 🔵 LOW | get_scan_state reports blocked as active |
| 8 | 🟡 MEDIUM | Test coverage gaps |
| 9 | 🔵 LOW | Inconsistent scan error shapes |

Recommended before merge: Fix items 2, 4. Items 1, 3, 5, 8 are strongly recommended.

Development

Successfully merging this pull request may close these issues.

Feature Request: Wifi Scanning

5 participants