Skip to content

Investigate issue #3302 driver behavior when upsd aborts#3368

Open
jimklimov wants to merge 103 commits intonetworkupstools:masterfrom
jimklimov:issue-3302-drvconn
Open

Investigate issue #3302 driver behavior when upsd aborts#3368
jimklimov wants to merge 103 commits intonetworkupstools:masterfrom
jimklimov:issue-3302-drvconn

Conversation

@jimklimov
Copy link
Copy Markdown
Member

@jimklimov jimklimov commented Mar 24, 2026

Start by poking upsdrvcrl for both WIN32 and POSIX builds...

Includes code from PR #3367 to try reproducing the issue.

UPDATE: Maybe specific to dummy-ups, reproduced both for standalone starts of the driver program directly, one driver via upsdrvctl (note: the latter does not seem to propagate the exit-code and returns 0, at least on Windows, probably should indicate an error), and a swarm of drivers via upsdrvctl (also exits with code 0 even if all drivers died abruptly). Sometimes it took several starts of upsd to be killed a few seconds later.

In all these cases the final words were like:

  • upsd
   0.011162     Connected to UPS [UPS2]: dummy-ups-UPS2
   0.011473     Connected to UPS [UPS1]: dummy-ups-UPS1
   0.011641     Connected to UPS [dummy]: dummy-ups-dummy
   0.011762     Your system limits the maximum number of connections to 64
but you requested 348. The server may handle connections
in smaller groups, maybe affecting efficiency and response time.

   0.011989     [D1:6336:upsd.exe] poll_reload: (p)re-allocate 348 entries for polling FDs and handlers
   0.012186     Found 53 UPS defined in ups.conf
...
   0.186789     [D1:6336:upsd.exe] mainloop: UPS [UPSwarm43] driver is still not connected (FD -1)
   0.186938     [D1:6336:upsd.exe] mainloop: UPS [UPSwarm44] driver is not currently connected, trying to reconnect
   0.187080     [D1:6336:upsd.exe] mainloop: UPS [UPSwarm44] driver is still not connected (FD -1)
(end of log)
  • with a Ctrl+C at this point, there is even no clean-up logged (are signals set up later? shouldn't be...)
  • a longer-living upsd sometimes logs the clean-up:
...
   1.842622     [D1:3404:upsd.exe] mainloop: UPS [UPSwarm50] driver is still not connected (FD -1)
   1.870824     [D1:3404:upsd.exe] upsd_cleanup: starting the end-game
   1.871082     [D1:3404:upsd.exe] driver_free: forgetting UPS [dummy] (FD 644)
   1.871209     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPS1] (FD 636)
   1.871347     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPS2] (FD 628)
   1.871459     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPSwarm1] (FD -1)
   1.871547     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPSwarm2] (FD -1)
...
   1.880913     [D1:3404:upsd.exe] driver_free: forgetting UPS [UPSwarm50] (FD -1)
   1.881051     [D1:3404:upsd.exe] upsd_cleanup: finished
  • On the dummy-ups side it seems to always end with the same entering parse_data_file() call (and exit-code 127) after failing to write to the server:
...
   3.773319     [D5:16220:dummy-ups:UPSwarm11] send_to_one: SETAUX outlet.3.id 32
   3.787747     [D6:16220:dummy-ups:UPSwarm11] send_to_one: write 22 bytes to handle 000000000000011C succeeded (ret=22): SETAUX outlet.3.id 32

   3.787949     [D2:16220:dummy-ups:UPSwarm11] send_to_one: sending SETFLAGS outlet.3.id RW STRING
   3.788031     [D5:16220:dummy-ups:UPSwarm11] send_to_one: SETFLAGS outlet.3.id RW STRING
   3.788133     WARNING: send_to_one: write 31 bytes to handle 000000000000011C failed (ret=0), disconnecting.: No error [The pipe is being closed. ]
   3.788250     [D6:16220:dummy-ups:UPSwarm11] send_to_one: failed write: SETFLAGS outlet.3.id RW STRING

   3.788364     [D3:16220:dummy-ups:UPSwarm11] sock_disconnect: disconnecting named pipe handle 000000000000011C
   3.788502     [D5:16220:dummy-ups:UPSwarm11] sock_disconnect: finishing parsing context
   3.788591     [D5:16220:dummy-ups:UPSwarm11] sock_disconnect: relinking the chain of connections
   3.788667     [D5:16220:dummy-ups:UPSwarm11] sock_disconnect: freeing the conn object
   3.788755     [D5:16220:dummy-ups:UPSwarm11] send_to_all: SETINFO driver.state "updateinfo"
   3.788861     [D1:16220:dummy-ups:UPSwarm11] upsdrv_updateinfo...
   4.793998     [D1:16220:dummy-ups:UPSwarm11] entering parse_data_file()
(end of log)

$ echo $?
127

I don't think I've reproduced nor ruled out the problem on non-Windows builds yet.

Per GDB and added debug-logging traces, it seems to crash around malloc() calls, whether in PCONF context init or in vupslog() a bit before it gets there:

...
   8.255625     [D3:15120:dummy-ups:UPSwarm11] sock_connect: new connection on handle 0000000000000120
   8.255752     [D6:15120:dummy-ups:UPSwarm11] sock_arg: Driver on \\.\pipe\dummy-ups-UPSwarm11 is now handling DUMPALL with 1 args
   8.255874     [D2:15120:dummy-ups:UPSwarm11] send_to_one: sending SETINFO device.mfr "EATON | Powerware"
   8.256006     [D5:15120:dummy-ups:UPSwarm11] send_to_one: SETINFO device.mfr "EATON | Powerware"
   8.256165     WARNING: send_to_one: write 39 bytes to handle 0000000000000120 failed (ret=0), disconnecting.: No error [The pipe is being closed. ]
   8.256322     [D6:15120:dummy-ups:UPSwarm11] send_to_one: failed write: SETINFO device.mfr "EATON | Powerware"

   8.256457     [D3:15120:dummy-ups:UPSwarm11] sock_disconnect: disconnecting named pipe handle 0000000000000120
   8.256585     [D5:15120:dummy-ups:UPSwarm11] sock_disconnect: finishing parsing context
   8.256706     [D5:15120:dummy-ups:UPSwarm11] sock_disconnect: relinking the chain of connections
   8.256798     [D5:15120:dummy-ups:UPSwarm11] sock_disconnect: freeing the conn object

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffb8325d990 in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ffb8325d990 in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x00007ffb8325b44d in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#2  0x00007ffb83328a18 in ntdll!RtlRegisterSecureMemoryCacheCallback () from C:\Windows\SYSTEM32\ntdll.dll
#3  0x00007ffb8325d255 in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#4  0x00007ffb8325b44d in ntdll!RtlAllocateHeap () from C:\Windows\SYSTEM32\ntdll.dll
#5  0x00007ffb8144a131 in realloc () from C:\Windows\System32\msvcrt.dll
#6  0x00007ffb81449c45 in msvcrt!calloc () from C:\Windows\System32\msvcrt.dll
#7  0x00007ff6b737f0a9 in xcalloc (number=<optimized out>, size=<optimized out>) at common.c:4760
#8  0x00007ff6b737f12e in vupslog (priority=2, fmt=0x5fed50 "[D5:15120:dummy-ups:UPSwarm11] %s: %.*s", va=0x5ff170 "", use_strerror=0) at common.c:3816
#9  0x00007ff6b737f8ce in s_upsdebugx (level=<optimized out>, level@entry=5, fmt=0x5fed50 "[D5:15120:dummy-ups:UPSwarm11] %s: %.*s",
    fmt@entry=0x7ff6b73a0e99 <__func__.16+681> "%s: %.*s") at common.c:4535
#10 0x00007ff6b7378591 in send_to_all (fmt=fmt@entry=0x7ff6b73a0f6b <__func__.16+891> "SETINFO %s \"%s\"\n") at dstate.c:305
#11 0x00007ff6b73786b0 in vdstate_setinfo (var=var@entry=0x7ff6b739cfbf <__func__.4+1047> "driver.state", fmt=fmt@entry=0x7ff6b73a08d1 <__func__.4+15657> "updateinfo",
    ap=ap@entry=0x5ff5a0 "d") at dstate.c:1386
#12 0x00007ff6b737715d in dstate_setinfo (var=var@entry=0x7ff6b739cfbf <__func__.4+1047> "driver.state", fmt=fmt@entry=0x7ff6b73a08d1 <__func__.4+15657> "updateinfo")
    at dstate.c:1412
#13 0x00007ff6b739aa7f in main (argc=<optimized out>, argv=<optimized out>) at main.c:3268

@jimklimov jimklimov added this to the 2.8.5 milestone Mar 24, 2026
@jimklimov jimklimov added service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time labels Mar 24, 2026
@AppVeyorBot
Copy link
Copy Markdown

@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from fd80697 to 97eca57 Compare March 24, 2026 03:06
@AppVeyorBot
Copy link
Copy Markdown

jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…proctag() is called) [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…ally and flip to specified upsname later [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…sing setproctag() [networkupstools#3302, networkupstools#3368]

Did not work for parallel scanning threads where it would be most useful,
because they are in same process space...

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 24, 2026
…pthreads so far [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@AppVeyorBot
Copy link
Copy Markdown

@jimklimov jimklimov added the CI Entries related to continuous integration infrastructure (here CI = tools + scripts + recipes) label Mar 25, 2026
@jimklimov jimklimov added the portability We want NUT to build and run everywhere possible label Mar 25, 2026
@jimklimov jimklimov moved this to In Progress in NUT for Windows Mar 25, 2026
@jimklimov jimklimov moved this to In Progress in NUT QA Mar 25, 2026
@AppVeyorBot
Copy link
Copy Markdown

@jimklimov
Copy link
Copy Markdown
Member Author

jimklimov commented Mar 25, 2026

NOTE: After #3363 it seems that upsd quickly loops (without delay) over the expected connections while they are not established (e.g. drivers still starting). Need to check if it is a regression.

UPDATE: Older Windows builds did similarly (tested with 2.8.4.1572-1572+g69e282b3b+v2.8.5+rc5 and a small swarm of 50 drivers, to be under 64 connections):

  • when no drivers were started, upsd tried to find talkative sockets every 2 seconds, then slept between messages about "driver still not connected".
  • when at least one driver started, the loop was intensive
  • when all drivers stared, no low-verbosity messages appear in logs, but with bumped verbosity I can see the loop runs intensively to walk all connection handles (visually by the stream of text, about 1.2 cycles per second, with no time-skips/sleeps in the time-stamped log)
  • notably, dummy-ups has a chance to crash after an abrupt server disconnection in that older build too (the focus of this investigation)

Older Linux build (2.8.4.1541.9-1550+g7cd79ab73, with 3 dummy devices from NIT):

  • checked every 2 seconds when no drivers were started, but
  • when one driver was started, the loop reported a stack of adding FD handler and polling 3 filedescriptors; some stats: considered 6 connections, wanted to actually poll 3 and was constrained by maxconn=1024 and chunked by sysmaxconn=1024 3 times in the same tenth of a second, then one more after 0.1 second, and repeated after a whole second. Eventually settled on polling every second - I suppose the first burst was with the driver reporting the data it had, etc.
  • when eventually all drivers were started, there are a couple of polls per second, at repetitive but asymmetric (e.g. NUM.2 and NUM.5) timestamps.

jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…proctag() is called) [networkupstools#3302, networkupstools#3368]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov jimklimov force-pushed the issue-3302-drvconn branch from 5ca8690 to cf14d94 Compare March 25, 2026 17:31
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
jimklimov added a commit to jimklimov/nut that referenced this pull request Mar 25, 2026
…cognized, but pertain to another build configuration [networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…, or that it failed and we would sleep 0.1s [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…OSIX and WIN32, revise logging disconnect events for POSIX [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…spell defaults

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…nablement

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…_disconnect() implementation [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…etc. args for maintainability, update comments and log traces [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov
Copy link
Copy Markdown
Member Author

jimklimov commented Mar 31, 2026

This all seems to work (even survives NUT_DEBUG_LEVEL=1 DUMMY_UPS_SWARM_COUNT=33 UPSLOG_SWARM_COUNT=33 time make check-NIT), but the fix ended up a big code change in the heart of the state machine.

Better include into the next release cycle, so the current one completes in finite time (at the cost of having a known rarely-triggered bug in v2.8.5).

…ing error codes; document the methods [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…pretation of "OK Goodbye" [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov
Copy link
Copy Markdown
Member Author

jimklimov commented Mar 31, 2026

This PR also introduces better NSS error reporting (older methods did not always work) and generally more legible logging messages in upsd:

...
Tue Mar 31 10:05:49 UTC 2026 [INFO] [testcase_sandbox_start_upsd_alone] Query listing from UPSD by UPSC (driver not running yet)
   3.083398     [D2:1188807:upsd] mainloop: polling returned 1 hits
   3.083456     [D3:1188807:upsd] mainloop: Incoming connection from SERVER [FD 3]
   3.083481     [D2:1188807:upsd] Connect from 127.0.0.1
...
   3.085990     [D2:1188807:upsd] mainloop: polling returned 1 hits
   3.086050     [D3:1188807:upsd] mainloop: Incoming data from CLIENT [127.0.0.1, FD 7]
   3.086072     [D6:1188807:upsd] Entering check_command: STARTTLS
   3.086116     [D6:1188807:upsd] check_command: Calling command handler for STARTTLS
   3.086228     [D2:1188807:upsd] write: [destfd=7] [len=12] [OK STARTTLS]
...
   3.119002     Client 127.0.0.1 did not provide any certificate while we require one.
   3.119050     [D1:1188495:upsd] nss_error -12285 (SSL_ERROR_NO_CERTIFICATE) in net_starttls / SSL_ForceHandshake : Unable to find the certificate or key necessary for authentication.
...
   3.156396     [D2:1188807:upsd] mainloop: polling returned 1 hits
   3.156411     [D3:1188807:upsd] mainloop: Disconnect CLIENT [127.0.0.1, FD 7] due to POLLHUP POLLERR: Success
   3.156422     [D2:1188807:upsd] Disconnect from 127.0.0.1

Although on the client side the error is not as visible:

   0.000127     [D1:1188516:upsc] Starting NUT client: upsc
   0.000139     [D1:1188516:upsc:libupsclient] upscli_init_default_connect_timeout: upscli_default_connect_timeout=10.000000 sec assigned from: envvar_secs
   0.000155     [D1:1188516:upsc:localhost:12345] upsname='(null)' hostname='localhost' port='12345'
   0.000781     [D1:1188516:upsc:libupsclient:localhost:12345] NUT_QUIET_INIT_SSL='false' value was not recognized, ignored
   0.000839     Init SSL without certificate database
   0.002457     Connecting in SSL to 'localhost' (no certificate name specified)
   0.056755     Do not intend to authenticate server localhost
   0.056933     Self-certificate name not configured
   0.057112     SSL handshake done successfully with server localhost
   0.057123     Connected to NUT server localhost in SSL
   0.057126     Certificate verification (by client) is disabled
   0.057133     [D1:1188516:upsc:localhost:12345] Calling list_upses()
   0.057296     [D1:1188516:upsc:libupsclient:localhost:12345] upscli_disconnect: We logged out, and server did not reply in a short time frame
   0.057370     [D1:1188516:upsc:localhost:12345] list_upses: got code -1, upserror 37
   0.057378     Error: SSL error #-5978, message too short to be displayed
   0.057627     [D1:1188516:upsc:localhost:12345] clean_exit: finished, exiting
[FATAL] [testcase_sandbox_start_upsd_alone] upsd does not respond on port 12345 (1):
Tue Mar 31 10:01:46 UTC 2026 [INFO] Stopping test daemons
   3.129906     mainloop: Interrupted system call
   3.230119     Signal 15: exiting

I was under the impression that the server would tell the client (maybe in plaintext ERR ... code?) what the trouble is. Not sure what #-5978 was - maybe mis-interpretation of plaintext bytes in what seems to be a TLS channel? Although with security involved, maybe that is too much already, rather than dropping the connection? Gotta investigate later what this is about => #3331

Maybe we should accept the attempt with any cert or lack thereof, just to drop it gracefully?

…ebug messages, update comments [networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…_ERR_SSLERR when (Open)SSL read/write fails during/despite retry loop [networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…k and report errno from select/read/write failures more diligently [networkupstools#3379, networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…or() from read/write failures more diligently, and handle "ERR*" replies with separate logging [networkupstools#3379, networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…t(): report "disconnecting" explicitly [networkupstools#3379]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…le their timing [networkupstools#3379, networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…ake() [networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…itialised SSL client [networkupstools#3331]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…e_timeout_may_disconnect() [networkupstools#3302]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
…ct timeout default [networkupstools#2847]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Entries related to continuous integration infrastructure (here CI = tools + scripts + recipes) Connection stability issues Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time portability We want NUT to build and run everywhere possible service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug

Projects

Status: In Progress
Status: In Progress

Development

Successfully merging this pull request may close these issues.

3 participants