[Bug] [SLURM] sbatch errors hang executor #959
Description
If a user (i.e. me :() makes a mistake, e.g. in the resource_dict when using SlurmClusterExecutor, sbatch errors out and the submitting thread of executorlib prints a stack trace saying so, but without the output from sbatch. That's awkward to debug because, as far as I can tell, executorlib also doesn't write the output to a log file. So to debug this situation, I end up going manually to the executor cache directory, sbatch'ing the run script myself, and taking the log from there. If executorlib printed the sbatch output directly, it would be much more helpful.
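To illustrate the kind of behavior I mean, here is a minimal sketch using only the standard library. It is not executorlib's actual submission code: `submit` and `SubmissionError` are hypothetical names, and a failing Python subprocess stands in for `sbatch` so the snippet runs without SLURM. The point is only that the captured stdout and stderr of the failed command end up in the exception message.

```python
import subprocess
import sys


class SubmissionError(RuntimeError):
    """Hypothetical error type that carries the scheduler's own output."""


def submit(command):
    """Run a submission command; on failure, raise with its stdout and stderr attached."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise SubmissionError(
            f"{command[0]} exited with code {result.returncode}\n"
            f"stdout:\n{result.stdout}\n"
            f"stderr:\n{result.stderr}"
        )
    return result.stdout.strip()


# Stand-in for a failing `sbatch` call (assumption: the real command would be
# something like ["sbatch", "<run script>"]).
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('invalid option\\n'); sys.exit(1)",
]

try:
    submit(failing)
except SubmissionError as err:
    message = str(err)
    print(message)
```

With something along these lines, the stack trace would already contain what sbatch complained about, and the manual trip to the cache directory would not be needed.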
There's also a tangential issue: if this happens while I'm wait()'ing on a future the executor returned, my notebook hangs and is not interrupted by the failure inside the executor. This is kind of annoying, because if I manually interrupt my kernel at this point, the whole kernel crashes. I'm not sure how much control executorlib has over this part, though.
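For the hang, what I would hope for is roughly the following, sketched with plain `concurrent.futures` and `threading`. This is an assumption about the desired behavior, not a description of how executorlib works internally: `run_in_background` and `failing_submission` are hypothetical names. If the background thread forwards its exception to the future, anyone waiting on that future gets the error instead of blocking forever.

```python
import threading
from concurrent.futures import Future


def run_in_background(task):
    """Run `task` in a thread and report its result or failure through a Future."""
    future = Future()

    def worker():
        try:
            future.set_result(task())
        except BaseException as exc:
            # Forward the failure; otherwise the future never completes
            # and every waiter blocks indefinitely.
            future.set_exception(exc)

    threading.Thread(target=worker, daemon=True).start()
    return future


def failing_submission():
    # Stand-in for a submission step where sbatch fails.
    raise RuntimeError("sbatch failed")


future = run_in_background(failing_submission)
try:
    # The timeout is only a safety net for this sketch.
    future.result(timeout=5)
    outcome = "no error"
except RuntimeError as err:
    outcome = f"raised: {err}"

print(outcome)
```

Here the waiting code sees the `RuntimeError` as soon as the submission fails, so there is no need to interrupt the kernel.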