[Bug] [SLURM] sbatch errors hang executor #959

@pmrv

If a user (i.e. me :() makes a mistake in the resource_dict when using SlurmClusterExecutor, sbatch crashes, and the submitting thread of executorlib prints a stack trace saying so, but without the output from sbatch. That's awkward to debug because, afaict, executorlib also doesn't write the output to a log file. So to debug this situation, I end up manually going to the executor cache directory, sbatch'ing the run script, and taking the log from there. If executorlib printed the sbatch output directly, it'd be much more helpful.
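To illustrate the kind of error reporting I mean, here is a minimal sketch using only the standard library. It is not executorlib's actual submission code: `submit_and_report` is a hypothetical helper, and the failing command is a Python one-liner standing in for `sbatch` (with a made-up error message) so the example runs without SLURM.

```python
import subprocess
import sys


def submit_and_report(command):
    """Run a submission command; on failure, raise with its captured output.

    Hypothetical helper, not part of executorlib.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Include stdout and stderr so the user sees why the submission failed.
        raise RuntimeError(
            f"{command[0]} exited with code {result.returncode}\n"
            f"stdout:\n{result.stdout}\n"
            f"stderr:\n{result.stderr}"
        )
    return result.stdout.strip()


# Stand-in for a failing sbatch call; the error text is invented for the demo.
fake_sbatch = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('fake sbatch: bad resource request\\n'); sys.exit(1)",
]

try:
    submit_and_report(fake_sbatch)
    message = ""
except RuntimeError as err:
    message = str(err)
    print(message)
```

With something like this, the stack trace from the submitting thread would already carry the scheduler's complaint, and no manual re-run from the cache directory would be needed.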

There's also a tangential issue: if this happens while I'm wait()'ing on a future the executor returned, my notebook hangs, because the failure inside the executor doesn't interrupt the wait. This is kinda annoying, because if I manually interrupt the kernel at this point, the whole kernel crashes. I'm not sure how much control executorlib has over this part, though.
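One way the hang could be avoided, sketched with plain `concurrent.futures` and `threading`: if the background thread catches the submission failure and sets it on the future, `wait()` returns instead of blocking forever. I don't know how executorlib's submitting thread is structured, so `submit_in_background` and `failing_submit` are hypothetical names used only for illustration.

```python
import threading
from concurrent.futures import Future, wait


def submit_in_background(submit, future):
    """Run `submit` in a thread and forward its result or error to `future`.

    Hypothetical helper, not part of executorlib.
    """
    def worker():
        try:
            future.set_result(submit())
        except Exception as err:
            # Propagate the failure so anyone waiting on the future wakes up.
            future.set_exception(err)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def failing_submit():
    # Stand-in for a submission step whose sbatch call crashed.
    raise RuntimeError("submission failed (hypothetical)")


future = Future()
thread = submit_in_background(failing_submit, future)

# The future completes with an exception, so wait() returns promptly.
done, not_done = wait([future], timeout=5)
thread.join()
error = future.exception()
print(type(error).__name__, error)
```

Calling `future.result()` would then re-raise the error in the notebook cell instead of leaving the kernel stuck.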
