Checkpointing State Saving and Loading (NGWPC-10159) by idtodd · Pull Request #167 · NGWPC/ngen

idtodd · 2026-04-06T15:14:09Z

Add checkpointing options for the NGEN simulation. These are controlled by the state_saving property of the realization config JSON. When running with checkpoint saving enabled, each time a checkpoint step is hit, a new subfolder named after the checkpoint is created and the model states are written there. After the states are saved successfully, the folders of any prior checkpoints are deleted. When loading checkpoints, NGEN goes through the subfolders of the checkpointing path in reverse numeric order and loads from the first folder it finds with a complete, loadable state.

For saving checkpoints, an item like the following must be added to the state_saving array:

{
  "direction": "save",
  "label": "...", // used only for reporting
  "path": "...", // path to the root folder states will be saved to. A new subfolder will be created for each checkpoint
  "type": "FilePerUnit",
  "when": "Checkpoint",
  "frequency": 1000 // integer number of steps to run between checkpoints
}
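The save behavior described above (write a new subfolder, then remove prior checkpoints only after the write succeeds) can be sketched roughly as follows. This is an illustration in Python rather than the C++ in this PR. The `.state` file suffix, the function names, the use of the step number as the folder name, and the `step % frequency == 0` trigger are all assumptions made for the example, not details taken from the implementation.

```python
import shutil
import tempfile
from pathlib import Path


def is_checkpoint_step(step: int, frequency: int) -> bool:
    """Assumed trigger: checkpoint after every `frequency` completed steps."""
    return step > 0 and step % frequency == 0


def save_checkpoint(root: Path, step: int, states: dict[str, bytes]) -> Path:
    """Write one file per unit into root/<step>, then prune older folders."""
    target = root / str(step)
    target.mkdir(parents=True, exist_ok=True)
    for unit_id, blob in states.items():
        # Hypothetical file naming: one "<unit>.state" file per unit.
        (target / f"{unit_id}.state").write_bytes(blob)
    # Prune only after every state file has been written, so a failed
    # save never removes the last good checkpoint.
    for child in root.iterdir():
        if child.is_dir() and child != target:
            shutil.rmtree(child)
    return target


def demo_rotation() -> list[str]:
    """Run 3000 fake steps with frequency 1000 and list surviving folders."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for step in range(1, 3001):
            if is_checkpoint_step(step, frequency=1000):
                save_checkpoint(root, step, {"cat-1": b"a", "cat-2": b"b"})
        return sorted(p.name for p in root.iterdir())
```

In this sketch only the newest folder survives a run. The commit history mentions deleting state data at least two checkpoints old, so the actual retention rule may keep one more folder than shown here.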

For loading checkpoints, an item like the following must be added to the state_saving array:

{
  "direction": "load",
  "label": "...", // used only for reporting
  "path": "...", // path to the root folder checkpoint saves were saved to. The last subfolder with all required states will be selected when loading
  "type": "FilePerUnit",
  "when": "Checkpoint"
}
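The folder selection on load (reverse numeric order, first folder with a complete state) might look something like the sketch below. It is written in Python for brevity and is not the code in this PR. How NGEN decides that a folder is complete is not stated here, so the check used (every required unit has a `.state` file) and all names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def is_complete(folder: Path, required_units: list[str]) -> bool:
    """Hypothetical completeness test: one state file per required unit."""
    return all((folder / f"{unit}.state").is_file() for unit in required_units)


def find_latest_checkpoint(root: Path, required_units: list[str]) -> Optional[Path]:
    """Return the highest-numbered subfolder that holds a complete state."""
    numbered = [p for p in root.iterdir() if p.is_dir() and p.name.isdecimal()]
    for folder in sorted(numbered, key=lambda p: int(p.name), reverse=True):
        if is_complete(folder, required_units):
            return folder
    return None


def demo_selection() -> Optional[str]:
    """Checkpoint 2000 is missing a unit, so 1000 should be selected."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        layout = (("1000", ["cat-1", "cat-2"]), ("2000", ["cat-1"]))
        for step, units in layout:
            (root / step).mkdir()
            for unit in units:
                (root / step / f"{unit}.state").write_bytes(b"x")
        found = find_latest_checkpoint(root, ["cat-1", "cat-2"])
        return found.name if found else None
```

Sorting by `int(p.name)` rather than by the raw string matters: a plain string sort would place "900" after "1000".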

Additions

Parsing for checkpointing save and load configurations.
Methods on NgenSimulation and BMI interfaces for loading a checkpointing state. The main difference between the checkpointing and hot start methods is that there is no additional step to reset the BMI's internal time after loading a state.
Methods on NgenSimulation and BMI interfaces for saving a checkpointing state. The main difference between the checkpointing and end of run methods is that additional data regarding the time and output state of the NgenSimulation and Layer objects is saved.
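As a rough illustration of what parsing the checkpoint entries involves, the Python sketch below filters a state_saving array for "when": "Checkpoint" items and validates them. It assumes state_saving sits at the top level of the realization config, that "frequency" is required only for saves, and it uses a hypothetical `CheckpointConfig` type; the label and path values are placeholders. The JSON in the example omits the `//` comments shown in the description because a standard JSON parser rejects them.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    direction: str
    label: str
    path: str
    frequency: Optional[int] = None  # only meaningful for "save"


def parse_checkpoint_configs(realization: dict) -> list[CheckpointConfig]:
    """Collect and validate the checkpoint entries of state_saving."""
    configs = []
    for item in realization.get("state_saving", []):
        if item.get("when") != "Checkpoint":
            continue  # non-checkpoint entries are out of scope for this sketch
        direction = item["direction"]
        if direction not in ("save", "load"):
            raise ValueError(f"unknown direction: {direction!r}")
        frequency = None
        if direction == "save":
            frequency = item.get("frequency")
            valid = isinstance(frequency, int) and not isinstance(frequency, bool)
            if not valid or frequency <= 0:
                raise ValueError("save checkpoints need a positive integer frequency")
        configs.append(
            CheckpointConfig(direction, item["label"], item["path"], frequency)
        )
    return configs


# Placeholder label and path values, for illustration only.
EXAMPLE = json.loads("""
{
  "state_saving": [
    {"direction": "save", "label": "demo-save", "path": "checkpoints",
     "type": "FilePerUnit", "when": "Checkpoint", "frequency": 1000},
    {"direction": "load", "label": "demo-load", "path": "checkpoints",
     "type": "FilePerUnit", "when": "Checkpoint"}
  ]
}
""")
```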

Removals

Changes

Some generic save and load methods were renamed to specify they're used with checkpointing. Because checkpointing requires extra data regarding the current time of the simulation, separate state saving and loading methods will be maintained to determine what state needs to be preserved and whether additional processing needs to happen after a state is restored.
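One way to picture why the load paths stay separate is the toy Python model below. It reflects one reading of this description: a hot start restores state and then resets the model clock, while a checkpoint load restores state and keeps the saved time. `ToyModel` and both loader functions are hypothetical and do not correspond to the BMI or NgenSimulation interfaces.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Stand-in for a model with internal state and an internal clock."""
    storage: float = 0.0
    current_time: int = 0

    def get_state(self) -> dict:
        return {"storage": self.storage, "current_time": self.current_time}

    def set_state(self, state: dict) -> None:
        self.storage = state["storage"]
        self.current_time = state["current_time"]


def load_hot_start(model: ToyModel, state: dict) -> None:
    """Restore state, then restart the clock for a new simulation period."""
    model.set_state(state)
    model.current_time = 0  # the extra post-load step


def load_checkpoint(model: ToyModel, state: dict) -> None:
    """Restore state and resume at the saved time; no reset afterwards."""
    model.set_state(state)
```

For a state saved at time 3600, `load_hot_start` leaves the model at time 0 and `load_checkpoint` leaves it at 3600, with the same storage value in both cases.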

Testing

Screenshots

Notes

The current implementation will not preserve the catchment outputs (CSVs). There is ongoing discussion about changing the catchment outputs, primarily converting them to NetCDF and writing periodic outputs instead of one lump output at the end. Implementing preservation now would likely be wasted effort once those changes land.

Todos

To account for delayed MPI messages from remote nexuses, an MPI_Barrier is run before saving a checkpoint. This could slow down the program, so an alternative approach would be to also store MPI messages that have not been received so they can be resent. Properly coordinating which messages are sent and received on checkpoint save and load would likely be a very large effort, so we would probably only want to do this if we notice a significant performance impact from the MPI_Barrier use in checkpointing.

Checklist

Testing checklist (automated report can be put here)

Target Environment support

Linux

idtodd added 4 commits April 1, 2026 12:26

Initial framework for state save checkpointing

4b08b6b

Delete state data at least two checkpoints old

29f211f

Double sync processes to prevent deleting good checkpoint folder

ee7df47

Add MPI checkpoint load safety

79da0a3

idtodd wants to merge 4 commits into development from idt-save-state-checkpointing
