Checkpointing State Saving and Loading (NGWPC-10159)#167
Draft
idtodd wants to merge 4 commits into development
Add checkpointing options for the NGEN simulation. These are controlled by the state_saving property of the realization config JSON.

When running with checkpoint saving enabled, each time a checkpoint step is hit, a new subfolder with the name of the checkpoint is created and the model states are written there. After the states have been saved successfully, the folders of any prior checkpoints are deleted. When loading checkpoints, NGEN goes through the subfolders of the checkpointing path in reverse numeric order and loads from the first folder it finds with a complete state that can be loaded.

For saving checkpoints, an item must be added to the state_saving array that looks like

    {
      "direction": "save",
      "label": "...", // used only for reporting
      "path": "...", // path to the root folder states will be saved to. A new subfolder will be created for each checkpoint
      "type": "FilePerUnit",
      "when": "Checkpoint",
      "frequency": 1000 // integer number of steps run between checkpoints
    }

For loading checkpoints, an item must be added to the state_saving array that looks like

    {
      "direction": "load",
      "label": "...", // used only for reporting
      "path": "...", // path to the root folder checkpoints were saved to. The last subfolder with all required states will be selected when loading
      "type": "FilePerUnit",
      "when": "Checkpoint"
    }

Additions
NgenSimulation and BMI interfaces for loading a checkpointing state. The main difference between the checkpointing and hot start methods is that there is no additional step to reset the BMI's internal time after loading a state.
NgenSimulation and BMI interfaces for saving a checkpointing state. The main difference between the checkpointing and end of run methods is that additional data regarding the time and output state of the NgenSimulation and Layer objects is saved.
Removals
Changes
Testing
Screenshots
Notes
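For reviewers who want to reason about the folder handling in the description, here is a minimal Python sketch of the save-then-prune and newest-complete-folder behavior. It is not the NGEN implementation. It assumes checkpoint subfolders are named with the step number, that a checkpoint counts as complete when every unit has a state file, and that state files are named `<unit>.state`. The names `save_checkpoint` and `find_loadable_checkpoint`, and the file naming, are hypothetical.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def save_checkpoint(root: Path, step: int, states: dict[str, bytes]) -> Path:
    """Write every unit's state into root/<step>, then drop older checkpoints."""
    target = root / str(step)  # assumption: subfolder is named after the step
    target.mkdir(parents=True, exist_ok=True)
    for unit, blob in states.items():
        (target / f"{unit}.state").write_bytes(blob)  # hypothetical file naming
    # Prior checkpoint folders are removed only after all states were written.
    for other in root.iterdir():
        if other.is_dir() and other.name.isdigit() and int(other.name) < step:
            shutil.rmtree(other)
    return target


def find_loadable_checkpoint(root: Path, units: list[str]) -> Path | None:
    """Return the newest numbered folder holding a state file for every unit."""
    numbered = [p for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    for folder in sorted(numbered, key=lambda p: int(p.name), reverse=True):
        if all((folder / f"{unit}.state").is_file() for unit in units):
            return folder
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        units = ["cat-1", "cat-2"]
        save_checkpoint(root, 1000, {u: b"a" for u in units})
        save_checkpoint(root, 2000, {u: b"b" for u in units})
        # Simulate an interrupted save: folder 3000 exists but lacks cat-2.
        (root / "3000").mkdir()
        (root / "3000" / "cat-1.state").write_bytes(b"c")
        print(find_loadable_checkpoint(root, units).name)  # prints 2000
```

In this sketch the incomplete folder 3000 is skipped and 2000 is chosen, which is the fallback the loading rule is meant to provide if a run dies partway through a save. Deleting older folders only after the new one is fully written means at least one complete checkpoint should always remain on disk.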
Todos
MPI_Barrier is run before saving a checkpoint. This could slow down the program, so an alternative approach would be to also store MPI messages that have not been received so they can be resent. This would likely be a very large effort to properly coordinate which messages are sent and received on checkpoint save and load, so we would probably only want to do this if we notice a significant performance impact from the MPI_Barrier use in checkpointing.
Checklist
Testing checklist (automated report can be put here)
Target Environment support