Closed
10 changes: 10 additions & 0 deletions docs/guides/checkpointing_solutions/emergency_checkpointing.md
@@ -113,6 +113,16 @@ MaxText provides a set of configuration flags to control checkpointing options.
| `local_checkpoint_period` | The interval, in training steps, for how often a **local checkpoint** is saved. This should be set to a much smaller value than `checkpoint_period` for frequent, low-overhead saves. | `integer` | `0` |
| `checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved to **persistent storage**. | `integer` | `10000` |
| `enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage. | `boolean` | `False` |
| `enable_autocheckpoint` | If `True`, enables saving a checkpoint when a preemption signal (SIGTERM) is received. This is a reactive mechanism that saves to persistent storage. | `boolean` | `False` |

### Autocheckpoint vs. Emergency Checkpointing

While both features aim to protect against progress loss, they operate differently:

- **Autocheckpoint (`enable_autocheckpoint`)**: A **reactive** mechanism. When the infrastructure sends a `SIGTERM` signal (indicating imminent preemption or maintenance), MaxText immediately attempts to save a checkpoint to persistent storage (GCS). It is best for handling planned maintenance or preemptions where a short grace period is provided.
- **Emergency Checkpointing (`enable_emergency_checkpoint`)**: A **proactive** mechanism. It saves checkpoints to local, high-speed storage (ramdisk) every `local_checkpoint_period` steps, which is typically far more often than persistent saves. If a failure occurs *without* warning, the job can recover from the most recent local checkpoint. It is best suited to sudden hardware failures.

For maximum reliability, both features can be enabled simultaneously.
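
The difference between the two mechanisms can be illustrated with a minimal sketch. This is not MaxText's implementation: `PreemptionFlag`, `install_sigterm_handler`, `train`, `save_local`, and `save_persistent` are hypothetical names used only to show a reactive save triggered by `SIGTERM` running alongside proactive periodic local saves.

```python
import signal


class PreemptionFlag:
    """Hypothetical helper: records that SIGTERM arrived."""

    def __init__(self):
        self.received = False

    def handler(self, signum, frame):
        # Only set a flag here; the actual save happens at a step boundary.
        self.received = True


def install_sigterm_handler(flag):
    """Register the flag's handler for SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, flag.handler)


def train(num_steps, local_period, save_local, save_persistent, flag, on_step=None):
    """Toy training loop combining both checkpointing styles.

    Returns the last step that was completed.
    """
    for step in range(1, num_steps + 1):
        if on_step is not None:
            on_step(step)  # stand-in for the real training step

        # Proactive: frequent saves to fast local storage.
        if local_period > 0 and step % local_period == 0:
            save_local(step)

        # Reactive: a preemption signal was seen, so save to persistent
        # storage once and stop before the grace period runs out.
        if flag.received:
            save_persistent(step)
            return step
    return num_steps
```

In this sketch, a warning-free crash would lose at most `local_period` steps, because the job restarts from the latest local save. A `SIGTERM` loses none, provided the persistent save completes within the grace period.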

## Workload creation using XPK

1 change: 1 addition & 0 deletions docs/reference/core_concepts/checkpoints.md
@@ -95,6 +95,7 @@ MaxText automatically saves checkpoints periodically during a training run. Thes
Furthermore, MaxText supports emergency checkpointing, which saves a local copy of the checkpoint that can be restored quickly after an interruption.

- `enable_emergency_checkpoint`: A boolean to enable or disable this feature.
- `enable_autocheckpoint`: A boolean to enable or disable saving a checkpoint when a preemption signal (SIGTERM) is received.
- `local_checkpoint_directory`: The local path for storing emergency checkpoints.
- `local_checkpoint_period`: The interval, in training steps, for saving local checkpoints.
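
As a rough illustration of how these flags relate, the sketch below checks a plain dictionary of settings for combinations that are unlikely to be useful. The function `check_checkpoint_flags` is hypothetical and not part of MaxText. The fallback defaults (`0` and `10000`) are those listed for `local_checkpoint_period` and `checkpoint_period` in the emergency checkpointing guide, and the rules themselves are assumptions drawn from that guide's descriptions.

```python
def check_checkpoint_flags(cfg):
    """Return a list of human-readable problems found in `cfg` (hypothetical check)."""
    problems = []
    if cfg.get("enable_emergency_checkpoint", False):
        # Emergency checkpoints need somewhere local to be written.
        if not cfg.get("local_checkpoint_directory"):
            problems.append("local_checkpoint_directory is empty")

        # A non-positive period is treated here as "never save locally".
        local_period = cfg.get("local_checkpoint_period", 0)
        if local_period <= 0:
            problems.append("local_checkpoint_period must be positive")
        # Local saves are meant to be more frequent than persistent ones.
        elif local_period >= cfg.get("checkpoint_period", 10000):
            problems.append(
                "local_checkpoint_period should be smaller than checkpoint_period"
            )
    return problems


example = {
    "enable_emergency_checkpoint": True,
    "enable_autocheckpoint": True,
    "local_checkpoint_directory": "/local",  # hypothetical path
    "local_checkpoint_period": 20,
    "checkpoint_period": 10000,
}
print(check_checkpoint_flags(example))
```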
