diff --git a/docs/guides/checkpointing_solutions/emergency_checkpointing.md b/docs/guides/checkpointing_solutions/emergency_checkpointing.md
index 356b3b2f51..00f9674de5 100644
--- a/docs/guides/checkpointing_solutions/emergency_checkpointing.md
+++ b/docs/guides/checkpointing_solutions/emergency_checkpointing.md
@@ -113,6 +113,16 @@ MaxText provides a set of configuration flags to control checkpointing options.
 | `local_checkpoint_period` | The interval, in training steps, for how often a **local checkpoint** is saved. This should be set to a much smaller value than `checkpoint_period` for frequent, low-overhead saves. | `integer` | `0` |
 | `checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved to **persistent storage**. | `integer` | `10000` |
 | `enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage. | `boolean` | `False` |
+| `enable_autocheckpoint` | If `True`, enables saving a checkpoint when a preemption signal (SIGTERM) is received. This is a reactive mechanism that saves to persistent storage. | `boolean` | `False` |
+
+### Autocheckpoint vs. Emergency Checkpointing
+
+While both features aim to protect against progress loss, they operate differently:
+
+- **Autocheckpoint (`enable_autocheckpoint`)**: A **reactive** mechanism. When the infrastructure sends a `SIGTERM` signal (indicating imminent preemption or maintenance), MaxText immediately attempts to save a checkpoint to persistent storage (GCS). It is best for handling planned maintenance or preemptions where a short grace period is provided.
+- **Emergency Checkpointing (`enable_emergency_checkpoint`)**: A **proactive** mechanism. It saves checkpoints very frequently to local, high-speed storage (ramdisk). If a failure occurs *without* warning, the job can recover from the most recent local checkpoint. It is best for handling sudden hardware failures.
+
+For maximum reliability, both features can be enabled simultaneously.
 
 ## Workload creation using XPK
 
diff --git a/docs/reference/core_concepts/checkpoints.md b/docs/reference/core_concepts/checkpoints.md
index f918cf6dbf..78e1ca972c 100644
--- a/docs/reference/core_concepts/checkpoints.md
+++ b/docs/reference/core_concepts/checkpoints.md
@@ -95,6 +95,7 @@ MaxText automatically saves checkpoints periodically during a training run. Thes
 Furthermore, MaxText supports emergency checkpointing, which saves a local copy of the checkpoint that can be restored quickly after an interruption.
 
 - `enable_emergency_checkpoint`: A boolean to enable or disable this feature.
+- `enable_autocheckpoint`: A boolean to enable or disable saving a checkpoint when a preemption signal (SIGTERM) is received.
 - `local_checkpoint_directory`: The local path for storing emergency checkpoints.
 - `local_checkpoint_period`: The interval, in training steps, for saving local checkpoints.
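
Reviewer note: the reactive/proactive distinction this patch documents can be illustrated with a toy model. The sketch below is not MaxText code. `CheckpointPolicy`, `run`, and the `"persistent-on-sigterm"` label are hypothetical names invented for illustration, and the save calls are replaced by appending to a list. It only assumes what the patch states: local saves every `local_checkpoint_period` steps, persistent saves every `checkpoint_period` steps, and one extra persistent save when `SIGTERM` arrives with `enable_autocheckpoint` on.

```python
import signal


class CheckpointPolicy:
    """Toy model of the two mechanisms (hypothetical; not MaxText internals)."""

    def __init__(self, local_period, persistent_period, autocheckpoint):
        self.local_period = local_period            # stands in for local_checkpoint_period
        self.persistent_period = persistent_period  # stands in for checkpoint_period
        self.autocheckpoint = autocheckpoint        # stands in for enable_autocheckpoint
        self.preempted = False
        self.saves = []  # (step, kind) records standing in for real checkpoint writes

    def on_sigterm(self, signum=None, frame=None):
        # Reactive path: the handler only records that a preemption notice arrived.
        self.preempted = True

    def after_step(self, step):
        """Record any saves due after `step`; return True if training should stop."""
        # Proactive path: frequent saves to fast local storage.
        if self.local_period and step % self.local_period == 0:
            self.saves.append((step, "local"))
        # Regular periodic saves to persistent storage.
        if self.persistent_period and step % self.persistent_period == 0:
            self.saves.append((step, "persistent"))
        # Reactive path: one final persistent save inside the grace period.
        if self.autocheckpoint and self.preempted:
            self.saves.append((step, "persistent-on-sigterm"))
            return True
        return False


def install(policy):
    # In a real process the handler would be registered like this (main thread only).
    signal.signal(signal.SIGTERM, policy.on_sigterm)


def run(policy, num_steps, sigterm_at=None):
    for step in range(1, num_steps + 1):
        if step == sigterm_at:
            policy.on_sigterm()  # simulate the signal instead of sending a real one
        if policy.after_step(step):
            break
    return policy.saves
```

With `local_period=2`, `persistent_period=5`, and a simulated `SIGTERM` at step 7, the model records local saves at steps 2, 4, and 6, a persistent save at step 5, and a final persistent save at step 7. Had the job instead died at step 7 without any signal, the newest recoverable state would be the local save at step 6, which is the case the emergency checkpointing feature targets.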