Push intermediate checkpoints to HuggingFace Hub #766
Rasaboun wants to merge 6 commits into ostris:main
When push_to_hub_every_save is enabled, intermediate checkpoints are pushed to HuggingFace Hub as they are saved during training, not just at the end. This prevents losing work if training crashes before completion. Defaults to false so existing behavior is unchanged.
Prevents a crash if the HuggingFace upload fails after training completes. The model is already saved locally, so a failed push should log an error rather than crash the process.
Document the new option in all example YAML configs so users can discover it alongside the existing push_to_hub setting.
Pull request overview
Adds support for pushing intermediate checkpoints to the HuggingFace Hub during training by introducing a new push_to_hub_every_save save option, and makes end-of-training Hub uploads non-fatal if they fail.
Changes:
- Add `push_to_hub_every_save` to the save config schema (backend + UI types) with a default of `false`.
- Implement mid-training Hub pushing via `BaseSDTrainProcess.post_save_hook`, and wrap Hub pushes in `try/except` to avoid crashing on upload failure.
- Update example training YAMLs to document the new option.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ui/src/types.ts | Adds push_to_hub_every_save to the UI SaveConfig type. |
| ui/src/app/jobs/new/jobConfig.ts | Adds default push_to_hub_every_save: false to the default job config. |
| toolkit/config_modules.py | Adds push_to_hub_every_save to backend SaveConfig. |
| jobs/process/BaseSDTrainProcess.py | Pushes to Hub after each save when enabled; wraps end-of-training push in try/except. |
| config/examples/train_lora_*.yaml | Documents the new push_to_hub_every_save option across examples. |
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
103e248 to 4f97aea
…d efficiency
- Validate hf_repo_id is set when push_to_hub is enabled (fail-fast at config time)
- Use non-blocking token check (HfFolder.get_token) for mid-training pushes instead of interactive interpreter_login, which would block training
- Upload only the new checkpoint file via upload_file() instead of rescanning the entire save_root folder on every intermediate save
- Full folder upload with README still happens at end of training
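The fail-fast validation and the non-blocking token lookup listed in this commit might look roughly like the sketch below. It is not the PR's code: `SaveConfigSketch`, `validate`, and `token_from_env` are hypothetical names, and the token lookup is simplified to reading `HF_TOKEN` from the environment, whereas the commit mentions `HfFolder.get_token`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class SaveConfigSketch:
    """Hypothetical stand-in for the save config; field names follow the PR text."""

    push_to_hub: bool = False
    push_to_hub_every_save: bool = False
    hf_repo_id: Optional[str] = None

    def validate(self) -> None:
        # Fail fast at config time instead of discovering the problem mid-training.
        if self.push_to_hub and not self.hf_repo_id:
            raise ValueError("hf_repo_id must be set when push_to_hub is enabled")


def token_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Non-blocking lookup: returns None instead of prompting for a login."""
    env = os.environ if env is None else env
    return env.get("HF_TOKEN") or None
```

Returning `None` lets the caller skip the upload and keep training instead of waiting on an interactive prompt.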
4f97aea to a59d3bc
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
7629f34 to 463e115
…learer naming
- Add traceback logging to all exception handlers for diagnosability
- Distinguish transient vs permanent errors in create_repo (only disable on 401/403/404)
- Catch HfHubHTTPError on uploads too, disable on auth/permission errors
- Rename _hub_no_token to _hub_push_disabled for semantic clarity
- Move get_token to top-level imports, remove unused Repository/HfFolder imports
- Fix misleading comments and add prerequisite note to YAML configs
- Backfill push_to_hub_every_save in migrateJobConfig for existing saved jobs
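One way to picture the transient-versus-permanent distinction from this commit is the sketch below. It is an illustration under assumptions, not the PR's implementation: `HubHTTPErrorSketch` is a hypothetical stand-in for the Hub client's HTTP error, `HubPusherSketch` and its injected `upload` callable are invented for the example, and only the `_hub_push_disabled` name and the 401/403/404 set come from the commit message.

```python
import logging
import traceback
from typing import Callable, Optional

logger = logging.getLogger("hub_push_sketch")

# Status codes the commit message treats as permanent failures.
PERMANENT_STATUS_CODES = frozenset({401, 403, 404})


class HubHTTPErrorSketch(Exception):
    """Hypothetical stand-in for an HTTP error raised by the Hub client."""

    def __init__(self, status_code: Optional[int]):
        super().__init__(f"Hub request failed with status {status_code}")
        self.status_code = status_code


class HubPusherSketch:
    def __init__(self, upload: Callable[[str], None]):
        self._upload = upload              # injected so the sketch needs no network
        self._hub_push_disabled = False    # name taken from the commit message

    def push(self, path: str) -> bool:
        """Attempt one upload without ever raising; return True on success."""
        if self._hub_push_disabled:
            return False
        try:
            self._upload(path)
            return True
        except HubHTTPErrorSketch as err:
            logger.error("Hub push failed:\n%s", traceback.format_exc())
            # Permanent errors will not fix themselves, so stop retrying.
            if err.status_code in PERMANENT_STATUS_CODES:
                self._hub_push_disabled = True
            return False
        except Exception:
            # Anything else is treated as transient; the next save tries again.
            logger.error("Hub push failed:\n%s", traceback.format_exc())
            return False
```

With this shape, a 503 during one save leaves later saves free to retry, while a 401 silences further attempts for the rest of the run.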
463e115 to 8c68999
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
@jaretburkett What do you think about this feature?
The control_net branch saves to a directory (via save_pretrained) but file_path kept the .safetensors extension, so post_save_hook could never find the checkpoint and silently skipped every intermediate upload.
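The mismatch described here, a directory on disk while the recorded path still ends in `.safetensors`, can be illustrated with the sketch below. `resolve_checkpoint_path` is a hypothetical helper written for this example; the actual fix in the PR may simply correct `file_path` in the control_net branch.

```python
from pathlib import Path
from typing import Optional


def resolve_checkpoint_path(file_path: str) -> Optional[Path]:
    """Return the checkpoint location that actually exists on disk, if any.

    Tries the path as given, then the same path with its suffix removed,
    which covers a save that produced a directory instead of a single file.
    """
    candidate = Path(file_path)
    if candidate.exists():
        return candidate
    without_suffix = candidate.with_suffix("")
    if without_suffix.is_dir():
        return without_suffix
    return None
```

A hook that receives `None` from a lookup like this can log a warning, so a skipped upload is no longer silent.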
Summary
Adds a `push_to_hub_every_save` option so intermediate checkpoints get pushed to HuggingFace Hub as they are saved, not just at the end of training.
Right now if training crashes before completion you lose everything because Hub push only happens after the loop finishes. With this option enabled each checkpoint saved during training also gets uploaded.
Also wraps the end-of-training Hub push in a try/except so a failed upload no longer crashes the process — the model is already saved locally at that point.
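The behaviour summarized above can be sketched as a save hook plus a guarded final push. This is a simplified illustration, not the code in `BaseSDTrainProcess`: `TrainProcessSketch`, the injected `upload` callable, the `finish` method, and the `post_save_hook(save_path)` signature are all assumptions made for the example.

```python
from typing import Callable


class TrainProcessSketch:
    """Hypothetical, heavily simplified stand-in for a training process."""

    def __init__(self, push_to_hub: bool, push_to_hub_every_save: bool,
                 upload: Callable[[str], None]):
        self.push_to_hub = push_to_hub
        self.push_to_hub_every_save = push_to_hub_every_save
        self._upload = upload  # injected so the sketch runs without network access

    def save(self, step: int) -> str:
        path = f"checkpoint_{step}.safetensors"  # pretend the file was written here
        self.post_save_hook(path)
        return path

    def post_save_hook(self, save_path: str) -> None:
        # Intermediate pushes happen only when both options are enabled.
        if self.push_to_hub and self.push_to_hub_every_save:
            self._safe_upload(save_path)

    def finish(self, final_path: str) -> None:
        # The end-of-training push depends only on push_to_hub.
        if self.push_to_hub:
            self._safe_upload(final_path)

    def _safe_upload(self, path: str) -> None:
        try:
            self._upload(path)
        except Exception as err:
            # The checkpoint already exists locally, so report and carry on.
            print(f"Hub push failed for {path}: {err}")
```

With `push_to_hub_every_save` left at `false`, only `finish` uploads, which matches the pre-existing behaviour the PR says it preserves.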
Scope
Applies to all training processes that extend `BaseSDTrainProcess` (LoRA, Slider, SD Rescale, etc.). VAE and ESRGAN training extend `BaseTrainProcess` directly and do not have Hub push support; that is pre-existing and unchanged by this PR.
Usage
Defaults to `false` so existing behavior is unchanged. Requires `push_to_hub: true` and `HF_TOKEN` in the environment (no interactive login prompt during training).
Details
`post_save_hook` extension point in `BaseSDTrainProcess`, which runs after every checkpoint save