Push intermediate checkpoints to HuggingFace Hub #766

Open
Rasaboun wants to merge 6 commits into ostris:main from Rasaboun:feat/intermediate-hub-push

@Rasaboun (Contributor) commented Mar 29, 2026

Summary

Adds a push_to_hub_every_save option so intermediate checkpoints get pushed to HuggingFace Hub as they are saved, not just at the end of training.

Right now, if training crashes before completion, you lose everything, because the Hub push only happens after the training loop finishes. With this option enabled, each checkpoint saved during training is also uploaded.

Also wraps the end-of-training Hub push in a try/except so a failed upload no longer crashes the process — the model is already saved locally at that point.

Scope

Applies to all training processes that extend BaseSDTrainProcess (LoRA, Slider, SD Rescale, etc.). VAE and ESRGAN training extend BaseTrainProcess directly and do not have Hub push support — that is pre-existing and unchanged by this PR.

Usage

save:
  push_to_hub: true
  push_to_hub_every_save: true
  hf_repo_id: your-username/your-model

Defaults to false so existing behavior is unchanged. Requires push_to_hub: true and HF_TOKEN in the environment (no interactive login prompt during training).
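
To illustrate how these three options interact, here is a minimal Python sketch. `SaveConfigSketch` and `should_push_intermediate` are hypothetical names invented for this example, not the actual `SaveConfig` in toolkit/config_modules.py. The fail-fast check on `hf_repo_id` mirrors what a later commit in this PR describes; the rest is an assumption about how the flags could be combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SaveConfigSketch:
    """Hypothetical stand-in for the save options discussed in this PR."""

    push_to_hub: bool = False
    push_to_hub_every_save: bool = False  # new option, off by default
    hf_repo_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Fail fast at config time rather than at the first upload attempt.
        if self.push_to_hub and not self.hf_repo_id:
            raise ValueError("hf_repo_id must be set when push_to_hub is true")

    def should_push_intermediate(self) -> bool:
        # Intermediate pushes only apply when Hub pushing is enabled at all.
        return self.push_to_hub and self.push_to_hub_every_save


cfg = SaveConfigSketch(
    push_to_hub=True,
    push_to_hub_every_save=True,
    hf_repo_id="your-username/your-model",
)
```

With this shape, setting only `push_to_hub_every_save: true` has no effect, which matches the stated requirement that `push_to_hub: true` is also needed.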

Details

  • Uses the existing post_save_hook extension point in BaseSDTrainProcess which runs after every checkpoint save
  • Mid-training and end-of-training Hub pushes are wrapped in try/except to avoid crashing on upload failures
  • Adds the field to the UI TypeScript types and default job config
  • Documents the option in all example YAML configs
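
The hook-based approach in the list above can be sketched roughly as follows. This is not the code in BaseSDTrainProcess: `TrainerSketch`, the injected `uploader` callable, and the checkpoint file name pattern are all assumptions made so the example runs without network access. It only shows the idea that an upload failure inside the post-save hook is logged and training continues.

```python
import traceback
from typing import Callable, List


class TrainerSketch:
    """Hypothetical trainer; only the hook wiring is illustrated."""

    def __init__(self, push_every_save: bool, uploader: Callable[[str], None]):
        self.push_every_save = push_every_save
        self.uploader = uploader  # injected so the sketch needs no network
        self.saved: List[str] = []

    def save(self, step: int) -> str:
        path = f"model_{step:09d}.safetensors"  # illustrative name only
        self.saved.append(path)  # stand-in for writing the checkpoint locally
        self.post_save_hook(path)
        return path

    def post_save_hook(self, save_path: str) -> None:
        if not self.push_every_save:
            return
        try:
            self.uploader(save_path)
        except Exception:
            # The checkpoint is already on disk, so log and keep training.
            traceback.print_exc()


uploaded: List[str] = []
ok_trainer = TrainerSketch(push_every_save=True, uploader=uploaded.append)
ok_trainer.save(100)


def failing_upload(path: str) -> None:
    raise ConnectionError("simulated upload failure")


flaky_trainer = TrainerSketch(push_every_save=True, uploader=failing_upload)
flaky_path = flaky_trainer.save(200)  # prints a traceback, does not raise
```

Catching a broad `Exception` here is a deliberate simplification; a later commit in this PR narrows the handling by distinguishing permanent from transient errors.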

When push_to_hub_every_save is enabled, intermediate checkpoints are
pushed to HuggingFace Hub as they are saved during training, not just
at the end. This prevents losing work if training crashes before
completion.

Defaults to false so existing behavior is unchanged.

Prevents a crash if the HuggingFace upload fails after training
completes. The model is already saved locally, so a failed push
should log an error rather than crash the process.

Document the new option in all example YAML configs so users
can discover it alongside the existing push_to_hub setting.

@Rasaboun marked this pull request as ready for review March 29, 2026 23:46
Copilot AI review requested due to automatic review settings March 29, 2026 23:46

Copilot AI left a comment

Pull request overview

Adds support for pushing intermediate checkpoints to the HuggingFace Hub during training by introducing a new push_to_hub_every_save save option, and makes end-of-training Hub uploads non-fatal if they fail.

Changes:

  • Add push_to_hub_every_save to the save config schema (backend + UI types) with a default of false.
  • Implement mid-training Hub pushing via BaseSDTrainProcess.post_save_hook, and wrap Hub pushes in try/except to avoid crashing on upload failure.
  • Update example training YAMLs to document the new option.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| ui/src/types.ts | Adds push_to_hub_every_save to the UI SaveConfig type. |
| ui/src/app/jobs/new/jobConfig.ts | Adds default push_to_hub_every_save: false to the default job config. |
| toolkit/config_modules.py | Adds push_to_hub_every_save to backend SaveConfig. |
| jobs/process/BaseSDTrainProcess.py | Pushes to Hub after each save when enabled; wraps end-of-training push in try/except. |
| config/examples/train_lora_*.yaml | Documents the new push_to_hub_every_save option across examples. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.


@Rasaboun force-pushed the feat/intermediate-hub-push branch from 103e248 to 4f97aea, March 30, 2026 01:43

…d efficiency

- Validate hf_repo_id is set when push_to_hub is enabled (fail-fast at config time)
- Use non-blocking token check (HfFolder.get_token) for mid-training pushes
  instead of interactive interpreter_login which would block training
- Upload only the new checkpoint file via upload_file() instead of rescanning
  the entire save_root folder on every intermediate save
- Full folder upload with README still happens at end of training
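
The single-file upload with a non-blocking token check described in this commit might look roughly like the sketch below. `push_checkpoint_file` and its injectable `get_token_fn` / `upload_fn` parameters are hypothetical, introduced only so the example can be exercised with fakes. When they are omitted it falls back to `get_token` and `upload_file` from `huggingface_hub`; placing the file at the repo root under its base name is an assumption, not something the PR text states.

```python
import os
from typing import Callable, Optional


def push_checkpoint_file(
    file_path: str,
    repo_id: str,
    get_token_fn: Optional[Callable[[], Optional[str]]] = None,
    upload_fn: Optional[Callable[..., object]] = None,
) -> bool:
    """Upload one checkpoint file; return False without prompting if no token."""
    if get_token_fn is None or upload_fn is None:
        # Imported lazily so the sketch can be run with fakes and no network.
        from huggingface_hub import get_token, upload_file

        get_token_fn = get_token_fn or get_token
        upload_fn = upload_fn or upload_file

    token = get_token_fn()
    if not token:
        # No interactive login here: a prompt would block the training loop.
        print("No Hugging Face token found; skipping intermediate push.")
        return False

    upload_fn(
        path_or_fileobj=file_path,
        path_in_repo=os.path.basename(file_path),  # assumed repo layout
        repo_id=repo_id,
        token=token,
    )
    return True
```

Uploading only the new file keeps each intermediate push proportional to one checkpoint, while the full folder upload with the README is left to the end of training, as the commit message says.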

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.


@Rasaboun force-pushed the feat/intermediate-hub-push branch from 7629f34 to 463e115, March 30, 2026 02:24

…learer naming

- Add traceback logging to all exception handlers for diagnosability
- Distinguish transient vs permanent errors in create_repo (only disable on 401/403/404)
- Catch HfHubHTTPError on uploads too, disable on auth/permission errors
- Rename _hub_no_token to _hub_push_disabled for semantic clarity
- Move get_token to top-level imports, remove unused Repository/HfFolder imports
- Fix misleading comments and add prerequisite note to YAML configs
- Backfill push_to_hub_every_save in migrateJobConfig for existing saved jobs
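
The transient-versus-permanent distinction in this commit can be sketched as a small classifier. The helper names and the fake error classes below are hypothetical. The sketch assumes the raised error exposes an HTTP response with a `status_code` attribute, as HTTP errors from `huggingface_hub` generally do, and it is not the PR's actual handler.

```python
from typing import Optional

# Status codes treated as permanent per the commit message: auth, permission,
# or missing-repo failures that will not succeed on a later retry.
PERMANENT_STATUS_CODES = {401, 403, 404}


def status_code_of(err: Exception) -> Optional[int]:
    """Best-effort extraction of an HTTP status code from an exception."""
    response = getattr(err, "response", None)
    return getattr(response, "status_code", None)


def should_disable_hub_push(err: Exception) -> bool:
    """True when further pushes should be disabled for the rest of the run."""
    return status_code_of(err) in PERMANENT_STATUS_CODES


class FakeResponse:
    """Test double carrying only a status code."""

    def __init__(self, status_code: int):
        self.status_code = status_code


class FakeHTTPError(Exception):
    """Test double shaped like an HTTP error with a response attached."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.response = FakeResponse(status_code)
```

A caller would set a flag such as `_hub_push_disabled` when `should_disable_hub_push` returns true, and otherwise log the error and try again at the next save.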

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.


@Rasaboun (Contributor, Author) commented:

@jaretburkett What do you think about this feature?

The control_net branch saves to a directory (via save_pretrained) but
file_path kept the .safetensors extension, so post_save_hook could
never find the checkpoint and silently skipped every intermediate upload.
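
One way to make a post-save hook tolerant of this file-versus-directory mismatch is sketched below. `resolve_checkpoint` is a hypothetical helper, and falling back to the path without its extension is an assumption for illustration. The PR may instead correct `file_path` at the point where the control_net branch saves.

```python
import os
import tempfile
from typing import Optional, Tuple


def resolve_checkpoint(file_path: str) -> Optional[Tuple[str, str]]:
    """Return (path, kind), kind being "file" or "dir", or None if not found."""
    if os.path.isfile(file_path):
        return file_path, "file"
    if os.path.isdir(file_path):
        return file_path, "dir"
    # A save_pretrained-style save may have written a directory named after
    # the checkpoint without the .safetensors extension (assumed layout).
    stem, ext = os.path.splitext(file_path)
    if ext and os.path.isdir(stem):
        return stem, "dir"
    return None


# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "step_100"))
    demo = resolve_checkpoint(os.path.join(tmp, "step_100.safetensors"))
    demo_kind = demo[1] if demo else None
```

A hook built on this could upload a single file when the kind is "file" and a whole folder when it is "dir", and log a warning rather than skipping silently when nothing is found.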