Push intermediate checkpoints to HuggingFace Hub by Rasaboun · Pull Request #766 · ostris/ai-toolkit

Rasaboun · 2026-03-29T19:46:44Z

Summary

Adds a push_to_hub_every_save option so intermediate checkpoints get pushed to HuggingFace Hub as they are saved, not just at the end of training.

Right now, if training crashes before completion, nothing has reached the Hub, because the push only happens after the training loop finishes. With this option enabled, each checkpoint saved during training is also uploaded.

Also wraps the end-of-training Hub push in a try/except so a failed upload no longer crashes the process — the model is already saved locally at that point.

Scope

Applies to all training processes that extend BaseSDTrainProcess (LoRA, Slider, SD Rescale, etc.). VAE and ESRGAN training extend BaseTrainProcess directly and do not have Hub push support — that is pre-existing and unchanged by this PR.

Usage

save:
  push_to_hub: true
  push_to_hub_every_save: true
  hf_repo_id: your-username/your-model

push_to_hub_every_save defaults to false, so existing behavior is unchanged. It requires push_to_hub: true and HF_TOKEN in the environment, because there is no interactive login prompt during training.
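The prerequisites above can be written as a single check. This is a minimal sketch, not the PR's code: `SaveOptions` and `intermediate_push_enabled` are hypothetical names, and only the three config keys and the `HF_TOKEN` variable come from the description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class SaveOptions:
    """Hypothetical stand-in for the three save settings shown above."""
    push_to_hub: bool = False
    push_to_hub_every_save: bool = False
    hf_repo_id: Optional[str] = None


def intermediate_push_enabled(
    opts: SaveOptions, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True only when every stated prerequisite is met."""
    # Fall back to the real environment when no mapping is injected.
    env = os.environ if env is None else env
    return bool(
        opts.push_to_hub
        and opts.push_to_hub_every_save
        and opts.hf_repo_id
        and env.get("HF_TOKEN")
    )
```

Passing the environment as a mapping keeps the check easy to exercise without touching real credentials.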

Details

- Uses the existing post_save_hook extension point in BaseSDTrainProcess, which runs after every checkpoint save
- Mid-training and end-of-training Hub pushes are wrapped in try/except to avoid crashing on upload failures
- Adds the field to the UI TypeScript types and default job config
- Documents the option in all example YAML configs
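The hook-based design described in these details could look roughly like the following. This is an illustrative sketch only: `TrainerSketch`, its `upload` callable, and the checkpoint file name pattern are assumptions, and the real BaseSDTrainProcess is considerably larger.

```python
from typing import Callable, List


class TrainerSketch:
    """Hypothetical trainer: save() always ends by calling post_save_hook()."""

    def __init__(self, push_every_save: bool, upload: Callable[[str], None]):
        self.push_every_save = push_every_save
        self.upload = upload  # injected so the sketch runs without network access
        self.errors: List[str] = []

    def save(self, step: int) -> str:
        path = f"checkpoint_{step:09d}.safetensors"
        # A real trainer would write the weights to `path` here.
        self.post_save_hook(path)
        return path

    def post_save_hook(self, path: str) -> None:
        if not self.push_every_save:
            return
        try:
            self.upload(path)
        except Exception as exc:  # an upload failure must not stop training
            self.errors.append(f"{path}: {exc}")
```

Because the upload sits behind the hook, the save path itself stays unchanged when the option is off.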

When push_to_hub_every_save is enabled, intermediate checkpoints are pushed to HuggingFace Hub as they are saved during training, not just at the end. This prevents losing work if training crashes before completion. Defaults to false so existing behavior is unchanged.

Copilot

Pull request overview

Adds support for pushing intermediate checkpoints to the HuggingFace Hub during training by introducing a new push_to_hub_every_save save option, and makes end-of-training Hub uploads non-fatal if they fail.

Changes:

- Add push_to_hub_every_save to the save config schema (backend + UI types) with a default of false.
- Implement mid-training Hub pushing via BaseSDTrainProcess.post_save_hook, and wrap Hub pushes in try/except to avoid crashing on upload failure.
- Update example training YAMLs to document the new option.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

File	Description
ui/src/types.ts	Adds `push_to_hub_every_save` to the UI SaveConfig type.
ui/src/app/jobs/new/jobConfig.ts	Adds default `push_to_hub_every_save: false` to the default job config.
toolkit/config_modules.py	Adds `push_to_hub_every_save` to backend `SaveConfig`.
jobs/process/BaseSDTrainProcess.py	Pushes to Hub after each save when enabled; wraps end-of-training push in `try/except`.
config/examples/train_lora_*.yaml	Documents the new `push_to_hub_every_save` option across examples.

toolkit/config_modules.py

jobs/process/BaseSDTrainProcess.py

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

jobs/process/BaseSDTrainProcess.py

…d efficiency
- Validate hf_repo_id is set when push_to_hub is enabled (fail-fast at config time)
- Use non-blocking token check (HfFolder.get_token) for mid-training pushes instead of interactive interpreter_login, which would block training
- Upload only the new checkpoint file via upload_file() instead of rescanning the entire save_root folder on every intermediate save
- Full folder upload with README still happens at end of training
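A rough sketch of the single-file upload described in this commit message, under stated assumptions: `push_single_checkpoint` is a hypothetical helper, and `api` is any object exposing an `upload_file` method with the keyword arguments used by `huggingface_hub.HfApi.upload_file`. The token is passed in rather than fetched, so a missing token simply skips the push instead of prompting.

```python
import os
from typing import Any, Optional


def push_single_checkpoint(
    api: Any, checkpoint_path: str, repo_id: str, token: Optional[str]
) -> bool:
    """Upload one checkpoint file; return False when the push is skipped."""
    if not token:
        return False  # no credentials: skip instead of prompting for a login
    if not os.path.isfile(checkpoint_path):
        return False  # nothing to upload at this path
    api.upload_file(
        path_or_fileobj=checkpoint_path,
        path_in_repo=os.path.basename(checkpoint_path),
        repo_id=repo_id,
        token=token,
    )
    return True
```

In real use the caller would presumably pass `huggingface_hub.HfApi()` as `api`; a stub object works equally well for testing, since only the one method is required.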

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

jobs/process/BaseSDTrainProcess.py

ui/src/app/jobs/new/jobConfig.ts

…learer naming
- Add traceback logging to all exception handlers for diagnosability
- Distinguish transient vs permanent errors in create_repo (only disable on 401/403/404)
- Catch HfHubHTTPError on uploads too, disable on auth/permission errors
- Rename _hub_no_token to _hub_push_disabled for semantic clarity
- Move get_token to top-level imports, remove unused Repository/HfFolder imports
- Fix misleading comments and add prerequisite note to YAML configs
- Backfill push_to_hub_every_save in migrateJobConfig for existing saved jobs
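The transient-versus-permanent distinction in this commit message might be expressed as below. This is a sketch under assumptions: `HubPushState` and `http_status_of` are hypothetical names, and the code only assumes that the caught exception carries a `response` object with a `status_code`, as HTTP errors raised by huggingface_hub generally do.

```python
from typing import Optional

# Status codes the commit message treats as permanent (auth, permission, missing repo).
PERMANENT_STATUS_CODES = frozenset({401, 403, 404})


def http_status_of(exc: BaseException) -> Optional[int]:
    """Best-effort read of an HTTP status code from an exception, else None."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None)


class HubPushState:
    """Hypothetical holder for a flag like the PR's _hub_push_disabled."""

    def __init__(self) -> None:
        self.push_disabled = False

    def record_failure(self, exc: BaseException) -> None:
        # Only permanent errors switch off later pushes; a timeout or a 5xx
        # response leaves the next checkpoint free to try again.
        if http_status_of(exc) in PERMANENT_STATUS_CODES:
            self.push_disabled = True
```

Keeping the flag sticky avoids retrying an upload that can never succeed on every subsequent save.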

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

jobs/process/BaseSDTrainProcess.py

Rasaboun · 2026-03-30T18:26:02Z

@jaretburkett What do you think about this feature?

Rasaboun added 3 commits March 29, 2026 21:44

Wrap end-of-training Hub push in try/except

27315a3

Prevents a crash if the HuggingFace upload fails after training completes. The model is already saved locally, so a failed push should log an error rather than crash the process.

Add push_to_hub_every_save to example configs

fe36faf

Document the new option in all example YAML configs so users can discover it alongside the existing push_to_hub setting.

Rasaboun marked this pull request as ready for review March 29, 2026 23:46

Copilot AI review requested due to automatic review settings March 29, 2026 23:46

Copilot started reviewing on behalf of Rasaboun March 29, 2026 23:46

Copilot AI reviewed Mar 29, 2026

toolkit/config_modules.py (outdated, resolved)

jobs/process/BaseSDTrainProcess.py (outdated, resolved)

jobs/process/BaseSDTrainProcess.py (outdated, resolved)

jobs/process/BaseSDTrainProcess.py (outdated, resolved)

Rasaboun requested a review from Copilot March 30, 2026 01:33

Copilot started reviewing on behalf of Rasaboun March 30, 2026 01:34

Copilot AI reviewed Mar 30, 2026

jobs/process/BaseSDTrainProcess.py (resolved)

Rasaboun force-pushed the feat/intermediate-hub-push branch from 103e248 to 4f97aea March 30, 2026 01:43

Rasaboun force-pushed the feat/intermediate-hub-push branch from 4f97aea to a59d3bc March 30, 2026 01:55

Rasaboun requested a review from Copilot March 30, 2026 02:12

Copilot started reviewing on behalf of Rasaboun March 30, 2026 02:12

Copilot AI reviewed Mar 30, 2026

jobs/process/BaseSDTrainProcess.py (resolved)

jobs/process/BaseSDTrainProcess.py (outdated, resolved)

ui/src/app/jobs/new/jobConfig.ts (resolved)

Rasaboun force-pushed the feat/intermediate-hub-push branch from 7629f34 to 463e115 March 30, 2026 02:24

Rasaboun force-pushed the feat/intermediate-hub-push branch from 463e115 to 8c68999 March 30, 2026 02:31

Rasaboun requested a review from Copilot March 30, 2026 02:31

Copilot started reviewing on behalf of Rasaboun March 30, 2026 02:32

Copilot AI reviewed Mar 30, 2026

jobs/process/BaseSDTrainProcess.py (resolved)

Fix checkpoint_path for control_net adapter Hub push

1bb359c

The control_net branch saves to a directory (via save_pretrained) but file_path kept the .safetensors extension, so post_save_hook could never find the checkpoint and silently skipped every intermediate upload.
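The mismatch described in this commit can be illustrated with a small path helper. This is a guess at the shape of the fix, not the actual diff: `checkpoint_path_for` is a hypothetical name, and stripping the suffix is an assumption about how the directory name is derived.

```python
CHECKPOINT_SUFFIX = ".safetensors"


def checkpoint_path_for(file_path: str, saved_as_directory: bool) -> str:
    """Return the path that actually exists on disk after a save.

    Adapters written with save_pretrained land in a directory, so the
    file-style suffix is dropped for them (assumed naming scheme).
    """
    if saved_as_directory and file_path.endswith(CHECKPOINT_SUFFIX):
        return file_path[: -len(CHECKPOINT_SUFFIX)]
    return file_path
```

Handing this resolved path to the post-save hook lets an existence check succeed for both file and directory checkpoints.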

Rasaboun wants to merge 6 commits into ostris:main from Rasaboun:feat/intermediate-hub-push

2 participants
