Add VAE-based dimensionality reduction to cluster_seqlets#45

Open
DadaAb wants to merge 1 commit into main from dim_reduction_with_vae

Conversation

@DadaAb (Collaborator) commented Apr 17, 2026

Summary

Adds a non-linear dimensionality reduction option (reduction="vae") to cluster_seqlets as an alternative to PCA. Everything downstream (neighbour graph, t-SNE, Leiden, DBD annotation) is unchanged.

Motivation

PCA is linear and may not capture the full structure of seqlet contribution score matrices, particularly when patterns lie on a non-linear manifold. A β-VAE learns a compact latent representation that can better separate motif families before Leiden clustering.

Changes

src/tfmindi/tl/vae.py (new file)

  • Implements fit_vae_latents(X, ...) — trains a β-VAE on the seqlet similarity matrix and returns the posterior means as a (n_seqlets, latent_dim) float32 array
  • MLP encoder/decoder with ReLU + Dropout; reparameterisation trick for sampling
  • KL summed over the latent dimensions, then averaged over the batch (the standard β-VAE reduction)
  • logvar clamped to [-4, 15] to prevent overflow early in training
  • Both the numpy and torch RNGs are seeded (np.random.seed and torch.manual_seed) for full reproducibility
  • AMP + GradScaler enabled automatically on CUDA, silently disabled on CPU
  • Lazy torch import — no hard dependency unless reduction="vae" is actually used
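
To make the loss conventions concrete, here is a NumPy sketch of the KL term, logvar clamp, and reparameterisation described above (illustrative only; the real code in src/tfmindi/tl/vae.py operates on torch tensors):

```python
import numpy as np

def kl_per_sample(mu, logvar):
    """KL(q || N(0, I)) summed over the latent dims, one value per sample."""
    logvar = np.clip(logvar, -4.0, 15.0)  # clamp so np.exp cannot overflow
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps; in the real model this keeps sampling differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * np.clip(logvar, -4.0, 15.0)) * eps

rng = np.random.default_rng(42)
mu = rng.normal(size=(8, 10)).astype(np.float32)
logvar = rng.normal(size=(8, 10)).astype(np.float32)
kl = kl_per_sample(mu, logvar)     # shape (8,): summed over latent_dim
loss_kl_term = 0.01 * np.mean(kl)  # beta * batch-mean KL, matching the PR's reduction
```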

src/tfmindi/tl/cluster.py

  • Added reduction: str = "pca" and vae_kwargs: dict | None = None parameters (keyword-only, fully backwards compatible)
  • VAE branch stores embedding in adata.obsm["X_vae"] and passes it as use_rep to neighbours and t-SNE, consistent with how X_pca is handled
  • recompute=False skips VAE training if X_vae already present, same caching behaviour as PCA
  • Device selection respects the existing _using_gpu flag: device="cuda" when the GPU backend is active, "auto" otherwise

pyproject.toml

  • Added [vae] optional dependency group: torch>=2.0, installable via pip install tfmindi[vae]

Tested

Run on a combined human + mouse dataset of 1,149,067 seqlets × 17,995 features on CPU:

Training VAE on cpu (latent_dim=10, epochs=15, beta=0.01)...
  epoch   1/15  loss=0.9186  recon=0.9040  kl=1.4568
  epoch   5/15  loss=0.4835  recon=0.3945  kl=8.9024
  epoch  10/15  loss=0.4511  recon=0.3585  kl=9.2634
  epoch  15/15  loss=0.4377  recon=0.3428  kl=9.4936
VAE embedding complete. Shape: (1149067, 10)
Computing neighborhood graph (use_rep='X_vae')...  [1min 58s]
Computing t-SNE embedding (use_rep='X_vae')...     [1h 12min]

For reference, PCA on the same dataset took 1h 9min. The VAE was run for only 15 epochs here as a proof of concept; 50 is the recommended default. GPU execution is implemented and should work, but has not been tested.

Usage

# Default behaviour unchanged
tm.tl.cluster_seqlets(adata, resolution=3.0)

# VAE alternative
tm.tl.cluster_seqlets(adata, resolution=3.0, reduction="vae")

# With custom settings
tm.tl.cluster_seqlets(
    adata,
    resolution=3.0,
    reduction="vae",
    vae_kwargs=dict(latent_dim=10, epochs=50, beta=0.1),
)

# Re-cluster at a different resolution — VAE training is skipped (X_vae cached)
tm.tl.cluster_seqlets(adata, resolution=5.0, reduction="vae")

Known limitations / future work

  • GPU path is implemented but untested — feedback welcome
  • VAE hyperparameters (latent_dim, beta, epochs) currently require manual tuning; automated hyperparameter search could be added in a follow-up
  • No quantitative comparison of cluster quality between PCA and VAE is included yet
