Add VAE-based dimensionality reduction to `cluster_seqlets` #45
Summary
Adds a non-linear dimensionality reduction option (`reduction="vae"`) to `cluster_seqlets` as an alternative to PCA. Everything downstream (neighbour graph, t-SNE, Leiden, DBD annotation) is unchanged.

Motivation
PCA is linear and may not capture the full structure of seqlet contribution score matrices, particularly when patterns lie on a non-linear manifold. A β-VAE learns a compact latent representation that can better separate motif families before Leiden clustering.
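Concretely, a β-VAE trades reconstruction accuracy against a β-weighted KL divergence that pushes the latent posterior towards a standard normal prior. A minimal numpy illustration of the objective (not the PR's torch implementation; function and variable names here are for exposition only):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Illustrative beta-VAE objective: per-sample reconstruction error
    plus beta times the KL divergence between the approximate posterior
    N(mu, exp(logvar)) and the standard normal prior."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    kl = np.mean(-0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1))
    return recon + beta * kl

# When the posterior matches the prior (mu=0, logvar=0) the KL term
# vanishes, so a perfect reconstruction gives a loss of exactly zero.
mu = np.zeros((8, 2)); logvar = np.zeros((8, 2))
x = np.ones((8, 5)); x_recon = np.ones((8, 5))
print(beta_vae_loss(x, x_recon, mu, logvar))  # → 0.0
```

Larger β values penalise deviation from the prior more strongly, which encourages a more disentangled (but lossier) latent space.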
Changes
src/tfmindi/tl/vae.py (new file)

- `fit_vae_latents(X, ...)` — trains a β-VAE on the seqlet similarity matrix and returns the posterior means as a `(n_seqlets, latent_dim)` float32 array
- `beta` is scale-independent of `latent_dim`
- `logvar` clamped to `[-4, 15]` to prevent overflow early in training
- `torch.manual_seed` seeded for full reproducibility
- `GradScaler` enabled automatically on CUDA, silently disabled on CPU
- deferred `torch` import — no hard dependency unless `reduction="vae"` is actually used

src/tfmindi/tl/cluster.py

- `reduction: str = "pca"` and `vae_kwargs: dict | None = None` parameters (keyword-only, fully backwards compatible)
- latents stored in `adata.obsm["X_vae"]` and passed as `use_rep` to neighbours and t-SNE, consistent with how `X_pca` is handled
- `recompute=False` skips VAE training if `X_vae` is already present, same caching behaviour as PCA
- `_using_gpu` flag respected: `device="cuda"` when the GPU backend is active, `"auto"` otherwise

pyproject.toml

- `[vae]` optional dependency group: `torch>=2.0`, installable via `pip install tfmindi[vae]`

Tested
Run on a combined human + mouse dataset of 1,149,067 seqlets × 17,995 features on CPU.
For reference, PCA on the same dataset took 1h 9min (the VAE was run for only 15 epochs here as a proof of concept; 50 is the recommended default). GPU execution is implemented and should work but has not been tested.
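The `logvar` clamp listed in the changes guards the KL term against numerical overflow: early in training the encoder can emit extreme log-variances, and `exp(logvar)` then overflows to infinity. A small numpy illustration of the idea (the actual code operates on torch tensors; only the `[-4, 15]` range is taken from this PR):

```python
import numpy as np

# Clamping logvar to [-4, 15] bounds the variance to [e^-4, e^15],
# so exp(logvar) stays finite and the KL term remains well-defined.
LOGVAR_MIN, LOGVAR_MAX = -4.0, 15.0

def clamp_logvar(logvar):
    return np.clip(logvar, LOGVAR_MIN, LOGVAR_MAX)

raw = np.array([-50.0, 0.0, 800.0])   # exp(800) would overflow float64
safe = np.exp(clamp_logvar(raw))
print(np.isfinite(safe).all())        # → True
```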
Usage
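A hedged sketch of how the new option is invoked, based only on the parameters described in this PR (the module path and the `vae_kwargs` key names are assumptions, not the verified API):

```python
import tfmindi

# Default behaviour is unchanged: PCA-based reduction.
tfmindi.tl.cluster_seqlets(adata)

# Opt in to the beta-VAE reduction; vae_kwargs is forwarded to the
# VAE trainer (key names here are illustrative).
tfmindi.tl.cluster_seqlets(
    adata,
    reduction="vae",
    vae_kwargs={"latent_dim": 32, "beta": 4.0, "epochs": 50},
)
```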
Known limitations / future work
- VAE hyperparameters (`latent_dim`, `beta`, `epochs`) currently require manual tuning; automated hyperparameter search could be added in a follow-up