Skip to content

Feature: dimer detector.#16

Open
SeppeDeWinter wants to merge 53 commits intomainfrom
dimer_detector
Open

Feature: dimer detector.#16
SeppeDeWinter wants to merge 53 commits intomainfrom
dimer_detector

Conversation

@SeppeDeWinter
Copy link
Copy Markdown
Collaborator

Add functionalities to detect dimers (and potentially multimers)

Example usage

Load pre-processed anndata object.

import logomaker  # type: ignore
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd  # type: ignore
import scanpy as sc  # type: ignore
import seaborn as sns  # type: ignore
from tqdm import tqdm  # type: ignore

import tfmindi as tm  # type: ignore
from tfmindi.tl import distance_bias  # type: ignore

with open("paper/sampled_motifs.txt") as f:
    motif_names = [line.strip() for line in f.readlines()]


motif_collection = tm.load_motif_collection(tm.fetch_motif_collection(), motif_names=motif_names)

motif_annotations = tm.load_motif_annotations(tm.fetch_motif_annotations())

motif_to_db = tm.load_motif_to_dbd(motif_annotations)

adata = tm.io.load_h5ad(
    <PATH_TO_ANNDATA>
)

In this dataset I know there are SOX dimers, however few are detected.

dbd_to_color = {
    dbd: plt.cm.tab20(i)  # type: ignore
    for i, dbd in enumerate(adata.obs["cluster_dbd"].value_counts().index)
}

dbd_to_color[np.nan] = "black"

fig, ax = plt.subplots(figsize=(16, 16))
ax.scatter(
    adata.obsm["X_tsne"][:, 0],
    adata.obsm["X_tsne"][:, 1],
    c=[dbd_to_color[dbd] for dbd in adata.obs["cluster_dbd"]],
    s=5,
)
ax.set_axis_off()
for dbd, color in dbd_to_color.items():
    ax.scatter([], [], color=color, label=dbd)
ax.legend()
fig.tight_layout()
fig.savefig("test_dist_bias_plots/seqlets_tsne_cluster_dbd.png", dpi=300)
image

Basically the HMG/Sox cluster in the upper right corner contains dimers at this point.

Dimers (or multimers, I call it fixed_distance_bias in general) are detected from patterns.
Note, the Tomtom procedure for generating patterns can take parts of sequences outside the actual called seqlet
(i.e. the patterns often look nice but do not always entirely represent the seqlet as shown in the tSNE). Using the k-mer approach only the seqlets themselves are considered. This is what I will use here. We can consider changing the tomtom approach to also do this.

In this step it is important to not use a subset of seqlets, only the seqlets in the patterns are considered for detecting distance bias.

patterns = tm.tl.create_patterns(adata, method="kmer")

Next we detect the distance bias. This is done by aggregating the contribution score across all seqlets within a pattern (these seqlets are already aligned to each other) and considering a window (20bp in this example) around the seqlet. In this window we will try to detect peaks corresponding to another TFBS that always occurs at a fixed distance relative to the pattern instances.

bias_detectors = {
    k: distance_bias.detect_fixed_distance_bias(adata=adata, pattern=pattern, window=20)
    for k, pattern in patterns.items()
}

for pattern in tqdm(bias_detectors):
    bias_detectors[pattern].detect_distance_bias()
    profile, pattern_location, peaks = bias_detectors[pattern].profile_plot_data
    fig, ax = plt.subplots()
    ax.plot(np.arange(profile.shape[0]), profile, color="black")
    for start_end in pattern_location:
        # plot location of the pattern
        ax.axvline(start_end, color="red")
    for peak_left, peak_right in peaks:
        # plot location of identified peaks
        ax.axvline(peak_left, color="orange")
        ax.axvline(peak_right, color="orange")
    fig.savefig(f"test_dist_bias_plots/profile_pattern_{pattern}.png", dpi=300)

This is an example of a pattern that has another TFBS near it. the location of the pattern instances are indicated by the red line, the called peak(s) are indicated using the orange line.

image

This is an example of a pattern that has no TFBS near it (at a fixed distance), in this case no peaks are called.

image

The contribution score per seqlet instance can also be visualised using a heatmap. This plot show the z-score (along each row) of the contribution score.

contribution, pattern_location, peaks = bias_detectors["0"].heatmap_plot_data
print(bias_detectors["0"].up_downstream_window)
# 9, 0 <---the instances of this pattern will be extended by 9 bp upstream in the next code block.

fig, ax = plt.subplots(figsize=(4, 10))
sns.heatmap(contribution, cmap="gray_r", robust=True, ax=ax, yticklabels=False, vmin=0.5)
for start_end in pattern_location:
    # plot location of the pattern
    ax.axvline(start_end, color="red")
for peak_left, peak_right in peaks:
    # plot location of identified peaks
    ax.axvline(peak_left, color="orange")
    ax.axvline(peak_right, color="orange")
fig.savefig("test_dist_bias_plots/heatmap_pattern_0.png", dpi=300)
image

The called peaks will be used to extend the seqlets. Also we will take care of removing overlapping seqlets after performing the extension (it might be that a dimer was called as two separate seqlets for example, we want only a single seqlet), overlaps are detected using ncls.

The function below will take care of this. An important parameter in this function is the threshold value. This decides for each seqlet instance whether or not another binding site is near it. For this the maximum z-score within the detected peak window (orange lines above) is considered. Note that I put vmin=0.5, this same value will be used as threshold.

I also add an additional 10bp flanks to each called seqlet, this helps with generating nice patterns later-on.

new_seqlet_df, new_seqlet_matrices = distance_bias.create_seqlet_matrices_with_distance_bias(
    adata=adata, fixed_distance_bias_detectors=bias_detectors.values(), threshold=0.5, extra_flanks_to_add=10
)

Now we can generate a new similarity matrix and perform clustering again.

new_similarity_matrix = tm.pp.calculate_motif_similarity(new_seqlet_matrices, motif_collection, chunk_size=10000)

# Create AnnData object for analysis
new_adata = tm.pp.create_seqlet_adata(
    new_similarity_matrix,
    new_seqlet_df,
    seqlet_matrices=new_seqlet_matrices,
    oh_sequences=adata.uns["unique_examples"]["oh"],
    contrib_scores=adata.uns["unique_examples"]["contrib"],
    motif_collection=motif_collection,
    motif_annotations=motif_annotations,
    motif_to_dbd=motif_to_db,
)

sc.tl.pca(
    new_adata
)  # running pca with default params is much faster, will update the code to allow the user to choose the algorithm.
tm.tl.cluster_seqlets(new_adata, resolution=3)

dbd_to_color = {
    dbd: plt.cm.tab20(i)  # type: ignore
    for i, dbd in enumerate(new_adata.obs["cluster_dbd"].value_counts().index)
}

dbd_to_color[np.nan] = "black"

fig, ax = plt.subplots(figsize=(16, 16))
ax.scatter(
    new_adata.obsm["X_tsne"][:, 0],
    new_adata.obsm["X_tsne"][:, 1],
    c=[dbd_to_color[dbd] for dbd in new_adata.obs["cluster_dbd"]],
    s=5,
)
ax.set_axis_off()
for dbd, color in dbd_to_color.items():
    ax.scatter([], [], color=color, label=dbd)
ax.legend()
fig.tight_layout()
fig.savefig("test_dist_bias_plots/seqlets_tsne_cluster_dbd_w_dimer.png", dpi=300)

In this plot I annotated Sox dimers (dark blue) manually by inspecting the patterns. Note that we have many more dimers now.

image

@LukasMahieu
Copy link
Copy Markdown
Collaborator

Nice, these look good. I'm testing on the full tutorial data and seems to work there too.
I'll write a small tutorial with the text from this PR. I'll also add the plots you created here as functions to the package in tfmindi.pl.
I do wonder if the class design is the cleanest approach here though. Maybe a more functional approach with a dataclass to store the results in would make this a bit more readable (e.g. like we have a Pattern and Seqlet we could have a DistanceBias object)?

What do you think? I could make these changes tomorrow.

Something like this

import tfmindi as tm

adata = tm.load_h5ad("seqlets_clustered.h5ad")
patterns = tm.tl.create_patterns(adata, method="kmer")

# detect distance bias
bias_results = {}
for k, pattern in patterns.items():
    bias_results[k] = tm.tl.detect_distance_bias(
        adata=adata,
        pattern=pattern,
        window=20,
        height=0.25
    ) # returns DistanceBias objects

# add plotting functions
for pattern_id, result in bias_results.items():
    if result.has_bias:
        # Profile plot
        fig1 = tm.pl.distance_bias_profile(result, title=f"Pattern {pattern_id}")

        # Heatmap plot  
        fig2 = tm.pl.distance_bias_heatmap(result, figsize=(4, 10))

# extend seqlets
results_with_bias = [r for r in bias_results.values() if r.has_bias]
new_seqlets_df, new_seqlet_matrices = tm.tl.extend_seqlets_with_bias(
    adata=adata,
    bias_results=results_with_bias,
    threshold=0.5,
    extra_flanks=10
)

@SeppeDeWinter
Copy link
Copy Markdown
Collaborator Author

Yep makes sense too, I guess that fits better with the reset of the package indeed.

@LukasMahieu
Copy link
Copy Markdown
Collaborator

@SeppeDeWinter did the above mentioned mini-refactor and wrote the tutorial (see docs/notebooks for the tutorial).
Also, I added one function pl.pattern_logo to plot the logo of a single pattern since I needed this for the tutorial.

One thing before this can be merged: at the end of that tutorial I perform clustering as well as pattern detection again and plot the results. I would have hoped that the extended seqlets would still be detected as a SOX motif, and that the full SOX dimer would now show up in the pattern logo plogo. However, this doesn't seem to be the case: we still get a separate SOX monomer showing up in the plot and no dimer. Any idea why this is the case? Is this because I did not perform manual annotation and there is no SOX-dimer in the motif collection, so our new sox-dimer motifs get assigned to an "incorrect" cluster?

@LukasMahieu
Copy link
Copy Markdown
Collaborator

@SeppeDeWinter I tried a bunch of different things but I'm still stuck here, so would be good if you can take another crack at it when you have time (see the tutorial notebook for what goes wrong).
I went through the code and to me everything seems to be correct (the correct seqlets get extended in the correct direction).
Yet, after calculating motif similarities and re-clustering, I'm back to the beginning. I tried different clustering resolutions and extra flanks when extending seqlets, yet that doesn't seem to help in finding the dimer...

@SeppeDeWinter
Copy link
Copy Markdown
Collaborator Author

@LukasMahieu I think your threshold value was too stringent.

Running, using vmin=1 (the default threshold)

tm.pl.distance_bias_heatmap(result, title=f"Pattern {pattern_id} - {patterns[pattern_id].dbd}", x_label_rotation=90, vmin = 1)
image

there is very low signal in the "orange peak" .

this is with vmin=0 instead

image

Running extend_biased_seqlets with threshold set to 0 does produce dimer motifs.
I have modified the code in the notebook to only return the seqlets that get modified (for easier debugging). I will send you this notebook on Slack.

These are the detected SOX motifs

image

--> many SOX dimer.

That being said, we might consider not thresholding at all and just extending all seqlet instances for the clusters for which we detect the distance bias?

I have not tried reclustering all the seqlets, but I suppose that will also work (will try now)

LukasMahieu and others added 6 commits October 24, 2025 11:06
filter legend based on colors in adata.obs[color_by]
- concatentation of TF-MInDi anndata objects while preserving adata.var and adata.uns["unique_examples"]
- Has option `idx_match` so user can specify whether index columns in adata.obs ("example_oh_idx", "example_contrib_idx", "example_idx") refer to the same data across adatas or not

Related to issue:

Better user experience for concatenating multiple TF-MInDi objects.
Fixes #22
LukasMahieu and others added 5 commits November 10, 2025 10:17
Update tfmindi description and add overview figure
- concatentation of TF-MInDi anndata objects while preserving adata.var and adata.uns["unique_examples"]
- Has option `idx_match` so user can specify whether index columns in adata.obs ("example_oh_idx", "example_contrib_idx", "example_idx") refer to the same data across adatas or not

Related to issue:

Better user experience for concatenating multiple TF-MInDi objects.
Fixes #22
SeppeDeWinter and others added 30 commits November 19, 2025 19:14
Replace MIT License with Academic Non-commercial License
change logo paths and add logo to readthedocs
change heights logos and fix build
- concatentation of TF-MInDi anndata objects while preserving adata.var and adata.uns["unique_examples"]
- Has option `idx_match` so user can specify whether index columns in adata.obs ("example_oh_idx", "example_contrib_idx", "example_idx") refer to the same data across adatas or not

Related to issue:

Better user experience for concatenating multiple TF-MInDi objects.
Fixes #22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants