Bugfix seqlet assignment by LukasMahieu · Pull Request #44 · aertslab/TF-MINDI

LukasMahieu · 2026-04-13T14:58:51Z

The Bug

We noticed that, across many TF-MINDI runs, seqlets were often misclassified as "GATA".

The root cause of this bug was twofold:

Wrong motif → DBD mapping. load_motif_to_dbd collapsed every motif's TF list to a single DBD via mode().iat[0]. For a true dimer motif like tfdimers__MD00378 (annotated NFIC, NFIA, E2F1), this silently picked one family by alphabetical tiebreak instead of flagging it as a mixed-family composite.
Winner-takes-all per-seqlet assignment. Each seqlet's DBD was set from the single argmax motif in its TomTom row. One noisy top hit dominated, with no "I don't know" option.

For example:

MotifID | Direct_annot | Motif_similarity_annot | Orthology_annot | Motif_similarity_and_Orthology_annot
tfdimers__MD00226 | MYB, ETS2 | GATA3, OTX1, GATA4, IRF1, GATA2, NEUROD1, NFAT... | NaN | NaN
tfdimers__MD00378 | NFIC, E2F1, NFIA | GATA3, PAX4, ZEB1, YY1, GATA4, USF1, GATA2, NF... | NaN | NaN
tfdimers__MD00564 | NFIC, YY1, NFIA | CRX, GATA3, GATA4, USF1, GATA2, RAX, GATA5, NF... | NaN | NaN

All three dimers were assigned to "GATA" instead of being labelled from their Direct_annot.
We used to simply concatenate all these annotations and take a majority vote:

motif_to_tf = (
    motif_to_tf.apply(lambda row: ", ".join(row.dropna()), axis=1)
    .str.split(", ")
    .explode()
    .reset_index()
    .rename({0: "TF"}, axis=1)
)

# For each motif, take the most common (mode) DBD annotation
motif_to_dbd = (
    motif_to_tf.dropna()
    .groupby("MotifID")["DBD"]
    .agg(lambda x: x.mode().iat[0])  # take the first mode if there's a tie
    .reset_index()
)

In this scenario, since GATA appeared 6 times (GATA1, GATA2, ...), it always won these majority votes.
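To make the dilution effect concrete, here is a minimal standard-library sketch of the old pooled vote. The TF lists are the visible (non-truncated) entries for tfdimers__MD00378 from the table above; the family labels other than "GATA" are placeholders invented for illustration, not values from the real human TF table, and pooled_mode is a hypothetical stand-in for the pandas pipeline.

```python
from collections import Counter

# Placeholder TF -> DBD family labels (assumed for illustration only).
TF_TO_DBD = {
    "NFIC": "NFI-like", "NFIA": "NFI-like", "E2F1": "E2F-like",
    "GATA2": "GATA", "GATA3": "GATA", "GATA4": "GATA",
    "PAX4": "PAX-like", "ZEB1": "ZEB-like", "YY1": "YY1-like", "USF1": "USF-like",
}

direct = ["NFIC", "E2F1", "NFIA"]
similarity = ["GATA3", "PAX4", "ZEB1", "YY1", "GATA4", "USF1", "GATA2"]


def pooled_mode(*tf_lists):
    """Old behaviour in miniature: pool all given tiers, then take the first mode."""
    counts = Counter(TF_TO_DBD[tf] for tfs in tf_lists for tf in tfs)
    top = max(counts.values())
    # pandas Series.mode() returns tied modes in sorted order,
    # so .iat[0] corresponds to the alphabetically first one.
    return sorted(fam for fam, n in counts.items() if n == top)[0]


pooled = pooled_mode(direct, similarity)  # similarity-only GATA hits outvote the direct TFs
direct_only = pooled_mode(direct)         # even here the E2F partner is silently dropped
```

With these toy inputs the pooled vote returns "GATA" although no GATA factor appears in Direct_annot, and the direct-only vote still hides the fact that the motif is a mixed-family dimer.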

What changed

Seqlet assignment

To understand what changed, it's easiest to inspect the new unit tests in tests/test_seqlet_assignment.py.

For each motif, we walk the four annotation columns in priority order:

Direct_annot
Orthology_annot
Motif_similarity_annot
Motif_similarity_and_Orthology_annot

At each tier we do the same three steps:

Parse the comma-separated TF list in that cell (skip if empty/NaN).
Look up each TF's DBD family in the human TF table (TFs not in the table are silently skipped).
Collapse the result into a set of distinct families.

Then we decide:

Set size | Tier | Decision
0 (no TFs matched) | any | skip, try next tier
1 | any | label = that family, stop walking
>1 | Direct_annot | label = "Composite", stop walking
>1 | any lower tier | ambiguous — fall through to next tier

So tfdimers__MD00378 now correctly becomes Composite, and things like GATA1..GATA6 listed under similarity still collapse to GATA.
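The tier walk and decision table above can be sketched roughly as follows. This is not the code from the PR: resolve_motif_dbd, is_missing and the tf_to_dbd dictionary are hypothetical names, None stands in for NaN, and returning None when every tier is exhausted is an assumption, since the description does not say what happens in that case.

```python
import math

# Annotation columns in priority order, as listed in the PR description.
TIERS = [
    "Direct_annot",
    "Orthology_annot",
    "Motif_similarity_annot",
    "Motif_similarity_and_Orthology_annot",
]


def is_missing(cell):
    """True for None, NaN, or an empty/whitespace-only string."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return isinstance(cell, str) and not cell.strip()


def resolve_motif_dbd(row, tf_to_dbd):
    """Walk the tiers for one motif row (a dict-like) and return a DBD label."""
    for tier in TIERS:
        cell = row.get(tier)
        if is_missing(cell):
            continue
        tfs = [tf.strip() for tf in cell.split(",")]
        # TFs absent from the lookup table are silently skipped.
        families = {tf_to_dbd[tf] for tf in tfs if tf in tf_to_dbd}
        if len(families) == 1:
            return next(iter(families))
        if len(families) > 1 and tier == "Direct_annot":
            return "Composite"
        # Zero matches, or an ambiguous lower tier: fall through to the next tier.
    return None  # assumption: nothing resolvable means "unknown"


# Placeholder family labels, for illustration only.
tf_to_dbd = {"NFIC": "NFI-like", "NFIA": "NFI-like", "E2F1": "E2F-like",
             "GATA1": "GATA", "GATA2": "GATA", "GATA3": "GATA"}

dimer = resolve_motif_dbd(
    {"Direct_annot": "NFIC, E2F1, NFIA", "Motif_similarity_annot": "GATA3, GATA2"},
    tf_to_dbd,
)
paralogs = resolve_motif_dbd(
    {"Direct_annot": float("nan"), "Motif_similarity_annot": "GATA1, GATA2, GATA3"},
    tf_to_dbd,
)
```

Here dimer resolves to "Composite" because the direct tier already holds two families, while paralogs resolves to "GATA" because the similarity tier collapses to a single family.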

Per-seqlet DBD via top-K weighted vote (cluster.py)

Replaced the argmax assignment with a sparse-aware top-K weighted vote (new helper _vote_dbd). For each seqlet:

Take the top K motifs by similarity score.
Drop any that are NaN or Composite.
Sum the similarity scores per DBD family.
The winner must hold at least dbd_vote_min_share of the total; otherwise the seqlet is labelled NaN.

Two new optional kwargs on cluster_seqlets: top_k_motifs=5, dbd_vote_min_share=0.4.
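A rough sketch of the voting steps, for readers who want to reason about the thresholds. It is not the actual _vote_dbd helper: vote_dbd_sketch and its argument layout are hypothetical, a plain dict of non-zero scores stands in for a sparse TomTom row, None stands in for NaN, higher scores are assumed to mean more similar, and "the total" is assumed to be the sum of scores that survive the NaN/Composite filter.

```python
from collections import defaultdict


def vote_dbd_sketch(scores, motif_dbd, top_k=5, min_share=0.4):
    """Return the winning DBD family for one seqlet, or None if no clear winner.

    scores:    {motif_id: similarity score} for the seqlet's non-zero hits
    motif_dbd: {motif_id: DBD family, "Composite", or None}
    """
    # 1. Keep only the top-K motifs by similarity score.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # 2. Drop unknown and Composite motifs; 3. sum scores per family.
    totals = defaultdict(float)
    for motif_id, score in top:
        family = motif_dbd.get(motif_id)
        if family is None or family == "Composite":
            continue
        totals[family] += score

    total = sum(totals.values())
    if total <= 0:
        return None  # nothing informative left to vote on

    # 4. The winner must hold at least min_share of the remaining mass.
    winner, best = max(totals.items(), key=lambda kv: kv[1])
    return winner if best / total >= min_share else None


motif_dbd = {"m1": "GATA", "m2": "Composite", "m3": "GATA", "m4": "ETS"}
clear = vote_dbd_sketch({"m1": 0.9, "m2": 0.8, "m3": 0.7, "m4": 0.1}, motif_dbd)

five_way = {f"f{i}": f"Family{i}" for i in range(5)}
split = vote_dbd_sketch({f"f{i}": 1.0 for i in range(5)}, five_way)
```

In the first call GATA holds most of the surviving score mass and wins; in the second, five families share the mass equally (0.2 each), so the seqlet stays unlabelled. Under this reading, lowering dbd_vote_min_share or raising top_k_motifs changes how often that second outcome occurs.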

Composite excluded from the cluster-level binomial background

Composite is folded into NaN in the background distribution, so the cluster-enrichment test can never promote it to a cluster label. NaN is kept as legitimate background mass (unknown motifs still occupy library slots).
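As an illustration of the folding step only, the sketch below builds a background distribution from per-motif labels. The function name, the idea of passing a flat list of library labels, and the use of None for NaN are all assumptions; how the PR feeds this distribution into the binomial test is not shown here.

```python
from collections import Counter


def background_distribution(motif_labels):
    """Fraction of library motifs per DBD label, with Composite folded into None."""
    folded = [None if label == "Composite" else label for label in motif_labels]
    counts = Counter(folded)
    n = len(folded)
    return {label: count / n for label, count in counts.items()}


# Toy library of eight motifs (labels are illustrative).
labels = ["GATA", "GATA", "Composite", None, "ETS", "Composite", None, "GATA"]
background = background_distribution(labels)

# Only real families are candidates for a cluster label; the unknown mass
# stays in the background so the other probabilities are not inflated.
candidates = {label for label in background if label is not None}
```

Because Composite never appears as its own key, it cannot be tested for enrichment, while the combined unknown mass (4 of 8 motifs in this toy library) still counts toward the denominator.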

No change to extract_seqlets, TomTom similarity, or cluster assignment itself.
No new dependencies; new kwargs are additive with sensible defaults, so no callsite changes required.

Example (PBMC)

Before Change

After Change
We still see a GATA cluster, but instead of a noisy cluster near the center that does not get assigned a motif, it is now a distinct cluster with clear GATA-like motifs.
Overall I think it's a bit better, though there are probably not that many differences.
One major difference, though: at the seqlet level we now get far more NaN seqlets (in my case 17K of 47K). Perhaps some part of this workflow is a bit too strict.

LukasMahieu added 4 commits April 13, 2026 15:08

fbf37d5 change seqlet assignment to prioritize hierarchy of direct annotations --> orthology --> motif similarity
18a925e add warning if annotations dont have unique names
1269d9c add a voting mechanism to cluster_dbd that assigns 'composite' seqlets to a majority dbd with threshold
a664072 add unit tests for new seqlet assignment mechanism

LukasMahieu requested a review from SeppeDeWinter April 13, 2026 15:03

LukasMahieu wants to merge 4 commits into main from bugfix-seqlet-assignment
