
Bugfix seqlet assignment#44

Open
LukasMahieu wants to merge 4 commits into main from bugfix-seqlet-assignment


LukasMahieu (Collaborator) commented Apr 13, 2026

The Bug

We noticed that, across many TF-MINDI runs, seqlets were often being misclassified as "GATA".

The root cause of this bug was twofold:

  1. Wrong motif → DBD mapping. load_motif_to_dbd collapsed every motif's TF list to a single DBD via mode().iat[0]. For a true dimer motif like tfdimers__MD00378 (annotated NFIC, NFIA, E2F1), this silently picked one family by alphabetical tiebreak instead of flagging it as a mixed-family composite.
  2. Winner-takes-all per-seqlet assignment. Each seqlet's DBD was set from the single argmax motif in its TomTom row. One noisy top hit dominated, with no "I don't know" option.

For example:

MotifID | Direct_annot | Motif_similarity_annot | Orthology_annot | Motif_similarity_and_Orthology_annot
tfdimers__MD00226 | MYB, ETS2 | GATA3, OTX1, GATA4, IRF1, GATA2, NEUROD1, NFAT... | NaN | NaN
tfdimers__MD00378 | NFIC, E2F1, NFIA | GATA3, PAX4, ZEB1, YY1, GATA4, USF1, GATA2, NF... | NaN | NaN
tfdimers__MD00564 | NFIC, YY1, NFIA | CRX, GATA3, GATA4, USF1, GATA2, RAX, GATA5, NF... | NaN | NaN

These three dimers were all assigned to "GATA" instead of a label based on their Direct_annot.
Previously, we simply concatenated all annotation columns and took a majority vote:

motif_to_tf = (
    motif_to_tf.apply(lambda row: ", ".join(row.dropna()), axis=1)
    .str.split(", ")
    .explode()
    .reset_index()
    .rename({0: "TF"}, axis=1)
)

# For each motif, take the most common (mode) DBD annotation
motif_to_dbd = (
    motif_to_tf.dropna()
    .groupby("MotifID")["DBD"]
    .agg(lambda x: x.mode().iat[0])  # take the first mode if there's a tie
    .reset_index()
)

In this scenario, since GATA appeared 6 times (GATA1, GATA2, ...), it always won these majority votes.
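
To make the failure mode concrete, here is a minimal pure-Python sketch of the old majority vote applied to tfdimers__MD00378. It is not the project's code: majority_vote and TF_TO_DBD are hypothetical, the family labels are illustrative placeholders rather than values from the real human TF table, and only the TFs visible in the table above are used (the truncated part of the similarity list is left out).

```python
from collections import Counter

# Hypothetical TF -> DBD family lookup; labels are placeholders for illustration.
TF_TO_DBD = {
    "NFIC": "NFI", "NFIA": "NFI", "E2F1": "E2F",
    "GATA2": "GATA", "GATA3": "GATA", "GATA4": "GATA",
    "PAX4": "PAX", "ZEB1": "ZEB", "YY1": "YY1", "USF1": "USF",
}


def majority_vote(cells, tf_to_dbd):
    """Emulate the old behaviour: pool all annotation cells, then take the mode."""
    tfs = [tf.strip() for cell in cells if cell for tf in cell.split(",")]
    families = [tf_to_dbd[tf] for tf in tfs if tf in tf_to_dbd]
    if not families:
        return None
    counts = Counter(families)
    top = max(counts.values())
    # Ties go to the alphabetically first family, like mode().iat[0]
    # (pandas returns modes in sorted order).
    return min(fam for fam, n in counts.items() if n == top)


# Direct_annot followed by the visible part of Motif_similarity_annot.
md00378 = ["NFIC, E2F1, NFIA", "GATA3, PAX4, ZEB1, YY1, GATA4, USF1, GATA2"]
label = majority_vote(md00378, TF_TO_DBD)
print(label)
```

Even with only three GATA paralogs visible, the pooled similarity hits outnumber the two NFI entries from Direct_annot, so the direct evidence loses the vote.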

What changed

Seqlet assignment

To understand what changed, it's easiest to inspect the new unit tests in tests/test_seqlet_assignment.py.

For each motif, we walk the four annotation columns in priority order:

  1. Direct_annot
  2. Orthology_annot
  3. Motif_similarity_annot
  4. Motif_similarity_and_Orthology_annot

At each tier we do the same three steps:

  1. Parse the comma-separated TF list in that cell (skip if empty/NaN).
  2. Look up each TF's DBD family in the human TF table (TFs not in the table are silently skipped).
  3. Collapse the result into a set of distinct families.

Then we decide:

Set size | Tier | Decision
0 (no TFs matched) | any | skip, try next tier
1 | any | label = that family, stop walking
>1 | Direct_annot | label = "Composite", stop walking
>1 | any lower tier | ambiguous — fall through to next tier

So tfdimers__MD00378 now correctly becomes Composite, and things like GATA1..GATA6 listed under similarity still collapse to GATA.
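
The tier walk and decision table above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: assign_motif_dbd and the tf_to_dbd dictionary are hypothetical names, the family labels are placeholders, and returning None when every tier is exhausted is an assumption (the description does not say what happens in that case).

```python
import math

# Annotation columns in the priority order described above.
TIERS = (
    "Direct_annot",
    "Orthology_annot",
    "Motif_similarity_annot",
    "Motif_similarity_and_Orthology_annot",
)


def _is_empty(cell):
    """True for None, NaN, or a blank string."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return not str(cell).strip()


def assign_motif_dbd(annotations, tf_to_dbd):
    """Walk the tiers and return a family, "Composite", or None (assumed fallback)."""
    for tier in TIERS:
        cell = annotations.get(tier)
        if _is_empty(cell):
            continue
        # TFs missing from the lookup are silently skipped.
        families = {
            tf_to_dbd[tf.strip()]
            for tf in str(cell).split(",")
            if tf.strip() in tf_to_dbd
        }
        if len(families) == 1:
            return next(iter(families))
        if len(families) > 1 and tier == "Direct_annot":
            return "Composite"
        # 0 families, or ambiguous at a lower tier: try the next tier.
    return None


# Placeholder family labels, for illustration only.
tf_to_dbd = {"NFIC": "NFI", "NFIA": "NFI", "E2F1": "E2F",
             "GATA1": "GATA", "GATA2": "GATA", "PAX4": "PAX"}

dimer = {"Direct_annot": "NFIC, E2F1, NFIA",
         "Motif_similarity_annot": "GATA1, GATA2"}
similar_only = {"Direct_annot": float("nan"),
                "Motif_similarity_annot": "GATA1, GATA2"}

print(assign_motif_dbd(dimer, tf_to_dbd))
print(assign_motif_dbd(similar_only, tf_to_dbd))
```

Because the walk stops at the first tier that yields a decision, a mixed-family Direct_annot can no longer be outvoted by a long similarity list.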

Per-seqlet DBD via top-K weighted vote (cluster.py)

Replaced the argmax assignment with a sparse-aware top-K weighted vote (new helper _vote_dbd). For each seqlet:

  • Take the top K motifs by similarity score.
  • Drop any that are NaN or Composite.
  • Sum the similarity scores per DBD family.
  • The winner must hold at least dbd_vote_min_share of the total; otherwise the seqlet is labelled NaN.

Two new optional kwargs on cluster_seqlets: top_k_motifs=5, dbd_vote_min_share=0.4.
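
As a rough illustration of the vote, here is a dictionary-based sketch. It is not the actual _vote_dbd helper, which works on the sparse TomTom matrix: vote_dbd and its argument layout are hypothetical, None stands in for NaN, and "the total" is assumed to mean the summed scores of the motifs that survive the NaN/Composite filter.

```python
def vote_dbd(scores, motif_dbds, top_k=5, min_share=0.4):
    """Top-K weighted vote for one seqlet.

    scores     : dict motif_id -> similarity score (only non-zero hits)
    motif_dbds : dict motif_id -> DBD family, "Composite", or None
    Returns the winning family, or None when no family is convincing.
    """
    # Take the top K motifs by similarity score.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Sum scores per family, dropping unlabelled and Composite motifs.
    totals = {}
    for motif, score in top:
        dbd = motif_dbds.get(motif)
        if dbd is None or dbd == "Composite":
            continue
        totals[dbd] = totals.get(dbd, 0.0) + score

    total = sum(totals.values())
    if total <= 0:
        return None

    winner, mass = max(totals.items(), key=lambda kv: kv[1])
    # Assumption: the share is computed over the retained scores only.
    return winner if mass / total >= min_share else None


motif_dbds = {"m1": "GATA", "m2": "GATA", "m3": "ETS", "m4": None, "m5": "Composite"}
clear = {"m1": 0.9, "m2": 0.8, "m3": 0.7, "m4": 0.1}
only_composite = {"m5": 0.95}

print(vote_dbd(clear, motif_dbds))
print(vote_dbd(only_composite, motif_dbds))
```

In the first call GATA holds 1.7 of 2.4 retained score mass and wins; in the second, nothing survives the filter, so the seqlet stays unlabelled instead of inheriting a single noisy hit.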

Composite excluded from the cluster-level binomial background

Composite is folded into NaN in the background distribution, so the cluster-enrichment test can never promote it to a cluster label. NaN is kept as legitimate background mass (unknown motifs still occupy library slots).
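
A minimal sketch of the folding step, assuming the background is the fraction of library motifs carrying each label. The function name background_distribution is hypothetical and None stands in for NaN; how the real binomial test consumes these fractions is not shown here.

```python
from collections import Counter


def background_distribution(motif_labels):
    """Fraction of library motifs per label, with Composite folded into None."""
    folded = [None if label is None or label == "Composite" else label
              for label in motif_labels]
    counts = Counter(folded)
    n = len(folded)
    return {label: count / n for label, count in counts.items()}


library = ["GATA", "GATA", "Composite", None]
background = background_distribution(library)
print(background)
```

Composite never appears as a key, so it cannot be tested as a candidate cluster label, while its library slots still count toward the unknown (None) mass.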

  • No change to extract_seqlets, TomTom similarity, or cluster assignment itself.
  • No new dependencies; new kwargs are additive with sensible defaults, so no callsite changes required.

Example (PBMC)

  1. Before change
[3 images]
  2. After change
    We still see a GATA cluster, but instead of a noisy cluster near the center that does not get assigned a motif, it is now a distinct cluster with clear GATA-like motifs.
    Overall I think it's a bit better, though there are probably not that many differences.
    One major difference, though: at the seqlet level we now get far more NaN seqlets (17K of 47K in my case). Some part of this workflow may be a bit too strict.
[4 images]
