…s --> orthology --> motif similarity
…s to a majority dbd with threshold
The Bug
We noticed that, across many TF-MINDI runs, seqlets were often being misclassified as "GATA".
The root cause of this bug was twofold:
NFIC, NFIA, E2F1), this silently picked one family by alphabetical tiebreak instead of flagging it as a mixed-family composite.
For example:
These three dimers were all assigned to "GATA" instead of prioritizing their Direct_annot.
We used to simply concatenate all these annotations and take a majority vote:
In this scenario, since GATA appeared 6 times (GATA1, GATA2, ...), it always won these majority votes.
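The failure mode described above can be reproduced with a small sketch. The TF names, the family mapping, and the helper `naive_majority_family` are illustrative assumptions, not the project's actual data or code; the point is only that one direct annotation is outvoted by several similarity hits from the same family.

```python
from collections import Counter

# Hypothetical annotations for a single dimer motif (illustrative only).
direct_annot = ["TAL1"]
similarity_annot = ["GATA1", "GATA2", "GATA3", "GATA4", "GATA5", "GATA6"]

# Hypothetical TF -> family lookup (not taken from the real motif database).
tf_to_family = {"TAL1": "bHLH", **{f"GATA{i}": "GATA" for i in range(1, 7)}}


def naive_majority_family(*annotation_lists):
    """Concatenate every annotation list and return the most common family."""
    families = [tf_to_family[tf] for tfs in annotation_lists for tf in tfs]
    return Counter(families).most_common(1)[0][0]


# One direct bHLH annotation loses against six GATA similarity hits.
old_label = naive_majority_family(direct_annot, similarity_annot)
```

Because every annotation counts equally here, the direct annotation has no way to override the larger similarity list.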
What changed
Seqlet assignment
To understand what changed, it's easiest to inspect the new unit tests in tests/test_seqlet_assignment.py.
For each motif, we walk the four annotation columns in priority order:
At each tier we do the same three steps:
Then we decide:
So tfdimers__MD00378 now correctly becomes Composite, and things like GATA1..GATA6 listed under similarity still collapse to GATA.
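A rough sketch of a priority-ordered walk is shown below. It is an assumption-laden illustration, not the code in this PR: the function `assign_family_sketch`, the family names, and the rule "any disagreement within the first non-empty tier gives Composite" are all hypothetical (the real implementation may, for instance, apply a majority threshold instead).

```python
def assign_family_sketch(tiers, tf_to_family):
    """Return a family label from the highest-priority tier that has one.

    tiers: list of TF-name lists, highest priority first (hypothetical layout).
    tf_to_family: dict mapping TF name -> family (hypothetical lookup).
    """
    for tfs in tiers:
        # Collapse TF names to the set of families seen at this tier.
        families = {tf_to_family[tf] for tf in tfs if tf in tf_to_family}
        if not families:
            continue  # nothing usable here, fall through to the next tier
        if len(families) == 1:
            return families.pop()  # e.g. GATA1..GATA6 collapse to "GATA"
        return "Composite"  # assumed rule: mixed families are flagged
    return None  # no tier produced an annotation


# Illustrative family lookup; names are assumptions.
families = {"NFIC": "NFI", "NFIA": "NFI", "E2F1": "E2F"}
families.update({f"GATA{i}": "GATA" for i in range(1, 7)})
gata_hits = [f"GATA{i}" for i in range(1, 7)]

mixed = assign_family_sketch([["NFIC", "NFIA", "E2F1"], gata_hits], families)
similar_only = assign_family_sketch([[], [], [], gata_hits], families)
```

Stopping at the first non-empty tier is what keeps a long lower-priority list from outvoting a short higher-priority one.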
Per-seqlet DBD via top-K weighted vote (cluster.py)
Replaced the argmax assignment with a sparse-aware top-K weighted vote (new helper _vote_dbd). For each seqlet:
Two new optional kwargs on cluster_seqlets: top_k_motifs=5, dbd_vote_min_share=0.4.
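To make the difference from argmax concrete, here is a minimal sketch of a top-K weighted vote. It is not the actual `_vote_dbd` helper: the function name `vote_dbd_sketch`, the use of raw similarity scores as weights, the treatment of non-positive scores as absent, and the choice of denominator are all assumptions. Only the two defaults, `top_k_motifs=5` and `dbd_vote_min_share=0.4`, come from this PR. `None` stands in for NaN.

```python
import numpy as np


def vote_dbd_sketch(scores, motif_dbds, top_k_motifs=5, dbd_vote_min_share=0.4):
    """Pick a DBD for one seqlet from its top-K motif matches (sketch)."""
    scores = np.asarray(scores, dtype=float)
    k = min(top_k_motifs, scores.size)
    # Indices of the K best-scoring motifs, best first.
    top_idx = np.argsort(scores)[::-1][:k]
    # Assumed "sparse-aware" behaviour: zero or negative scores do not vote.
    top_idx = [int(i) for i in top_idx if scores[i] > 0]
    total = sum(scores[i] for i in top_idx)
    if total == 0:
        return None
    votes = {}
    for i in top_idx:
        dbd = motif_dbds[i]
        if dbd is None:
            continue  # unlabelled motifs add to the total but cast no vote
        votes[dbd] = votes.get(dbd, 0.0) + scores[i]
    if not votes:
        return None
    winner = max(votes, key=votes.get)
    # Require the winner to hold a minimum share of the top-K weight.
    return winner if votes[winner] / total >= dbd_vote_min_share else None


scores = [0.6, 0.5, 0.45, 0.0, 0.1]
dbds = ["GATA", "bHLH", "bHLH", None, None]
argmax_label = dbds[int(np.argmax(scores))]
voted_label = vote_dbd_sketch(scores, dbds)
```

In this toy case the single best motif is GATA, but two slightly weaker bHLH matches together carry more weight, so the vote and the argmax disagree. A seqlet whose top matches are split evenly across families falls below the share threshold and stays unassigned, which is consistent with more seqlets ending up as NaN.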
Composite excluded from the cluster-level binomial background
Composite is folded into NaN in the background distribution, so the cluster-enrichment test can never promote it to a cluster label. NaN is kept as legitimate background mass (unknown motifs still occupy library slots).
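The folding step can be sketched as follows. This is an illustration under assumptions, not the project's code: `background_shares` and `binom_sf` are hypothetical helpers, the library labels are made up, `None` stands in for NaN, and the one-sided binomial tail is only one plausible form of the cluster-enrichment test.

```python
from collections import Counter
from math import comb


def background_shares(library_labels):
    """Share of each label in the motif library, with Composite folded into None."""
    folded = [None if lab in (None, "Composite") else lab for lab in library_labels]
    counts = Counter(folded)
    n = len(folded)
    return {lab: c / n for lab, c in counts.items()}


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Made-up library: Composite and unknown motifs both end up under None.
library = ["GATA", "GATA", "bHLH", "Composite", None, "Composite", "bHLH", "GATA"]
shares = background_shares(library)

# Only real families are candidates for a cluster label.
candidates = [lab for lab in shares if lab is not None]

# Example: a cluster of 10 seqlets, 7 of them voted GATA.
p_value = binom_sf(7, 10, shares["GATA"])
```

Keeping the `None` mass in the denominator means the family shares are not inflated by dropping unknown motifs, while Composite never appears as a candidate label.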
Example (PBMC)
We still see a GATA cluster, but instead of a noisy cluster near the center that does not get assigned a motif, it becomes a distinct cluster with clear GATA-like motifs.
I think overall it's a bit better, though there are probably not that many differences.
One major difference though: at the seqlet level, we will have far more NaN seqlets (in my case 17K / 47K seqlets). Perhaps some part of this workflow is a bit too strict.