Metadarkmatter

ANI-weighted placement uncertainty for detecting novel microbial diversity in environmental DNA (eDNA) samples.

Overview

Metadarkmatter identifies novel bacterial taxa ("microbial dark matter") in metagenomic data using competitive read recruitment with ANI-weighted placement uncertainty. The tool distinguishes between true novelty and taxonomic ambiguity by analyzing both sequence divergence and phylogenetic placement confidence.

Core Innovation: Combines BLAST alignment with Average Nucleotide Identity (ANI) matrices to tell reads that originate from genuinely novel taxa apart from reads that merely match several reference genomes at similar identity.

Key Metrics

Novelty Index (N): Sequence divergence from the closest reference genome
Placement Uncertainty (U): Phylogenetic ambiguity based on ANI between competing genomes
Classification: Known Species, Novel Species, Novel Genus, or Conserved Region

How It Works

Competitive alignment: Reads aligned against all reference genomes simultaneously
ANI-weighted scoring: Top hits evaluated using genome-genome ANI matrix
Uncertainty quantification: Placement confidence computed from ANI between competing matches
Threshold-based classification: Literature-backed boundaries for species (95% ANI) and genus (75% ANI)
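
The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation. The `Hit` layout, the function names, the `score_fraction` cutoff for deciding which hits count as competitors, and the exact definitions (N as 100 minus the top hit's percent identity, U as 100 minus the highest ANI between the top-hit genome and any competing genome) are all assumptions made for the example.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    genome: str      # reference genome the read aligned to
    pident: float    # percent identity of the alignment
    bitscore: float  # alignment bitscore


def novelty_index(hits: list[Hit]) -> float:
    """N: divergence (in percent) from the best-scoring reference genome."""
    best = max(hits, key=lambda h: h.bitscore)
    return 100.0 - best.pident


def placement_uncertainty(
    hits: list[Hit],
    ani: dict[frozenset[str], float],
    score_fraction: float = 0.95,  # hypothetical cutoff for "competing" hits
) -> float:
    """U: 100 minus the highest ANI between the top genome and its competitors."""
    best = max(hits, key=lambda h: h.bitscore)
    competitors = {
        h.genome
        for h in hits
        if h.genome != best.genome and h.bitscore >= score_fraction * best.bitscore
    }
    if not competitors:
        return 0.0  # a single candidate genome: no placement ambiguity
    max_ani = max(ani[frozenset((best.genome, g))] for g in competitors)
    return 100.0 - max_ani


# Toy data: genome C scores too low to compete with A.
hits = [Hit("A", 92.0, 200.0), Hit("B", 91.5, 195.0), Hit("C", 80.0, 100.0)]
ani = {frozenset(("A", "B")): 99.0, frozenset(("A", "C")): 82.0}
n_value = novelty_index(hits)
u_value = placement_uncertainty(hits, ani)
```

Under these assumed definitions, a read whose near-equal hits fall on closely related genomes gets a low U, while a read that matches distantly related genomes equally well gets a high U, which is the pattern expected from a conserved region.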

Installation

1. Install Python Package

# Clone repository
git clone https://github.com/FOI-Bioinformatics/metadarkmatter.git
cd metadarkmatter

# Install with pip
pip install -e .

Requirements: Python >= 3.11

2. Install External Tools

Metadarkmatter requires external bioinformatics tools. Install via conda/mamba:

conda create -n metadarkmatter -c conda-forge -c bioconda \
  kraken2 krakentools blast skani mmseqs2 diamond pyani
conda activate metadarkmatter

Core Tools:

Kraken2 + KrakenTools: Taxonomic read classification and extraction
BLAST+: Nucleotide alignment (mdm blast align accepts FASTQ input and converts it to FASTA automatically)
skani: Fast ANI computation
MMseqs2: Fast sequence search for large datasets (>100K reads)
Diamond: Protein alignment and AAI computation (for protein mode)

Optional: seqtk (assembly workflows; not included in the conda command above), pyani (alternative ANI tool)

Quick Start

Complete workflow for detecting novel diversity in Francisellaceae:

# 1. Download reference genomes from GTDB
mdm download genomes list "f__Francisellaceae" --output genomes.tsv
mdm download genomes fetch --accessions genomes.tsv --output-dir genomes/

# 2. Extract target family reads (requires Kraken2 database)
mdm kraken2 classify --reads-1 sample_R1.fastq.gz --kraken-db db/ --output kraken_out/
mdm kraken2 extract --kraken-output kraken_out/sample.kraken \
  --reads-1 sample_R1.fastq.gz --taxid 119060 --output extraction/

# 3. Build BLAST database and align
mdm blast makedb --genomes genomes/ --output blastdb/pangenome
mdm blast align --query extraction/reads_R1.fastq.gz \
  --database blastdb/pangenome --output sample.blast.tsv.gz --threads 16

# 4. Compute ANI matrix
mdm ani compute --genomes genomes/ --output ani_matrix.csv --threads 16

# 5. Classify reads and generate report
mdm score classify --alignment sample.blast.tsv.gz --ani ani_matrix.csv \
  --metadata genome_metadata.tsv --output classifications.csv
mdm report generate --classifications classifications.csv \
  --metadata genome_metadata.tsv --output report.html

Note: Both the metadarkmatter and mdm commands are available. See the Tutorial for a detailed walkthrough.

Commands

Command	Description
`download genomes list`	Query GTDB for family genomes (creates metadata)
`download genomes fetch`	Download genomes from NCBI
`kraken2 classify`	Run Kraken2 classification
`kraken2 extract`	Extract reads for target taxid
`blast makedb`	Build BLAST database (standardizes headers)
`blast align`	Run competitive BLAST alignment
`mmseqs2 makedb`	Build MMseqs2 database (for large datasets)
`mmseqs2 search`	MMseqs2 search (supports paired-end)
`blastx align`	Protein-level alignment with Diamond
`ani compute`	Compute ANI matrix (skani/fastANI)
`aai compute`	Compute AAI matrix (Diamond)
`score classify`	ANI-weighted classification (core algorithm)
`score batch`	Batch process multiple samples
`score extract-novel`	Extract candidate novel species/genera
`report generate`	Create HTML report with interactive tabs
`report multi`	Multi-sample comparison
`util generate-mapping`	Generate contig-to-genome ID mapping

Key Features

Species-Level Tracking

Metadarkmatter automatically tracks species metadata throughout the pipeline:

Automatic metadata: genome_metadata.tsv created during genome download with species, genus, family, and full GTDB taxonomy
Standardized headers: FASTA headers rewritten to {accession}|{contig_id} format for reliable genome identification from multi-contig draft genomes
Species aggregation: --metadata option in score classify adds species/genus columns to output
Species breakdown: HTML reports include a dedicated "Species Breakdown" tab with composition charts
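
The header convention described above is simple enough to sketch. The snippet below is an illustration under assumptions, not the code used by blast makedb: both function names are hypothetical, the contig ID is taken to be the first whitespace-delimited token of the original header, and the accession shown is a placeholder.

```python
def standardize_header(header: str, accession: str) -> str:
    """Rewrite a FASTA header line to '>{accession}|{contig_id}'."""
    # Assumption: the contig ID is the first token after the leading '>'.
    contig_id = header.lstrip(">").split()[0]
    return f">{accession}|{contig_id}"


def genome_from_subject(subject_id: str) -> str:
    """Recover the genome accession from an '{accession}|{contig_id}' ID."""
    return subject_id.split("|", 1)[0]


# Placeholder accession and contig name, for illustration only.
new_header = standardize_header(">contig_12 length=5000", "GCF_000000000.1")
genome = genome_from_subject(new_header.lstrip(">"))
```

Because every contig of a draft genome carries the same accession prefix, hits to any of its contigs can be attributed to one genome with a single string split.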

Novel Diversity Detection

Identify Novel Species candidates (4-20% novelty, <1.5% uncertainty)
Identify Novel Genus candidates (20-25% novelty, <1.5% uncertainty)
Extract candidate reads with score extract-novel for targeted assembly
Literature-backed thresholds based on 95-96% ANI species boundary
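
The thresholds listed above translate into a small decision function. This is a hedged sketch rather than the logic of score classify: the function name, the handling of values that fall exactly on a boundary, the choice to label every read at or above 1.5% uncertainty as Conserved Region, and the "Unclassified" fallback beyond 25% novelty are simplifications assumed for the example.

```python
def classify(novelty: float, uncertainty: float, max_uncertainty: float = 1.5) -> str:
    """Map a (novelty, uncertainty) pair in percent to a category label."""
    if uncertainty >= max_uncertainty:
        # Assumption: ambiguous placement is reported as a conserved region.
        return "Conserved Region"
    if novelty < 4.0:
        return "Known Species"
    if novelty < 20.0:
        return "Novel Species"   # 4-20% novelty
    if novelty <= 25.0:
        return "Novel Genus"     # 20-25% novelty
    return "Unclassified"        # hypothetical label, not from the README


labels = [classify(n, u) for n, u in [(2.0, 0.5), (8.0, 1.0), (22.0, 0.2), (8.0, 5.0)]]
```

The uncertainty check comes first so that a divergent read is only called novel when its placement is unambiguous.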

Coverage-Weighted Hit Selection

Optional feature to prioritize alignments spanning larger portions of reads:

Problem solved: Short conserved domains (e.g., 16S rRNA) with high identity can dominate over longer, more informative alignments
Solution: Weight bitscore by alignment coverage to favor hits that explain more of the read
Modes: Linear, logarithmic, or sigmoid weighting functions
Usage: --coverage-weight-mode linear in score classify
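
To make the idea concrete, here is one way the three weighting modes could look. The formulas, the mode strings "log" and "sigmoid", the sigmoid steepness of 10, and both function names are assumptions for illustration; the package may define them differently. Coverage is taken to be aligned length divided by read length.

```python
import math


def coverage_weight(coverage: float, mode: str = "linear") -> float:
    """Map alignment coverage in [0, 1] to a weight in [0, 1]."""
    c = min(max(coverage, 0.0), 1.0)  # clamp out-of-range inputs
    if mode == "linear":
        return c
    if mode == "log":
        # Rises quickly at low coverage; 0 -> 0 and 1 -> 1.
        return math.log1p(9.0 * c) / math.log(10.0)
    if mode == "sigmoid":
        # Sharp transition around 50% coverage.
        return 1.0 / (1.0 + math.exp(-10.0 * (c - 0.5)))
    raise ValueError(f"unknown mode: {mode}")


def weighted_bitscore(
    bitscore: float, aligned_length: int, read_length: int, mode: str = "linear"
) -> float:
    """Scale a bitscore by how much of the read the alignment covers."""
    return bitscore * coverage_weight(aligned_length / read_length, mode)


# A short high-identity hit versus a full-length hit on a 150 bp read.
short_hit = weighted_bitscore(120.0, 50, 150)
long_hit = weighted_bitscore(100.0, 150, 150)
```

In this toy case the short hit has the higher raw bitscore (120 versus 100), but after linear weighting the full-length hit ranks first, which is the behaviour the feature is meant to produce.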

Performance

Classification Performance (BLAST results → classifications):

All classification uses the Polars-based vectorized engine with automatic parallelization.

Dataset	Mode	Runtime	RAM
< 10M alignments	default	2-10 min	4-8 GB
10-100M	default	15-45 min	16 GB
100M+	`--streaming`	1-2 hr	16 GB (constant)
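
Constant memory use on very large inputs generally comes from processing one read's hits at a time instead of loading the whole table. The sketch below shows that general technique in plain Python with itertools.groupby; it is not the Polars-based engine the tool uses. It assumes the alignment file follows the default 12-column BLAST tabular layout (bitscore in the last column) and that all hits for a read are on consecutive lines. The function name is hypothetical.

```python
import itertools
from typing import Iterable, Iterator


def best_hit_per_read(lines: Iterable[str]) -> Iterator[tuple[str, str, float]]:
    """Yield (read_id, subject_id, bitscore) for each read's top-scoring hit."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby only holds the hits of the current read in memory.
    for read_id, hits in itertools.groupby(rows, key=lambda r: r[0]):
        top = max(hits, key=lambda r: float(r[11]))  # column 12 = bitscore
        yield read_id, top[1], float(top[11])


example = [
    "r1\tA|c1\t98.0\t150\t3\t0\t1\t150\t1\t150\t1e-50\t250.0\n",
    "r1\tB|c1\t97.0\t150\t4\t0\t1\t150\t1\t150\t1e-48\t240.0\n",
    "r2\tB|c2\t90.0\t150\t15\t0\t1\t150\t1\t150\t1e-30\t180.0\n",
]
top_hits = list(best_hit_per_read(example))
```

Any iterable of lines works as input, so an open file handle can be passed directly and the file is never read into memory as a whole.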

Alignment Performance (reads → BLAST results):

BLAST is efficient for typical workflows. For very large datasets (>100K reads), consider MMseqs2:

Reads	BLAST Time	MMseqs2 Time	When to Use
<10K	10s-10min	Slower	Use BLAST
100K	30-60 min	5-10 min	Either tool works
1M+	3-6 hours	15-30 min	MMseqs2 recommended

Note: mdm blast align accepts FASTQ input directly and converts it to FASTA automatically. See the Tutorial for workflow details.
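
For readers curious about what a FASTQ-to-FASTA conversion involves, a minimal sketch follows. It is not the tool's converter: the function name is hypothetical and the input is assumed to use the common four-lines-per-record FASTQ layout.

```python
import io
from typing import Iterator, TextIO


def fastq_to_fasta(fastq: TextIO) -> Iterator[str]:
    """Yield FASTA lines from a FASTQ stream with four lines per record."""
    while True:
        header = fastq.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got: {header!r}")
        sequence = fastq.readline().strip()
        fastq.readline()  # '+' separator line, ignored
        fastq.readline()  # quality string, dropped in FASTA
        yield ">" + header[1:].strip()
        yield sequence


demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 sample\nGGCC\n+\nIIII\n")
fasta_lines = list(fastq_to_fasta(demo))
print("\n".join(fasta_lines))
```

For gzipped reads, a text stream opened with `gzip.open(path, "rt")` can be passed in place of the in-memory demo stream.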

Use Cases

Environmental Monitoring

Track novel pathogen emergence in environmental samples
Monitor microbial diversity shifts in ecosystems
Identify candidate novel species for targeted isolation

Biosurveillance

Detect divergent bacterial lineages in clinical or environmental DNA samples
Quantify taxonomic coverage gaps relative to reference databases
Prioritize samples for further characterization

Research Applications

Characterize uncultured microbial diversity
Validate reference genome coverage for target taxa
Generate candidate lists for genome-resolved metagenomics

Documentation

Getting Started

Tutorial - Complete walkthrough for environmental samples
Workflow Guide - Step-by-step analysis patterns
User Guide - Detailed usage examples

Reference

CLI Reference - Complete command documentation
API Reference - Python API documentation
Algorithm Details - Statistical framework and literature references

Advanced

Performance Guide - Optimization strategies
Troubleshooting - Common issues and solutions
Architecture - System design and internals

Citation

If you use metadarkmatter in your research, please cite:

@software{metadarkmatter2026,
  author = {Metadarkmatter Team},
  title = {Metadarkmatter: ANI-weighted placement uncertainty for detecting novel microbial diversity},
  year = {2026},
  url = {https://github.com/FOI-Bioinformatics/metadarkmatter}
}

Related Methods

This tool implements concepts from:

ANI species boundary: Jain et al. (2018). "High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries." Nature Communications 9:5114.
Competitive recruitment: Rodriguez-R et al. (2018). "The Microbial Genomes Atlas (MiGA) webserver." Nucleic Acids Research.

Contributing

Contributions are welcome! Please:

Open an issue for bug reports or feature requests
Submit pull requests for code contributions
Follow existing code style (ruff formatting, type hints)
Add tests for new features

Support

Issues: GitHub Issues
Documentation: docs/
Tutorial: docs/TUTORIAL_ENVIRONMENTAL_SPECIES.md

License

MIT License - see LICENSE file for details.
