A pre-trained transformer model for inference on insect DNA barcoding data.
Read our paper in Bioinformatics Advances (Millan Arias et al., 2026). If you use BarcodeBERT in your research, please consider citing us.
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
"bioscan-ml/BarcodeBERT", trust_remote_code=True
)
# Load the model
model = AutoModel.from_pretrained("bioscan-ml/BarcodeBERT", trust_remote_code=True)
# Sample sequence
dna_seq = "ACGCGCTGACGCATCAGCATACGA"
# Tokenize
input_seq = tokenizer(dna_seq, return_tensors="pt")["input_ids"]
# Pass through the model
output = model(input_seq.unsqueeze(0))["hidden_states"][-1]
# Compute Global Average Pooling
features = output.mean(1)
- Clone this repository and install the required libraries.
The instructions below assume a working pip- or uv-managed Python environment. Requires Python 3.11 or 3.12 (torchtext, a deprecated dependency, does not provide wheels for 3.13+). See pyproject.toml for the full list of pinned dependencies.
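The version constraint above can be checked before installing. The following is a minimal sketch, not part of the repository: the names `SUPPORTED` and `is_supported` are hypothetical, and the set of accepted versions is copied from the requirement stated above.

```python
import sys

# Supported (major, minor) versions, taken from the requirement stated above.
SUPPORTED = {(3, 11), (3, 12)}


def is_supported(version=None):
    """Return True if the (major, minor, ...) version is 3.11 or 3.12.

    Hypothetical helper; defaults to the running interpreter's version.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) in SUPPORTED


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if is_supported():
        print(f"Python {major}.{minor} is supported.")
    else:
        print(f"Python {major}.{minor} is not supported; use 3.11 or 3.12.")
```

Running this with the interpreter you intend to use for the install reports whether that interpreter falls inside the supported range.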
pip install -e .
Or, using uv:
uv sync
- Download the data from our Hugging Face Dataset repository
cd data/
python download_HF_CanInv.py
Optional: You can also download the first version of the data
wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
unzip data.zip
mv new_data/* data/
rm -r new_data
rm data.zip
- DNA foundation model baselines: The desired backbone can be selected using one of the following keywords:
BarcodeBERT, NT, Hyena_DNA, DNABERT, DNABERT-2, DNABERT-S
python baselines/knn_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
python baselines/linear_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
python baselines/finetuning.py --backbone=<DESIRED-BACKBONE> --data-dir=data/ --batch_size=32
python baselines/zsc.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
Note: The DNABERT model must be downloaded manually, following the instructions in the paper's repo, and placed in the pretrained-models folder.
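To run one of the baseline scripts above for every backbone, the commands can be generated in a loop. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_all` are hypothetical, and it assumes the scripts are launched from the repository root with the same `--backbone` and `--data-dir` flags shown above. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Backbone keywords listed above.
BACKBONES = ("BarcodeBERT", "NT", "Hyena_DNA", "DNABERT", "DNABERT-2", "DNABERT-S")


def build_commands(script, backbones=BACKBONES, data_dir="data/"):
    """Return one argv list per backbone for the given baseline script."""
    return [
        [sys.executable, script, f"--backbone={name}", f"--data-dir={data_dir}"]
        for name in backbones
    ]


def run_all(script, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(script):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all("baselines/knn_probing.py")
```

Pass `dry_run=False` to execute the runs. The DNABERT run will fail unless its weights have been placed in the pretrained-models folder as described in the note above.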
- Supervised CNN
python baselines/cnn/1D_CNN_supervised.py
python baselines/cnn/1D_CNN_KNN.py
python baselines/cnn/1D_CNN_Linear_probing.py
python baselines/cnn/1D_CNN_ZSC.py
Note: Train the CNN backbone with 1D_CNN_supervised.py before evaluating it on any downtream task.
- BLAST
cd data/
python to_fasta.py --input_file=supervised_train.csv &&
python to_fasta.py --input_file=supervised_test.csv &&
python to_fasta.py --input_file=unseen.csv
makeblastdb -in supervised_train.fas -title train -dbtype nucl -out train.fas
blastn -query supervised_test.fas -db train.fas -out results_supervised_test.tsv -outfmt 6 -num_threads 16
blastn -query unseen.fas -db train.fas -out results_unseen.tsv -outfmt 6 -num_threads 16
To pretrain the model you can run the following command:
python barcodebert/pretraining.py \
--dataset=CANADA-1.5M \
--k_mer=4 \
--n_layers=4 \
--n_heads=4 \
--data_dir=data/ \
--checkpoint=model_checkpoints/CANADA-1.5M/4_4_4/checkpoint_pretraining.pt
If you'd like to contribute to BarcodeBERT, please read our Contributing Guidelines for information about setup, code style, and submission process.
If you find BarcodeBERT useful in your research, please consider citing:
@article{MillanArias2026BarcodeBERT,
author={Millan Arias, Pablo and Sadjadi, Niousha and Safari, Monireh
and Gong, ZeMing and Wang, Austin T and Haurum, Joakim Bruslund
and Zarubiieva, Iuliia and Steinke, Dirk and Kari, Lila
and Chang, Angel X and Lowe, Scott C and Taylor, Graham W},
title={{BarcodeBERT}: Transformers for Biodiversity Analyses},
journal={Bioinformatics Advances},
pages={vbag054},
year={2026},
month=feb,
doi={10.1093/bioadv/vbag054},
}