Pannotator is a scalable and robust pangenome-based prokaryotic genome annotation tool, designed to efficiently process thousands of genomes. It is built upon Bakta to reliably annotate protein-coding and ncRNA genes, while leveraging the scalability and reproducibility of Nextflow.
- Pannotator orchestrates Bakta annotation steps in a modular Nextflow pipeline. It supports the annotation of ncRNA cis-regulatory regions, oriC/oriV/oriT, assembly gaps, as well as tRNA, tmRNA, rRNA, ncRNA genes, CRISPRs, CDSs, and pseudogenes via Bakta.
- To minimise redundant computation, Pannotator clusters CDS features across genomes and annotates only representative sequences from each cluster, propagating annotations back to cluster members.
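The representative-based strategy in the last bullet can be illustrated with a small shell sketch. It assumes two hypothetical tab-separated files that are not part of Pannotator's documented interface: a cluster table with `representative<TAB>member` rows (the layout MMseqs2 uses for its cluster TSV output) and an annotation table with `representative<TAB>product` rows. The function name `propagate_annotations` and the `NA` placeholder are illustrative as well; the pipeline's actual implementation may differ.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: copy each representative's annotation to every
# member of its cluster. File layouts and names are assumptions.
propagate_annotations() {
    local clusters="$1"      # TSV: representative <TAB> member
    local annotations="$2"   # TSV: representative <TAB> product
    awk -F'\t' -v OFS='\t' '
        # First file: remember the product assigned to each representative.
        FILENAME == ARGV[1] { product[$1] = $2; next }
        # Second file: print member, its representative, and the inherited product.
        { print $2, $1, (($1 in product) ? product[$1] : "NA") }
    ' "$annotations" "$clusters"
}
```

Each output row pairs a cluster member with its representative and the product inherited from it, so only one sequence per cluster needs to be annotated.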
Prerequisites:
- Nextflow >= 21.04.0
- Conda or Docker/Singularity
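A pre-flight check of these prerequisites might look like the sketch below. The helper names `version_ge` and `check_prerequisites` are hypothetical and not shipped with Pannotator. The sketch also assumes that `nextflow -v` prints a line of the form `nextflow version X.Y.Z.BUILD` and that GNU `sort` with `-V` is available.

```shell
#!/usr/bin/env bash
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical pre-flight check; call it manually before the first run.
check_prerequisites() {
    local required="21.04.0" found
    command -v nextflow >/dev/null 2>&1 ||
        { echo "nextflow not found on PATH" >&2; return 1; }
    # Assumption: the third field of `nextflow -v` is the version string.
    found=$(nextflow -v | awk '{print $3}')
    version_ge "$found" "$required" ||
        { echo "Nextflow $found is older than $required" >&2; return 1; }
    # At least one software backend must be available.
    command -v conda >/dev/null 2>&1 ||
        command -v docker >/dev/null 2>&1 ||
        command -v singularity >/dev/null 2>&1 ||
        { echo "need Conda, Docker, or Singularity" >&2; return 1; }
}
```

Running `check_prerequisites` returns a non-zero status and prints a message to stderr when a requirement is missing.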
To use Pannotator, clone the repository and add the path to its executable to the PATH variable:
git clone --recurse-submodules https://github.com/sysbio-vo/pannotator.git
cd pannotator
echo "export PATH=\"\$PATH:$(pwd)/nxf_bin\"" >> ~/.bashrc
To annotate a folder of genomes using an existing Bakta database:
pannotator --indir /path/to/folder/with/genomes --outdir /path/to/output/folder --bakta_db /path/to/bakta_db
To output intermediate files, such as the MMseqs2 clustering results and the proteome FASTA file containing all unique sequences, use the --save_intermediate flag:
pannotator --indir /path/to/folder/with/isolates/ --outdir /path/to/output/folder --bakta_db /path/to/bakta_db --save_intermediate
If you don't have a Bakta database, the most recent version will be downloaded automatically during the first run. This may take some time: the light database (v6.0) is ~1.3 GB, while the full database is ~33.1 GB. The light database is downloaded by default. You can select the database type on the command line:
pannotator --indir /path/to/folder/with/isolates/ --outdir /path/to/output/folder --bakta_db /path/to/save/bakta/db/ --bakta_db_type [light|full]
Available generic execution profiles, adapted from the base config by PaM:
- standard (default)
- docker
- singularity
- conda
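A profile is selected with Nextflow's `-profile` option. The dry-run sketch below only prints the resulting command rather than executing it, so it can be reviewed first. The helper name `pannotator_cmd` and its argument order are illustrative assumptions, and the set of accepted profiles is limited to the generic ones listed above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) a pannotator command for a profile.
# The function name and argument order are illustrative assumptions.
pannotator_cmd() {
    local profile="$1" indir="$2" outdir="$3" db="$4"
    case "$profile" in
        standard|docker|singularity|conda) ;;   # generic profiles listed above
        *) echo "unknown profile: $profile" >&2; return 1 ;;
    esac
    printf 'pannotator --indir %s --outdir %s --bakta_db %s -profile %s\n' \
        "$indir" "$outdir" "$db" "$profile"
}
```

For example, `pannotator_cmd docker /path/to/folder/with/genomes /path/to/output/folder /path/to/bakta_db` prints a command that runs the pipeline with the docker profile.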
Instead of cloning the repository, you can use the dedicated pannotator module on Farm, which is kept up to date with the upstream codebase. To start using it, load the environment first:
module load PaM/environment
module load pannotator
After that, you can use the tool by calling pannotator. We recommend using the sanger_lsf profile when running the pipeline on Farm. For instance, to annotate a folder of genomes with Pannotator, run:
pannotator --indir /path/to/folder/with/genomes --outdir /path/to/output/folder --bakta_db /path/to/bakta/db -profile sanger_lsf