sysbio-vo/pannotator

License: MIT | Run with Conda | Run with Docker | Run with Singularity

Pannotator: prokaryotic genome annotation at scale

Pannotator is a scalable and robust pangenome-based prokaryotic genome annotation tool, designed to efficiently process thousands of genomes. It is built upon Bakta to reliably annotate protein-coding and ncRNA genes, while leveraging the scalability and reproducibility of Nextflow.

Description

  • Pannotator orchestrates Bakta annotation steps in a modular Nextflow pipeline. It supports the annotation of ncRNA cis-regulatory regions, oriC/oriV/oriT, assembly gaps, as well as tRNA, tmRNA, rRNA, ncRNA genes, CRISPRs, CDSs, and pseudogenes via Bakta.
  • To minimise redundant computation, Pannotator clusters CDS features across genomes and annotates only representative sequences from each cluster, propagating annotations back to cluster members.
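The propagation step described above can be illustrated with a small shell sketch. This is not Pannotator's actual implementation. It assumes two hypothetical tab-separated files: `clusters.tsv` in the two-column representative/member layout that MMseqs2 writes for cluster results, and `annotations.tsv` mapping each representative to a product name. The function name, file names, and sequence IDs are made up for illustration.

```shell
# Copy each representative's annotation to every member of its cluster.
# Arguments (both hypothetical files, tab-separated):
#   $1  annotations.tsv   representative<TAB>product
#   $2  clusters.tsv      representative<TAB>member
propagate_annotations() {
    local annotations="$1" clusters="$2"
    awk -F'\t' -v OFS='\t' '
        NR == FNR { product[$1] = $2; next }     # first file: store annotation per representative
        $1 in product { print $2, product[$1] }  # second file: emit member with inherited annotation
    ' "$annotations" "$clusters"
}

# Toy data: two clusters, one of which has a second member from another genome.
workdir="$(mktemp -d)"
printf 'g1_00001\tg1_00001\ng1_00001\tg2_00417\ng3_00002\tg3_00002\n' > "$workdir/clusters.tsv"
printf 'g1_00001\tDNA gyrase subunit A\ng3_00002\thypothetical protein\n' > "$workdir/annotations.tsv"

propagate_annotations "$workdir/annotations.tsv" "$workdir/clusters.tsv"
```

Only two sequences are annotated in this toy example, yet all three cluster members receive a product name, which is where the saving comes from when many genomes share near-identical proteins.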

Installation

Prerequisites:

To use Pannotator, clone the repository and add the directory containing the executable to your PATH variable:

git clone --recurse-submodules https://github.com/sysbio-vo/pannotator.git
cd pannotator
echo "export PATH=\"\$PATH:$(pwd)/nxf_bin\"" >> ~/.bashrc
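To confirm that the PATH change took effect, a check along these lines may help. It is a minimal sketch that assumes the launcher is named `pannotator`, as in the examples in this README; the helper name `check_on_path` is hypothetical.

```shell
# Report whether a command can be resolved through the current PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH; check the export line in ~/.bashrc" >&2
        return 1
    fi
}

# Run this in a new terminal, or after `source ~/.bashrc` in the current one.
check_on_path pannotator || true
```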

Examples

To annotate a folder of genomes using an existing Bakta database:

pannotator --indir /path/to/folder/with/genomes --outdir /path/to/output/folder --bakta_db /path/to/bakta_db

To keep intermediate files, such as the MMseqs2 clustering results and the proteome FASTA file containing all unique sequences, add the --save_intermediate flag:

pannotator --indir /path/to/folder/with/isolates/ --outdir /path/to/output/folder --bakta_db /path/to/bakta_db --save_intermediate

If you don't have a Bakta database, the most recent version is downloaded automatically during the first run. The download can take a while: the light database v6.0 is ~1.3 GB, while the full database is ~33.1 GB. The light database is downloaded by default. You can choose the database type on the command line:

pannotator --indir /path/to/folder/with/isolates/ --outdir /path/to/output/folder --bakta_db /path/to/save/bakta/db/ --bakta_db_type [light|full]

The following generic execution profiles, adapted from the base config by PaM, are available:

  • standard (default)
  • docker
  • singularity
  • conda
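A profile is selected with Nextflow's single-dash `-profile` option, as in the `sanger_lsf` example in this README. The sketch below assumes the generic profiles listed above are passed the same way. The wrapper function `build_pannotator_cmd` is hypothetical: it only validates the profile name against that list and prints the resulting command rather than running it.

```shell
# Print a pannotator command line for one of the generic profiles.
# Arguments: profile, input directory, output directory, Bakta database path.
build_pannotator_cmd() {
    local profile="$1" indir="$2" outdir="$3" db="$4"
    case "$profile" in
        standard|docker|singularity|conda) ;;   # generic profiles listed above
        *) echo "unknown profile: $profile" >&2; return 1 ;;
    esac
    printf 'pannotator --indir %s --outdir %s --bakta_db %s -profile %s\n' \
        "$indir" "$outdir" "$db" "$profile"
}

# Example: show the command for a Docker-based run.
build_pannotator_cmd docker /path/to/folder/with/genomes /path/to/output/folder /path/to/bakta_db
```

Printing instead of executing makes it easy to inspect the command before starting a long annotation run.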

Examples for the Wellcome Sanger Institute's Farm users

Instead of cloning the repository, you can use a dedicated pannotator module on Farm, which is maintained to be up to date with the upstream codebase. To start using it, you need to load the environment first:

module load PaM/environment
module load pannotator

After that, you can use the tool by calling pannotator. We recommend using the sanger_lsf profile when running the pipeline on Farm. For instance, to annotate a folder of genomes with Pannotator, run:

pannotator --indir /path/to/folder/with/genomes --outdir /path/to/output/folder --bakta_db /path/to/bakta/db -profile sanger_lsf
