
A Nextflow pipeline for bacterial and viral pathogen analysis — designed for public health applications.


Gagi1993/pathogenflow


🧬 PathogenFlow


Unified Nextflow pipeline for viral and bacterial pathogen genomics

⚙️ Conda/Mamba-based pipeline: Mamba is mandatory for environment setup and dependency handling. Tested with Nextflow versions 24 and 25.

PathogenFlow is an all-in-one Nextflow DSL2 pipeline for Unix-based bioinformatic analysis of viral and bacterial pathogens from Illumina paired-end sequencing data.
It performs quality control, taxonomic classification, assembly, consensus generation, typing, AMR detection, and functional annotation — all automatically.

The workflow supports multiple pathogens out of the box and can easily be extended to new ones. Installation and run instructions are at the bottom of the page.


🧩 Overview

| Step | Description | Main Tools |
| --- | --- | --- |
| 1. Quality control | Read trimming and quality reports | fastp, fastqc |
| 2. Classification | Taxonomic identification of all samples | kraken2 |
| 3. Contamination check | Identify potential contaminants | custom contamination.nf |
| 4. Branching logic | Automatically routes samples to viral, bacterial, or unknown subworkflows | internal |
| 5. Viral workflow | Detect best reference (KMA) → consensus generation (bcftools/samtools) | KMA, bcftools, samtools |
| 6. Bacterial workflow | Assembly, annotation, AMR & virulence detection, serotyping | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, VFDB, MOB-suite |
| 7. Unknown samples | De novo assembly + species ID | SPAdes, RefSeq Masher |
| 8. Reports | Combined tables, consensus FASTAs, QC summaries | custom combiners |
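
Steps 2 and 4 depend on the Kraken2 call for each sample. The sketch below is an illustration only, not the pipeline's own code: it reads a standard six-column Kraken2 report (as written by `kraken2 --report`) and prints the species-level taxon with the most reads. The function name `top_species` is hypothetical, and the criteria PathogenFlow actually uses for routing are not described here.

```shell
# Illustrative helper (not part of PathogenFlow): print the species-level
# taxon with the most reads in a standard Kraken2 report.
# Report columns (tab-separated): percent, clade reads, direct reads,
# rank code, NCBI taxid, indented scientific name.
top_species() {
    awk -F'\t' '
        # keep the species row ("S") with the largest clade read count
        $4 == "S" && ($2 + 0) > max { max = $2 + 0; name = $6 }
        # strip the indentation Kraken2 puts in front of the name
        END { sub(/^ +/, "", name); print name }
    ' "$1"
}
```

Calling `top_species` on a report file prints a single line with the species name, or an empty line if the report has no species-level rows.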

🧫 Supported Pathogens

🧬 Viruses

| Virus | Steps included |
| --- | --- |
| SARS-CoV-2 (COVID-19) | KMA → best reference → bcftools consensus → depth plots |
| Enteroviruses | KMA → best reference → bcftools consensus → depth plots |
| RSV (Respiratory Syncytial Virus) | KMA → best reference → bcftools consensus → depth plots |
| Measles virus (N450) | KMA → best reference → bcftools consensus → depth plots |
| Influenza virus (segmented) | Segment-wise KMA → per-segment consensus FASTAs |

🧫 Bacteria

| Species | Modules included |
| --- | --- |
| Salmonella enterica | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, SISTR, VFDB, MOB-suite |
| Listeria monocytogenes | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, LisSero, VFDB, MOB-suite |
| Neisseria meningitidis | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Meningotype, VFDB, MOB-suite |
| Legionella pneumophila | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Legsta, ElGato |
| Klebsiella pneumoniae | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Kleborate, VFDB |
| Escherichia coli | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, STECFinder, ECTyper, VFDB |
| Enterococcus faecalis / faecium | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, VFDB |
| Staphylococcus aureus | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, AgrVate, spaTyper, Salty |
| Bordetella pertussis | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Antigenic typing |
| Campylobacter jejuni / coli | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, RefSeq Masher |
| β-hemolytic Streptococcus | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, EmmTyper |
| Generic bacteria | SPAdes + QUAST + Prokka annotation |

❓ Unknown / unclassified samples

Samples with no clear classification:

  • Assembled de novo using SPAdes
  • Screened against RefSeq using RefSeq Masher
  • Results saved in results/unknown/

🧠 Workflow Diagram

PathogenFlow Workflow Diagram


🧬 Optional: cgMLST / wgMLST Module

PathogenFlow includes a standalone cgMLST/wgMLST workflow using pyMLST and GrapeTree for allelic distance clustering and tree visualization.

🔧 Usage

nextflow run cgmlst_main.nf \
  --species "Salmonella enterica" \
  --input ./assemblies \
  --output ./cgmlst_out

## Install & usage ##

## 1-INSTALL CONDA & NEXTFLOW ##
## IF YOU ALREADY HAVE NEXTFLOW AND CONDA INSTALLED, SKIP TO THE "clone repository" PART ##

wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
source ~/anaconda3/bin/activate
conda init
#log out of the server and log back in so the conda init changes take effect
source ~/.bashrc
conda --version


## Install Java & Nextflow ##
sudo apt install openjdk-17-jre-headless
java --version

curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p ~/.local/bin/
mv nextflow ~/.local/bin/
#log out and log back in so that ~/.local/bin is picked up on your PATH
exit
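
After logging back in, it can help to confirm that everything installed so far is reachable before cloning the repository. The following is a minimal sketch, not part of PathogenFlow; the function name `require_tools` is hypothetical, and the tool list is simply the set of commands this README relies on.

```shell
# Illustrative pre-flight check (not part of PathogenFlow): report every
# given command that cannot be found on PATH; return non-zero if any is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this setup the call would be `require_tools java conda mamba nextflow`. If `nextflow` is reported missing, `~/.local/bin` is probably not on your `PATH` yet.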


## clone repository ##
git clone https://github.com/Gagi1993/pathogenflow.git
cd pathogenflow

## DATABASE CREATION ##
#extract db
cd db/
tar -xJf bakt.tar.xz
tar -xJf bpert.tar.xz
tar -xJf campy.tar.xz
tar -xJf kma_db.tar.xz
#go back to the repository root before running main.nf
cd ..

#amrfinder db
nextflow run main.nf -entry on_demand_db

#rgi db (takes about 15 minutes)
nextflow run main.nf -entry rgi_setup

#abricate vfdb setup
nextflow run main.nf -entry on_demand_vfdb

#bwa index
nextflow run main.nf -entry bwa_index_db

#kma db for viruses
nextflow run main.nf -entry kma_index_db

#KRAKEN2 DB (the download takes some time)
cd db/

curl -O ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz
tar -xvzf minikraken2_v2_8GB_201904.tgz
#go back to the repository root
cd ..
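
Because the Kraken2 download is large, an interrupted transfer can leave an incomplete database. A usable Kraken2 database directory contains `hash.k2d`, `opts.k2d` and `taxo.k2d`, so a quick check along the following lines may save a failed run. This is a sketch, not part of PathogenFlow, and the function name `check_kraken_db` is hypothetical.

```shell
# Illustrative check (not part of PathogenFlow): a usable Kraken2 database
# directory contains non-empty hash.k2d, opts.k2d and taxo.k2d files.
check_kraken_db() {
    db_dir="$1"
    status=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Pass it the directory that `tar` created inside `db/`. A non-zero exit status means the archive should be downloaded and extracted again.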


### RUN THE PIPELINE ###

##place your fastq files in the fastq/ folder (or use whichever folder already holds them)
##if your fastq files come straight from an Illumina instrument, this step concatenates the lanes and renames the files (MiniSeq files are only renamed)
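cd fastq/
for i in *_L001_R1_001.fastq.gz; do
    #sample name = everything before the _S<number> sample index
    base="${i%_S[0-9]*_L001_R1_001.fastq.gz}"
    #merge all lanes of this sample only (the _S anchor avoids matching e.g. sample10 for sample1)
    cat "${base}"_S[0-9]*_R1_001.fastq.gz > "${base}_R1.fastq.gz"
    cat "${base}"_S[0-9]*_R2_001.fastq.gz > "${base}_R2.fastq.gz"
done
#go back to the repository root
cd ..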
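
If an R2 file is missing, `cat` in the loop above still creates an empty output file, which is easy to overlook. The sketch below, which is not part of PathogenFlow and uses the hypothetical function name `check_pairs`, verifies that every merged `<sample>_R1.fastq.gz` has a non-empty `<sample>_R2.fastq.gz` next to it.

```shell
# Illustrative check (not part of PathogenFlow): every merged
# <sample>_R1.fastq.gz in a directory needs a non-empty <sample>_R2.fastq.gz.
check_pairs() {
    dir="$1"
    status=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -s "$r2" ]; then
            echo "no usable R2 for: $r1" >&2
            status=1
        fi
    done
    return "$status"
}
```

Pass it the folder that holds the merged files. It returns a non-zero status and lists the affected samples when a mate is missing or empty.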


#generate sample sheet (-i is the folder containing your fastq files)
python python/generate_sample_sheet_illumina.py -i fastq/ -o samplesheet.csv

#run
nextflow run main.nf

#run with full reports
nextflow run main.nf -with-dag flowchart.png -with-report report.html -with-timeline timeline.html -with-trace trace.txt

#generate HTML and XLSX reports after the run
bash generate_reports.sh


## To-Do / Development Roadmap ##

| Area              | Planned Feature                                                                                         |
| ----------------- | ------------------------------------------------------------------------------------------------------- |
| **Viruses**       | Add primer trimming from BAM or raw FASTQs (via `ivar` or `cutadapt`); add Nextclade                    |
| **Bacteria**      | Integrate pangenome analysis (e.g., Roary or Panaroo) after SPAdes                                      |
| **All**           | Improve contamination detection logic; add an overall log within the script                             |
| **Databases**     | Add automatic version checking for AMRFinder, VFDB, RGI                                                 |
| **Extensibility** | Seamless addition of new bacteria or viruses — just update `params.species_map` and add reference FASTA |







