🧬 PathogenFlow

Unified Nextflow pipeline for viral and bacterial pathogen genomics

⚙️ Conda/Mamba-based pipeline: Mamba is mandatory for environment setup and dependency handling. Tested on Nextflow versions 24 and 25.

PathogenFlow is an all-in-one Nextflow DSL2 pipeline for Unix-based bioinformatics analysis of viral and bacterial pathogens from Illumina paired-end sequencing data.
It performs quality control, taxonomic classification, assembly, consensus generation, typing, AMR detection, and functional annotation, all automatically.

The workflow supports multiple pathogens out of the box and can easily be extended to new ones. Installation and run instructions are at the bottom of the page.

🧩 Overview

| Step | Description | Main Tools |
| --- | --- | --- |
| 1. Quality control | Read trimming and quality reports | `fastp`, `fastqc` |
| 2. Classification | Taxonomic identification of all samples | `kraken2` |
| 3. Contamination check | Identify potential contaminants | custom `contamination.nf` |
| 4. Branching logic | Automatically routes samples to viral, bacterial, or unknown subworkflows | internal |
| 5. Viral workflow | Detect best reference (KMA) → consensus generation (bcftools/samtools) | `KMA`, `bcftools`, `samtools` |
| 6. Bacterial workflow | Assembly, annotation, AMR & virulence detection, serotyping | `SPAdes`, `QUAST`, `Prokka`, `AMRFinder`, `RGI`, `MLST`, `VFDB`, `MOB-suite` |
| 7. Unknown samples | De novo assembly + species ID | `SPAdes`, `RefSeq Masher` |
| 8. Reports | Combined tables, consensus FASTAs, QC summaries | custom combiners |

🧫 Supported Pathogens

🧬 Viruses

| Virus | Steps included |
| --- | --- |
| SARS-CoV-2 (COVID-19) | KMA → best reference → bcftools consensus → depth plots |
| Enteroviruses | KMA → best reference → bcftools consensus → depth plots |
| RSV (Respiratory Syncytial Virus) | KMA → best reference → bcftools consensus → depth plots |
| Measles virus (N450) | KMA → best reference → bcftools consensus → depth plots |
| Influenza virus (segmented) | Segment-wise KMA → per-segment consensus FASTAs |

🧫 Bacteria

| Species | Modules included |
| --- | --- |
| Salmonella enterica | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, SISTR, VFDB, MOB-suite |
| Listeria monocytogenes | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, LisSero, VFDB, MOB-suite |
| Neisseria meningitidis | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Meningotype, VFDB, MOB-suite |
| Legionella pneumophila | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Legsta, ElGato |
| Klebsiella pneumoniae | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Kleborate, VFDB |
| Escherichia coli | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, STECFinder, ECTyper, VFDB |
| Enterococcus faecalis / faecium | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, VFDB |
| Staphylococcus aureus | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, AgrVate, spaTyper, Salty |
| Bordetella pertussis | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Antigenic typing |
| Campylobacter jejuni / coli | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, RefSeq Masher |
| β-hemolytic Streptococcus | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, EmmTyper |
| Generic bacteria | SPAdes + QUAST + Prokka annotation |

❓ Unknown / unclassified samples

Samples with no clear classification are:

- Assembled de novo using SPAdes
- Screened against RefSeq using RefSeq Masher
- Saved, with their results, in `results/unknown/`

🧠 Workflow Diagram

🧬 Optional: cgMLST / wgMLST Module

PathogenFlow includes a standalone cgMLST/wgMLST workflow using pyMLST and GrapeTree for allelic distance clustering and tree visualization.

🔧 Usage

nextflow run cgmlst_main.nf \
  --species "Salmonella enterica" \
  --input ./assemblies \
  --output ./cgmlst_out

## Install & usage ##

## 1-INSTALL CONDA & NEXTFLOW ##
## IF NEXTFLOW AND CONDA ARE ALREADY INSTALLED, SKIP TO THE "clone repository" STEP ##

wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
source ~/anaconda3/bin/activate
conda init
#log out of the server and log back in so that the conda init changes take effect
source ~/.bashrc
conda --version


## Install Java & Nextflow ##
sudo apt install openjdk-17-jre-headless
java --version

curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p ~/.local/bin/
mv nextflow ~/.local/bin/
#log out and back in so that ~/.local/bin can be picked up on your PATH
exit
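
After logging back in, it can be worth confirming that everything the pipeline relies on is reachable from the shell. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and the tool list is taken from the requirements stated above (Conda, Mamba, Java, Nextflow).

```shell
#!/usr/bin/env bash
# Print every tool from the argument list that is not found on PATH.
# Prints nothing when all tools are available.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Check the prerequisites named in this README.
missing=$(missing_tools conda mamba java nextflow)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing from PATH:" $missing
else
    echo "All prerequisites found"
fi
```

If `mamba` is reported as missing, install it before running the pipeline, since environment setup depends on it.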


## clone repository ##
git clone https://github.com/Gagi1993/pathogenflow.git
cd pathogenflow

## DATABASE CREATION ##
#extract db
cd db/
tar -xJf bakt.tar.xz
tar -xJf bpert.tar.xz
tar -xJf campy.tar.xz
tar -xJf kma_db.tar.xz
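
The four `tar` calls above can also be written as a loop, which keeps working if more archives are added to `db/` later. This is a sketch under that assumption; `extract_archives` is a hypothetical helper, not something shipped with the pipeline.

```shell
#!/usr/bin/env bash
# Extract every .tar.xz archive in the given directory (default: current directory)
# into that same directory, and report how many archives were unpacked.
extract_archives() {
    dir="${1:-.}"
    count=0
    for archive in "$dir"/*.tar.xz; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -e "$archive" ] || continue
        tar -xJf "$archive" -C "$dir" || return 1
        count=$((count + 1))
    done
    echo "extracted $count archive(s)"
}
```

Called as `extract_archives .` from inside `db/`, it does the same work as the individual commands.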

#return to the repository root before running the setup workflows
cd ..

#amrfinder db
nextflow run main.nf -entry on_demand_db

#rgi db (takes about 15 minutes)
nextflow run main.nf -entry rgi_setup

#abricate vfdb setup
nextflow run main.nf -entry on_demand_vfdb

#bwa index
nextflow run main.nf -entry bwa_index_db

#kma db for viruses
nextflow run main.nf -entry kma_index_db
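
The five setup workflows above can be chained so that a failure in one stops the rest. The sketch below uses only the entry names listed in this section; the `run_setup_entries` function and the `DRY_RUN` variable are hypothetical conveniences, not pipeline features.

```shell
#!/usr/bin/env bash
# Entry points from the database setup commands, in the same order.
SETUP_ENTRIES="on_demand_db rgi_setup on_demand_vfdb bwa_index_db kma_index_db"

# Run each setup workflow in turn and stop at the first failure.
# Set DRY_RUN=1 to print the commands instead of running them.
run_setup_entries() {
    for entry in $SETUP_ENTRIES; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "nextflow run main.nf -entry $entry"
        else
            nextflow run main.nf -entry "$entry" || {
                echo "setup step '$entry' failed" >&2
                return 1
            }
        fi
    done
}
```

Running `DRY_RUN=1 run_setup_entries` first shows exactly which commands would be executed; `run_setup_entries` on its own then runs them from the repository root.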

#Kraken2 db - the download takes some time
cd db/

curl -O ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz
tar -xvzf minikraken2_v2_8GB_201904.tgz
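
Because the archive is large, an interrupted download or extraction is easy to miss. A Kraken2 database consists of the files `hash.k2d`, `opts.k2d` and `taxo.k2d`, so a quick presence check is possible. The sketch below is a hypothetical helper (`check_kraken2_db`); the directory to pass is whatever folder the archive unpacked into, which is not assumed here.

```shell
#!/usr/bin/env bash
# Return 0 if the directory holds the three non-empty files of a Kraken2 database,
# otherwise list what is missing on stderr and return 1.
check_kraken2_db() {
    db_dir="$1"
    rc=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            rc=1
        fi
    done
    return $rc
}
```

For example, `check_kraken2_db ./<extracted-folder> && echo "Kraken2 db looks complete"`, with the placeholder replaced by the real folder name.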


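### RUN THE PIPELINE ###

#return to the repository root (the previous step ended inside db/)
cd ..

##place your fastq files in the fastq/ folder, or adjust the paths below to wherever your fastq files are
##if your fastq files come directly from an Illumina instrument, this step concatenates the lanes and renames the files (MiniSeq output is only renamed)
cd fastq/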
for i in *L001_R1_001.fastq.gz; do
    base="${i%%_S*}"
    cat "${base}"*"_R1_001.fastq.gz" > "${base}_R1.fastq.gz"
    cat "${base}"*"_R2_001.fastq.gz" > "${base}_R2.fastq.gz"
done
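
One caveat with the loop above: `${i%%_S*}` cuts the file name at the first `_S`, so a sample ID that itself contains `_S` (for example `My_Sample`) is shortened to `My`. The sketch below shows a stricter way to derive the sample name, assuming the standard Illumina pattern `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`; `illumina_sample_name` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Strip the Illumina suffix (_S<n>_L<lane>_R<1|2>_001.fastq.gz) from a file name,
# leaving the sample ID untouched even when it contains "_S".
illumina_sample_name() {
    printf '%s\n' "$1" | sed -E 's/_S[0-9]+_L[0-9]{3}_R[12]_001\.fastq\.gz$//'
}

f="My_Sample_S1_L001_R1_001.fastq.gz"
illumina_sample_name "$f"   # prints My_Sample
echo "${f%%_S*}"            # prints My
```

If none of your sample IDs contain `_S`, both approaches give the same result and the original loop can be used unchanged.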


#return to the repository root, then generate the sample sheet; the -i option is the folder containing your fastq files
cd ..
python python/generate_sample_sheet_illumina.py -i fastq/ -o samplesheet.csv
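
Since the pipeline expects paired-end data, it may help to confirm that every R1 file has a matching R2 file. The sketch below assumes the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` names produced by the concatenation step; `unpaired_r1` is a hypothetical helper and is not part of `generate_sample_sheet_illumina.py`.

```shell
#!/usr/bin/env bash
# Print each R1 file in the directory (default: fastq) that has no matching R2 file.
unpaired_r1() {
    dir="${1:-fastq}"
    for r1 in "$dir"/*_R1.fastq.gz; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        [ -e "$r2" ] || printf '%s\n' "$r1"
    done
}
```

An empty result from `unpaired_r1 fastq` means all samples are paired.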

#run
nextflow run main.nf

#run with full reports
nextflow run main.nf -with-dag flowchart.png -with-report report.html -with-timeline timeline.html -with-trace trace.txt
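
To keep the reports of different runs apart, the same four options can be pointed at a per-run folder. This is a sketch only: `report_flags` is a hypothetical helper that builds the option string, and the folder layout is an assumption rather than a pipeline convention. Note that rendering the DAG as a PNG requires Graphviz to be installed.

```shell
#!/usr/bin/env bash
# Print the Nextflow reporting options with all output files under one directory.
report_flags() {
    out="${1:-reports}"
    echo "-with-dag $out/flowchart.png -with-report $out/report.html -with-timeline $out/timeline.html -with-trace $out/trace.txt"
}
```

A possible use is `run_dir="reports/$(date +%Y%m%d_%H%M%S)"; mkdir -p "$run_dir"; nextflow run main.nf $(report_flags "$run_dir")`, which relies on the folder name containing no spaces.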

#generate HTML and XLSX reports after the run
bash generate_reports.sh


## To-Do / Development Roadmap ##

| Area              | Planned Feature                                                                                         |
| ----------------- | ------------------------------------------------------------------------------------------------------- |
| **Viruses**       | Add primer trimming from BAM or raw FASTQs (via `ivar` or `cutadapt`); add Nextclade                    |
| **Bacteria**      | Integrate pangenome analysis (e.g., Roary or Panaroo) after SPAdes                                      |
| **All**           | Improve contamination detection logic; add an overall log within the script                             |
| **Databases**     | Add automatic version checking for AMRFinder, VFDB, RGI                                                 |
| **Extensibility** | Seamless addition of new bacteria or viruses — just update `params.species_map` and add reference FASTA |
