⚙️ Conda/Mamba-based pipeline: Mamba is mandatory for environment setup and dependency handling. Tested on Nextflow versions 24 and 25.
PathogenFlow is an all-in-one Nextflow DSL2 pipeline for Unix-based bioinformatics analysis of viral and bacterial pathogens from Illumina paired-end sequencing data.
It performs quality control, taxonomic classification, assembly, consensus generation, typing, AMR detection, and functional annotation — all automatically.
The workflow supports multiple pathogens out of the box and can easily be extended to new ones. Instructions for running it are at the bottom of the page.
| Step | Description | Main Tools |
|---|---|---|
| 1. Quality control | Read trimming and quality reports | fastp, fastqc |
| 2. Classification | Taxonomic identification of all samples | kraken2 |
| 3. Contamination check | Identify potential contaminants | custom contamination.nf |
| 4. Branching logic | Automatically routes samples to viral, bacterial, or unknown subworkflows | internal |
| 5. Viral workflow | Detect best reference (KMA) → consensus generation (bcftools/samtools) | KMA, bcftools, samtools |
| 6. Bacterial workflow | Assembly, annotation, AMR & virulence detection, serotyping | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, VFDB, MOB-suite |
| 7. Unknown samples | De novo assembly + species ID | SPAdes, RefSeq Masher |
| 8. Reports | Combined tables, consensus FASTAs, QC summaries | custom combiners |
| Virus | Steps included |
|---|---|
| SARS-CoV-2 (COVID-19) | KMA → best reference → bcftools consensus → depth plots |
| Enteroviruses | KMA → best reference → bcftools consensus → depth plots |
| RSV (Respiratory Syncytial Virus) | KMA → best reference → bcftools consensus → depth plots |
| Measles virus (N450) | KMA → best reference → bcftools consensus → depth plots |
| Influenza virus (segmented) | Segment-wise KMA → per-segment consensus FASTAs |
| Species | Modules included |
|---|---|
| Salmonella enterica | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, SISTR, VFDB, MOB-suite |
| Listeria monocytogenes | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, LisSero, VFDB, MOB-suite |
| Neisseria meningitidis | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Meningotype, VFDB, MOB-suite |
| Legionella pneumophila | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Legsta, ElGato |
| Klebsiella pneumoniae | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Kleborate, VFDB |
| Escherichia coli | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, STECFinder, ECTyper, VFDB |
| Enterococcus faecalis / faecium | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, VFDB |
| Staphylococcus aureus | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, AgrVate, spaTyper, Salty |
| Bordetella pertussis | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, Antigenic typing |
| Campylobacter jejuni / coli | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, RefSeq Masher |
| β-hemolytic Streptococcus | SPAdes, QUAST, Prokka, AMRFinder, RGI, MLST, EmmTyper |
| Generic bacteria | SPAdes + QUAST + Prokka annotation |
Samples with no clear classification:
- Assembled de novo using SPAdes
- Screened against RefSeq using RefSeq Masher
- Results saved in `results/unknown/`
PathogenFlow includes a standalone cgMLST/wgMLST workflow using pyMLST and GrapeTree for allelic distance clustering and tree visualization.
nextflow run cgmlst_main.nf \
--species "Salmonella enterica" \
--input ./assemblies \
--output ./cgmlst_out
## Install & usage ##
## 1-INSTALL CONDA & NEXTFLOW ##
## IF YOU ALREADY HAVE NEXTFLOW AND CONDA INSTALLED, SKIP TO THE "clone repository" PART ##
wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
source ~/anaconda3/bin/activate
conda init
# log out of the server and log back in (exit, then reconnect) so that conda init takes effect
source ~/.bashrc
conda --version
## Install Java & Nextflow ##
sudo apt install openjdk-17-jre-headless
java --version
curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p ~/.local/bin/
mv nextflow ~/.local/bin/
exit
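After logging back in, it can help to confirm that the tools installed above are actually visible on `PATH` before cloning the repository. The following is a minimal sketch, not part of PathogenFlow: the helper name `missing_tools` is hypothetical, and the tool list is only what this README mentions (conda, mamba, java, nextflow).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with PathogenFlow):
# print the name of every command in "$@" that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Tools named in this README; adjust the list to your setup.
missing="$(missing_tools conda mamba java nextflow)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found on PATH:" $missing
else
    echo "All required tools found."
fi
```

If `nextflow` is reported as missing, check that `~/.local/bin` is on your `PATH` in the new login session.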
## clone repository ##
git clone https://github.com/Gagi1993/pathogenflow.git
cd pathogenflow
## DATABASE CREATION ##
#extract db
cd db/
tar -xJf bakt.tar.xz
tar -xJf bpert.tar.xz
tar -xJf campy.tar.xz
tar -xJf kma_db.tar.xz
#go back to the repository root, where main.nf is located
cd ..
#amrfinder db
nextflow run main.nf -entry on_demand_db
#rgi db (takes about 15 minutes)
nextflow run main.nf -entry rgi_setup
#abricate vfdb setup
nextflow run main.nf -entry on_demand_vfdb
#bwa index
nextflow run main.nf -entry bwa_index_db
#kma db for viruses
nextflow run main.nf -entry kma_index_db
#KRAKEN DB (the download takes some time)
cd db/
curl -O ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz
tar -xvzf minikraken2_v2_8GB_201904.tgz
#go back to the repository root
cd ..
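A Kraken2 database directory needs three files to be usable: `hash.k2d`, `opts.k2d` and `taxo.k2d`. A quick check after extraction can catch an interrupted download before the pipeline is started. This is a sketch only: the function name `check_kraken2_db` and the `KRAKEN2_DB_DIR` variable are hypothetical, and the directory the archive unpacks into is not stated in this README, so pass whichever folder `tar` created.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with PathogenFlow):
# return 0 if the directory holds the three Kraken2 database files,
# otherwise report each missing file and return 1.
check_kraken2_db() {
    local dir="$1" status=0 f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            status=1
        fi
    done
    return "$status"
}

# KRAKEN2_DB_DIR is an assumed variable; set it to the extracted folder.
if [ -n "${KRAKEN2_DB_DIR:-}" ]; then
    check_kraken2_db "$KRAKEN2_DB_DIR" || echo "Kraken2 database looks incomplete."
fi
```

Example call: `KRAKEN2_DB_DIR=db/<extracted_folder> bash check_db.sh`, where both the folder and the script name are placeholders.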
### RUN THE PIPE ###
##place your fastq files in the fastq/ folder, or use the folder where your fastq files already are
##if your fastq files come directly from an Illumina instrument, this step concatenates the per-lane files and renames them (MiniSeq files are only renamed)
cd fastq/
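for i in *L001_R1_001.fastq.gz; do
# sample name = everything before the first "_S"
base="${i%%_S*}"
# match "${base}_S" so that e.g. sample1 does not also pick up sample10
cat "${base}_S"*"_R1_001.fastq.gz" > "${base}_R1.fastq.gz"
cat "${base}_S"*"_R2_001.fastq.gz" > "${base}_R2.fastq.gz"
done
# go back to the repository root
cd ..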
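The loop above derives the sample name with `${i%%_S*}`, which removes the longest suffix starting with `_S`. That works for names like `sampleA_S1_L001_R1_001.fastq.gz`, but a sample name that itself contains `_S` is cut short. The sketch below shows the behaviour and one possible stricter alternative. It assumes the standard Illumina naming `<name>_S<n>_L<lane>_R<read>_001.fastq.gz`; the function names are hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Same expansion as in the loop above: drop the LONGEST suffix matching "_S*".
sample_base() {
    local name="$1"
    printf '%s\n' "${name%%_S*}"
}

# Stricter variant (an assumption, not the pipeline's method): drop only the
# SHORTEST suffix that looks like "_S<digit>..._L###_R1|2_001.fastq.gz".
sample_base_strict() {
    local name="$1"
    printf '%s\n' "${name%_S[0-9]*_L[0-9][0-9][0-9]_R[12]_001.fastq.gz}"
}

sample_base "sampleA_S1_L001_R1_001.fastq.gz"          # prints: sampleA
sample_base "my_Sample_S3_L001_R1_001.fastq.gz"        # prints: my
sample_base_strict "my_Sample_S3_L001_R1_001.fastq.gz" # prints: my_Sample
```

If your sample names never contain `_S`, the original expansion is sufficient.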
#generate sample sheet (-i is the folder containing your fastq files, -o is the output sample sheet)
python python/generate_sample_sheet_illumina.py -i fastq/ -o samplesheet.csv
#run
nextflow run main.nf
#run with full reports (rendering the DAG as PNG requires Graphviz)
nextflow run main.nf -with-dag flowchart.png -with-report report.html -with-timeline timeline.html -with-trace trace.txt
#generate HTML and XLSX reports after the run
bash generate_reports.sh
## To-Do / Development Roadmap
| Area | Planned Feature |
| ----------------- | ------------------------------------------------------------------------------------------------------- |
| **Viruses**       | Add primer trimming from BAM or raw FASTQs (via `ivar` or `cutadapt`); add Nextclade                   |
| **Bacteria** | Integrate pangenome analysis (e.g., Roary or Panaroo) after SPAdes |
| **All**           | Improve contamination detection logic; add an overall log within the script                            |
| **Databases** | Add automatic version checking for AMRFinder, VFDB, RGI |
| **Extensibility** | Seamless addition of new bacteria or viruses — just update `params.species_map` and add reference FASTA |

