Comparative Analysis of Plant Co-Expression Networks in R (RComPlEx) is a Nextflow pipeline that identifies conserved co-expressologs which are orthologous genes that maintain coordinated expression patterns across evolutionarily divergent species.
Unlike traditional comparative genomics approaches that focus on sequence conservation alone, RComPlEx discovers genes that are:
- Orthologous - Share common evolutionary ancestry
- Co-expressed - Show correlated expression patterns within each species
- Functionally conserved - Maintain similar regulatory relationships across species
- Module members - Part of complete functional networks (cliques), not just pairwise connections
RComPlEx finds multi-species gene cliques where all pairwise relationships are conserved:
- Traditional approach: "Gene A and Gene B are co-expressed in both species"
- RComPlEx cliques: "Genes A, B, C, and D are all mutually co-expressed across all species" (6 pairwise relationships for 4 genes, all significant)
This ensures discovery of complete functional modules with tight coordination, providing stronger evidence of conserved biological function than pairwise comparisons alone.
- METHOD.md - Detailed algorithm explanation, statistical methods, hypergeometric testing
- INPUT_FORMAT.md - Data requirements, file formats, and validation
- INSTALLATION.md - Setup instructions for local, HPC, and container environments
Do genes that are co-expressed in one species maintain those co-expression relationships in related species?
Conserved co-expression suggests:
- Shared regulatory mechanisms (transcription factors, enhancers)
- Functional coordination (pathway components, protein complexes)
- Evolutionary constraint (selection maintains co-regulation)
- Biological importance (essential processes preserved across species)
This implementation focuses on comparing annual vs. perennial grasses:
- Annual plants: Complete life cycle in one growing season (rapid growth, high reproduction)
- Perennial plants: Live multiple years (resource storage, stress tolerance, dormancy)
By stratifying cliques by life habit, RComPlEx identifies:
- Annual-specific modules: Genes coordinated uniquely in annual species
- Perennial-specific modules: Genes coordinated uniquely in perennial species
- Shared modules: Core processes conserved across all plants
Analyzes root and leaf tissues independently:
- Root: Nutrient uptake, water transport, stress response, storage
- Leaf: Photosynthesis, gas exchange, defense, transpiration
Tissue-specific cliques reveal organ-specialized regulatory networks.
- Tissue-specific analysis: Separate processing for root and leaf tissues
- Life habit stratification: Identifies annual-specific, perennial-specific, and shared cliques
- Multi-copy ortholog handling: Clique detection naturally handles many-to-many orthology
- Per-pair network computation: Networks computed on pair-specific gene sets, matching the original RComPlEx algorithm
- Centralized configuration: All parameters in a single YAML file
# Test with 3 pairs per tissue (15-30 min - recommended first!)
nextflow run main.nf -profile slurm --test_mode true
# Full pipeline
nextflow run main.nf -profile slurm
# Single tissue only
nextflow run main.nf -profile slurm --tissues root
# Resume from interruption
nextflow run main.nf -profile slurm -resume# Submit long-running pipeline via SLURM
sbatch slurm/run_nf-rcomplex.shDefault paths (edit nextflow.config to change):
- Working directory:
/mnt/users/martpali/AnnualPerennial/RComPlEx - Output directory:
/mnt/users/martpali/AnnualPerennial/RComPlEx/results - Nextflow work:
/mnt/users/martpali/AnnualPerennial/RComPlEx/work
Override at runtime:
nextflow run main.nf -profile slurm \
--workdir /custom/path \
--outdir /custom/results \
-w /custom/workEdit config/pipeline_config.yaml to customize:
- Species lists (annual vs perennial)
- Tissues to analyze
- RComPlEx parameters (correlation method, network density, p-value threshold)
Edit nextflow.config for:
- Resource allocation (CPUs, memory, time limits)
- Directory paths
- SLURM settings
Located in results/{tissue}/:
cliques.qs2- All cliques in fast qs2 format (for R analysis)cliques.csv- All cliques in CSV formatcliques_annual.csv- Annual-specific cliquescliques_perennial.csv- Perennial-specific cliquescliques_shared.csv- Mixed annual/perennial cliques
CliqueID: Unique identifier (HOG_CliqueNumber)HOG: Hierarchical ortholog groupCliqueSize: Number of genes in cliqueGenes: Comma-separated gene IDsSpecies: Species represented in cliqueLifeHabit: Annual, Perennial, or Mixedn_annual_species: Number of annual speciesn_perennial_species: Number of perennial speciesn_species: Total species countMean_pval: Average p-value across all gene pairsMedian_pval: Median p-valueMean_effect_size: Average effect size (enrichment)n_edges: Number of co-expressed gene pairs
RComPlEx/
├── main.nf # Nextflow workflow (6 processes)
├── nextflow.config # Nextflow execution config
├── config/
│ └── pipeline_config.yaml # Central analysis parameters
├── R/
│ ├── config_parser.R # Config utilities
│ └── orion_hpc_utils.R # Orion HPC path resolution
├── scripts/
│ ├── prepare_single_pair.R # Pair data preparation
│ ├── rcomplex_01_load_filter.R # Step 1: Load & filter data
│ ├── rcomplex_02_compute_networks.R # Step 2: Per-pair network computation
│ ├── rcomplex_03_network_comparison.R # Step 3: Network comparison
│ ├── rcomplex_04_summary_stats.R # Step 4: Summary statistics
│ ├── find_cliques_streaming.R # Streaming clique detection
│ └── validate_inputs.R # Input validation
├── slurm/
│ └── run_nf-rcomplex.sh # SLURM submission script
├── rcomplex_data/ # Working directories
│ ├── root/ # Root tissue analysis
│ └── leaf/ # Leaf tissue analysis
└── results/ # Final pipeline outputs
├── root/ # Root cliques & gene lists
└── leaf/ # Leaf cliques & gene lists
PREPARE_PAIR → RCOMPLEX_01_LOAD_FILTER → RCOMPLEX_02_COMPUTE_NETWORKS
→ RCOMPLEX_04_NETWORK_COMPARISON → RCOMPLEX_05_SUMMARY_STATS → FIND_CLIQUES_STREAMING
Purpose: Extract tissue-specific expression data for each species pair
- Loads variance-stabilized expression matrix (vst_hog.RDS)
- Filters to focal tissue (root or leaf)
- Extracts ortholog mappings from hierarchical ortholog groups (N1_clean.RDS)
- Creates species pair working directories
- Output: Expression files + ortholog table per pair
Purpose: Validate data and filter to analyzable gene sets
- Loads expression matrices for both species
- Filters orthologs to genes present in BOTH species' expression data
- Removes genes with missing values or low expression
- Output:
01_filtered_data.RData(clean, aligned ortholog pairs)
Purpose: Build gene co-expression networks for each species pair
Networks are computed on the pair-specific gene set (genes with orthologs in this specific pair), matching the original RComPlEx algorithm. This ensures MR normalization ranks and density thresholds are calibrated to the exact gene universe used in the comparison.
Method:
-
Correlation calculation:
- Spearman correlation (default) - robust to outliers, detects monotonic relationships
- Computes all pairwise gene correlations (gene × gene matrix)
-
Mutual Rank normalization:
- Ranks correlations bidirectionally: gene_i → gene_j AND gene_j → gene_i
- Combines ranks: MR = √(rank_ij × rank_ji)
- Why? Reduces spurious correlations, makes strengths comparable across species
-
Density thresholding:
- Keep top 3% of correlations (configurable)
- Creates sparse network of strongest co-expression
Output: 02_networks.qs2 (networks, thresholds, ortholog table in fast qs2 serialization)
Purpose: Test whether co-expression relationships are conserved between species
For each ortholog pair (geneA_sp1, geneB_sp2):
-
Extract co-expression neighborhoods:
- Sp1_neighbors: All genes significantly co-expressed with geneA in Species 1
- Sp2_neighbors: All genes significantly co-expressed with geneB in Species 2
-
Test conservation bidirectionally:
Direction 1 (Sp1 → Sp2):
- Question: "Do geneA's neighbors in Sp1 have orthologs that are geneB's neighbors in Sp2?"
- Count overlap of conserved co-expression relationships
- Hypergeometric test: P(overlap ≥ observed | random expectation)
Direction 2 (Sp2 → Sp1):
- Question: "Do geneB's neighbors in Sp2 have orthologs that are geneA's neighbors in Sp1?"
- Ensures bidirectional conservation (not just one-way)
- Use minimum p-value (most conservative estimate)
-
Multiple testing correction:
- Benjamini-Hochberg FDR correction
- Controls false discovery rate at 5% (adjustable)
Output: 03_comparison.RData (p-values, effect sizes, neighborhood statistics)
Purpose: Aggregate conservation statistics and generate diagnostic plots
- Computes per-gene conservation metrics (mean/median p-values)
- Generates diagnostic visualizations
- Output:
04_summary_statistics.tsv+ PNG plots
Purpose: Identify multi-species gene cliques where ALL pairwise co-expression is conserved
Algorithm (Two-Pass Streaming):
- Pass 1: Index HOG → files (lightweight scan of all comparison files)
- Pass 2: Process per-HOG (load only relevant data for each HOG)
For each Hierarchical Ortholog Group (HOG):
a. Build graph:
- Nodes = genes from all species in HOG
- Edges = conserved co-expression relationships (p < 0.05)
b. Find maximal cliques using igraph::max_cliques():
- Clique = complete subgraph where all nodes are connected
- Maximal = cannot add more genes without losing completeness
c. Annotate cliques:
- Species composition
- Life habit classification (Annual/Perennial/Mixed)
- Statistical properties (mean p-values, effect sizes)
Memory Efficiency: Only one HOG's data in memory at a time.
Output:
cliques.qs2,cliques.csv(all cliques)cliques_annual.csv,cliques_perennial.csv,cliques_shared.csv
Statistical Rigor:
- Hypergeometric test models drawing without replacement (appropriate for networks)
- Bidirectional testing ensures mutual conservation
- FDR correction controls false discoveries in large-scale testing
Biological Relevance:
- Cliques represent tightly coordinated gene modules
- Conservation across species indicates functional importance
- Life habit stratification reveals adaptation-specific networks
- Tissue specificity captures organ-specialized regulation
Computational Efficiency:
- Parallelization across species pairs and ortholog pairs
- Sparse networks reduce memory and computation
- Nextflow enables resumable, scalable execution
-
vst_hog.RDS (253 MB): Variance-stabilized expression data
- Columns: species, tissue, GeneID, sample_id, vst.count, life_cycle, HOG
- Tissues: "root" and "leaf"
- 13 grass species (5 annual, 8 perennial)
-
N1_clean.RDS (6 MB): Hierarchical ortholog groups
- Columns: HOG, OG, species, GeneID, life_cycle, is_core
- Format: One row per gene-HOG association
Annual (4 species):
- Brachypodium distachyon
- Hordeum vulgare
- Vulpia bromoides
- Briza maxima
Perennial (4 species):
- Brachypodium sylvaticum
- Hordeum jubatum
- Briza media
- Festuca pratensis
- Test mode: 15-30 minutes (3 pairs per tissue)
- Full pipeline: 2-6 hours
- Parallel execution: Up to 30 jobs simultaneously (label-dependent)
- Resources per job: Adaptive (8 GB → 800 GB depending on step)
| Process | CPUs | Memory | Time | MaxForks |
|---|---|---|---|---|
| PREPARE_PAIR | 2 | 8 GB | 15m | 20 |
| RCOMPLEX_01_LOAD_FILTER | 2 | 8 GB | 15m | 20 |
| RCOMPLEX_02_COMPUTE_NETWORKS | 24→36 | 800 GB | 144h | 10 |
| RCOMPLEX_04_NETWORK_COMPARISON | 12→24 | 200→600 GB | 72h | 10 |
| RCOMPLEX_05_SUMMARY_STATS | 4 | 16 GB | 24h | 10 |
| FIND_CLIQUES_STREAMING | 4 | 64→256 GB | 24h | 4 |
- Intermediate files: 10-50 GB
- Final results: 1-5 GB
- Nextflow work: 50-200 GB (can be cleaned after successful run)
# Watch SLURM queue
watch -n 10 'squeue -u $USER'
# View logs
tail -f .nextflow.log
# Check disk space
df -h /mnt/users/martpali/AnnualPerennial/No cliques found: Check p-value threshold in config/pipeline_config.yaml (default: 0.05)
Memory errors: Pipeline automatically retries with increased memory (200 GB → 400 GB)
Jobs stuck in queue: Check SLURM partition limits with scontrol show partition orion
If you use this pipeline, please cite:
- ComPlEx method: Netotea et al. (2014) BMC Genomics. doi:10.1186/1471-2164-15-106
- R implementation: Torgeir R. Hvidsten
- Nextflow pipeline & parallelization: Martin Paliocha
- Pipeline repository: paliocha/RComPlEx-NF