OncoLearn

OncoLearn is a toolkit for cancer genomics analysis and biomarker discovery using RNA-seq data from The Cancer Genome Atlas (TCGA). It applies machine learning and statistical methods to subtype cancers and to identify potential diagnostic and prognostic markers.

Contributors

  • Aryan Sharan Guda (aryanshg@andrew.cmu.edu)
  • Seungjin Han (seungjih@andrew.cmu.edu)
  • Seohyun Lee (seohyun4@andrew.cmu.edu)
  • Yosen Lin (yosenl@andrew.cmu.edu)
  • Isha Parikh (parikh.i@northeastern.edu)
  • Diya Patidar (dpatidar@andrew.cmu.edu)
  • Arunannamalai Sujatha Bharath Raj (asujatha@andrew.cmu.edu)
  • Andrew Scouten (yzb2@txstate.edu)
  • Jeffrey Wang (jdw2@andrew.cmu.edu)
  • Qiyu (Charlie) Yang (qiyuy@andrew.cmu.edu)
  • Xinru Zhang (mayzxr2203@gmail.com)
  • River Zhu (riverz@andrew.cmu.edu)
  • Zhaoyi (Zoey) You (zhaoyiyou.zoey@gmail.com)
  • Heena Dalal (dalalhina@gmail.com / heena.dalal@kcl.ac.uk)

Quickstart

  1. Install Docker Desktop from docker.com

  2. Clone and set up:

    git clone https://github.com/collaborativebioinformatics/OncoLearn.git
    cd OncoLearn
    git submodule update --init --recursive
    # For NVIDIA GPUs:
    docker compose --profile nvidia up -d
    # For AMD GPUs (native Linux):
    docker compose --profile amd up -d
    # For AMD GPUs (WSL2):
    docker compose --profile amd-wsl up -d
  3. Download sample data:

    # Download genomics data from Xena Browser
    docker compose exec dev oncolearn download --xena --cohorts BRCA
    
    # Download imaging data from TCIA (manifest only)
    docker compose exec dev oncolearn download --tcia --cohorts BRCA
    
    # Download imaging data from TCIA (manifest + images)
    docker compose exec dev oncolearn download --tcia --cohorts BRCA --download-images
  4. Start exploring with the Jupyter notebooks in notebooks/data/

For detailed setup options and local installation, see Getting Started.
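Before opening a notebook, it can help to glance at a downloaded file. The sketch below reads the first few rows of a gzipped file without extracting it. It assumes the Xena downloads are gzipped, tab-separated text; the function name `peek_tsv_gz` is illustrative and not part of OncoLearn.

```python
import csv
import gzip
from itertools import islice
from pathlib import Path


def peek_tsv_gz(path, n_rows=5):
    """Return the first n_rows rows of a gzipped, tab-separated file.

    Assumption: the file is tab-separated text compressed with gzip.
    """
    # Text mode ("rt") decompresses on the fly, so no extraction step is needed.
    with gzip.open(Path(path), mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return list(islice(reader, n_rows))
```

Call it with the path of any downloaded `.gz` file under your data directory, for example `peek_tsv_gz(some_path, n_rows=3)`; the exact file names depend on the cohort and are not assumed here.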

Getting Started

Prerequisites

This project supports two installation methods:

Option A: Docker (Recommended)

  • Docker Desktop or Docker Engine
  • Docker Compose
  • VSCode with Dev Containers extension (optional but recommended)

Option B: Local Installation

  • Python 3.10+
  • R 4.0+
  • uv - Fast Python package installer and resolver

Option A: Docker Installation (Recommended)

Docker provides a consistent development environment and eliminates dependency and compatibility issues.

  1. Install Docker Desktop:

    • Download from docker.com
    • Or install Docker Engine on Linux
  2. Clone the repository:

    git clone https://github.com/collaborativebioinformatics/OncoLearn.git
    cd OncoLearn
    git submodule update --init --recursive
  3. Start the environment:

    # For NVIDIA GPUs:
    docker compose --profile nvidia up -d
    
    # For AMD GPUs (native Linux):
    docker compose --profile amd up -d
    
    # For AMD GPUs (WSL2 on Windows):
    docker compose --profile amd-wsl up -d

    Note: The Docker setup includes GPU support for both NVIDIA and AMD GPUs. Choose the appropriate profile based on your hardware:

    • nvidia: For NVIDIA GPUs
    • amd: For AMD GPUs on native Linux
    • amd-wsl: For AMD GPUs on Windows Subsystem for Linux 2 (WSL2)
  4. Open in VSCode Dev Container (optional):

    • Install the Dev Containers extension
    • Press F1 → "Dev Containers: Reopen in Container"
    • VSCode will connect to the container with all extensions and tools configured
    • Jupyter notebooks (.ipynb files) will work natively in VSCode without a browser
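The profile choice in step 3 can be sketched as a small helper. The profile names come from this README. The WSL check is a common heuristic (WSL kernels usually include "microsoft" in their release string) and is an assumption, not something OncoLearn ships.

```python
import platform


def is_wsl(kernel_release=None):
    """Heuristic WSL check: look for 'microsoft' in the kernel release string."""
    release = platform.uname().release if kernel_release is None else kernel_release
    return "microsoft" in release.lower()


def compose_profile(vendor, kernel_release=None):
    """Map a GPU vendor ('nvidia' or 'amd') to a Docker Compose profile name."""
    vendor = vendor.lower()
    if vendor == "nvidia":
        return "nvidia"
    if vendor == "amd":
        # AMD has two profiles: one for WSL2 and one for native Linux.
        return "amd-wsl" if is_wsl(kernel_release) else "amd"
    raise ValueError(f"unsupported GPU vendor: {vendor!r}")


def compose_up_command(profile):
    """Return the argv list that starts the environment for a profile."""
    return ["docker", "compose", "--profile", profile, "up", "-d"]
```

Passing the result of `compose_up_command(compose_profile("amd"))` to `subprocess.run(..., check=True)` would start the matching container.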

Useful Docker Commands:

# Stop containers
docker compose down

# Rebuild after dependency changes (use your GPU profile)
docker compose --profile nvidia build  # or --profile amd or --profile amd-wsl
docker compose --profile nvidia up -d  # or --profile amd or --profile amd-wsl

# Execute commands in container
docker compose exec dev bash  # NVIDIA
docker compose exec dev-amd bash  # AMD (native Linux)
docker compose exec dev-amd-wsl bash  # AMD (WSL2)

# Add new Python packages
docker compose exec dev uv add <package-name>  # NVIDIA
docker compose exec dev-amd uv add <package-name>  # AMD (native Linux)
docker compose exec dev-amd-wsl uv add <package-name>  # AMD (WSL2)

# View running containers
docker compose ps

Option B: Local Installation

  1. Install uv (if not already installed) by following its official installation instructions.

  2. Clone the repository:

    git clone https://github.com/collaborativebioinformatics/OncoLearn.git
    cd OncoLearn
    git submodule update --init --recursive
  3. Install Python dependencies:

    # Install base dependencies
    uv sync
    
    # Or install with PyTorch extras (choose one based on your hardware):
    uv sync --extra cpu          # CPU-only version
    uv sync --extra cu128        # CUDA 12.8
    uv sync --extra cu130        # CUDA 13.0
    uv sync --extra rocm         # AMD ROCm
  4. Install R dependencies with renv:

    # Install renv if not already installed
    install.packages("renv")
    
    # Restore R package dependencies
    renv::restore()
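After step 3, a quick check can confirm which PyTorch build the chosen extra installed. This is a minimal sketch using standard PyTorch attributes (`torch.version.cuda`, `torch.version.hip`, `torch.cuda.is_available()`); the function name is illustrative, and the check reports cleanly if PyTorch is not installed.

```python
def describe_torch_backend():
    """Return a one-line summary of the installed PyTorch build, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    cuda = torch.version.cuda                  # version string on CUDA builds, else None
    hip = getattr(torch.version, "hip", None)  # version string on ROCm builds, else None
    if hip:
        build = f"ROCm/HIP {hip}"
    elif cuda:
        build = f"CUDA {cuda}"
    else:
        build = "CPU-only"
    # torch.cuda.is_available() also reports ROCm devices on ROCm builds.
    return (
        f"torch {torch.__version__}, build: {build}, "
        f"GPU visible: {torch.cuda.is_available()}"
    )


print(describe_torch_backend())
```

Run it with `uv run python` so that it uses the project environment.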

Recommended VSCode Extensions

For the best development experience, we recommend installing the following VSCode extensions:

  • Python (ms-python.python) - IntelliSense, debugging, and linting for Python
  • Ruff (charliermarsh.ruff) - Fast Python linter and formatter
  • autopep8 (ms-python.autopep8) - Python code formatter following PEP 8 style guide
  • R (REditorSupport.r) - R language support with syntax highlighting and code execution
  • Jupyter (ms-toolsai.jupyter) - Interactive Jupyter notebook support
  • Dev Containers (ms-vscode-remote.remote-containers) - For Docker development (if using Docker)

Data

OncoLearn provides a unified download script for acquiring cancer data from multiple sources:

Download from Xena Browser (Genomics Data)

# Download a single cohort (all data types)
oncolearn download --xena --cohorts BRCA

# Download and extract gzipped files
oncolearn download --xena --cohorts BRCA --unzip

# Download specific data category
oncolearn download --xena --cohorts BRCA --category mutation

# Download multiple cohorts
oncolearn download --xena --cohorts BRCA,LUAD,ACC

# Download all available cohorts
oncolearn download --xena --all

# List available cohorts
oncolearn download --xena --list

Available categories: clinical, mutation, cnv, mrna, mirna, protein, methylation

Note: By default, gzipped files are NOT automatically extracted. Use --unzip to extract them after download.
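If files were downloaded without --unzip, they can also be extracted afterwards. The sketch below walks a directory and decompresses every `.gz` file next to its source. The helper name `gunzip_tree` and the directory you pass in are assumptions; this is not how --unzip itself is implemented.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_tree(root, keep_original=True):
    """Decompress every *.gz file under root; return the paths that were written."""
    extracted = []
    # sorted() materializes the listing before any new files are created.
    for gz_path in sorted(Path(root).rglob("*.gz")):
        target = gz_path.with_suffix("")  # "name.tsv.gz" -> "name.tsv"
        if target.exists():
            continue  # already extracted on an earlier run
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams, so large files are fine
        if not keep_original:
            gz_path.unlink()
        extracted.append(target)
    return extracted
```

Keeping the originals by default makes the helper safe to rerun: files that already have an extracted copy are skipped.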

Download from TCIA (Imaging Data)

# Download manifest file only
oncolearn download --tcia --cohorts BRCA

# Download manifest and images (requires nbia-data-retriever)
oncolearn download --tcia --cohorts BRCA --download-images

# Download multiple cohorts with images
oncolearn download --tcia --cohorts BRCA,LUAD --download-images

# List available cohorts
oncolearn download --tcia --list

Note: The --download-images flag requires the nbia-data-retriever tool to be installed.
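A script that drives TCIA downloads can check for the retriever up front rather than fail partway through. This sketch builds the command line from the flags shown above and refuses --download-images when the tool is not found. It assumes the executable is on PATH under the name `nbia-data-retriever`; how OncoLearn itself locates the tool is not specified here.

```python
import shutil


def tcia_download_args(cohorts, download_images=False,
                       retriever="nbia-data-retriever"):
    """Build the argv for a TCIA download.

    Assumption: the retriever is discoverable on PATH as `retriever`.
    """
    if not cohorts:
        raise ValueError("at least one cohort is required")
    args = ["oncolearn", "download", "--tcia", "--cohorts", ",".join(cohorts)]
    if download_images:
        if shutil.which(retriever) is None:
            raise RuntimeError(
                f"{retriever} not found on PATH; install it or omit --download-images"
            )
        args.append("--download-images")
    return args
```

Without `download_images=True`, the helper produces the manifest-only command and performs no check.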

Docker Usage

When using Docker, prefix each command with docker compose exec and the service name for your GPU profile (dev, dev-amd, or dev-amd-wsl):

# NVIDIA GPU container
docker compose exec dev oncolearn download --xena --cohorts BRCA

# AMD GPU container
docker compose exec dev-amd oncolearn download --tcia --cohorts BRCA --download-images
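The prefixing rule can be sketched as a helper that wraps any command for the right container. The profile-to-service mapping comes from the commands in this README; the names `SERVICES` and `in_container` are illustrative.

```python
# Compose profile -> service name, as used elsewhere in this README.
SERVICES = {
    "nvidia": "dev",
    "amd": "dev-amd",
    "amd-wsl": "dev-amd-wsl",
}


def in_container(profile, command):
    """Prefix a command (list of strings) so it runs inside the profile's container."""
    try:
        service = SERVICES[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None
    return ["docker", "compose", "exec", service, *command]
```

For example, `in_container("amd", ["oncolearn", "download", "--tcia", "--list"])` yields an argv list that can be passed to `subprocess.run`.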

Installation as a CLI Tool

After installation, oncolearn is available as a command-line tool:

# Using uv (development)
uv run oncolearn download --xena --cohorts BRCA

# After installing the package with pip (editable install)
pip install -e .
oncolearn download --xena --cohorts BRCA
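Because oncolearn is an ordinary command-line tool once installed, a Python script can drive it through `subprocess`. The sketch below builds a Xena download command from the flags and categories documented above and validates the category first. The helper names are illustrative, and combining --category with --unzip in one call is an assumption.

```python
import subprocess

# Categories listed in the Data section of this README.
CATEGORIES = {"clinical", "mutation", "cnv", "mrna", "mirna", "protein", "methylation"}


def xena_download_args(cohorts, category=None, unzip=False):
    """Build the argv for `oncolearn download --xena`."""
    if not cohorts:
        raise ValueError("at least one cohort is required")
    if category is not None and category not in CATEGORIES:
        raise ValueError(f"unknown category {category!r}; expected one of {sorted(CATEGORIES)}")
    args = ["oncolearn", "download", "--xena", "--cohorts", ",".join(cohorts)]
    if category is not None:
        args += ["--category", category]  # assumption: may be combined with --unzip
    if unzip:
        args.append("--unzip")
    return args


def run(args, dry_run=True):
    """Print the command, or execute it when dry_run is False."""
    if dry_run:
        print(" ".join(args))
        return None
    return subprocess.run(args, check=True)
```

Defaulting to a dry run lets you inspect the command before any large download starts.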

Documentation

Comprehensive guides and documentation are available in the docs/ folder.

Project Structure

  • data/ - Data storage directory (downloaded TCGA datasets)
  • docs/ - Project documentation and guides
  • notebooks/ - Jupyter notebooks for data exploration and analysis
  • scripts/ - Data download and preprocessing scripts
    • download.py - Unified CLI for downloading data from Xena Browser and TCIA
    • data/download_xena.py - Xena Browser download utilities
    • data/download_tcia.py - TCIA download utilities
  • src/oncolearn/ - Core Python package for cancer genomics analysis
  • src/multimodal/ - Multimodal learning framework for integrating multi-omic data
  • configs/ - Configuration files for training and testing

Additional Resources

For more information on downloading and working with TCGA data, see the TCGA Data Download Guide.


License

This project is licensed under the MIT License - see the LICENSE file for details.

AI Disclosure

Artificial intelligence tools, including large language models (LLMs), were used during the development of this project to support writing, clarify technical concepts, and assist in generating code snippets. These tools served as an aid for idea refinement, debugging, and improving the readability of explanations and documentation. All AI-generated text and code were thoroughly reviewed, verified for correctness, and understood in full before being incorporated into this work. The responsibility for all final decisions, interpretations, and implementations remains solely with the contributors.
