A comprehensive toolkit for cancer genomics analysis and biomarker discovery using RNA-seq data from The Cancer Genome Atlas (TCGA). OncoLearn leverages machine learning and statistical methods for cancer subtyping and identifying potential diagnostic and prognostic markers.
Aryan Sharan Guda (aryanshg@andrew.cmu.edu), Seungjin Han (seungjih@andrew.cmu.edu), Seohyun Lee (seohyun4@andrew.cmu.edu), Yosen Lin (yosenl@andrew.cmu.edu), Isha Parikh (parikh.i@northeastern.edu), Diya Patidar (dpatidar@andrew.cmu.edu), Arunannamalai Sujatha Bharath Raj (asujatha@andrew.cmu.edu), Andrew Scouten (yzb2@txstate.edu), Jeffrey Wang (jdw2@andrew.cmu.edu), Qiyu (Charlie) Yang (qiyuy@andrew.cmu.edu), Xinru Zhang (mayzxr2203@gmail.com), River Zhu (riverz@andrew.cmu.edu), Zhaoyi (Zoey) You (zhaoyiyou.zoey@gmail.com), Heena Dalal (dalalhina@gmail.com/heena.dalal@kcl.ac.uk)
1. Install Docker Desktop from docker.com

2. Clone and set up:

   ```bash
   git clone https://github.com/collaborativebioinformatics/OncoLearn.git
   cd OncoLearn
   git submodule update --init --recursive

   # For NVIDIA GPUs:
   docker compose --profile nvidia up -d
   # For AMD GPUs (native Linux):
   docker compose --profile amd up -d
   # For AMD GPUs (WSL2):
   docker compose --profile amd-wsl up -d
   ```

3. Download sample data:

   ```bash
   # Download genomics data from Xena Browser
   docker compose exec dev oncolearn download --xena --cohorts BRCA
   # Download imaging data from TCIA (manifest only)
   docker compose exec dev oncolearn download --tcia --cohorts BRCA
   # Download imaging data from TCIA (manifest + images)
   docker compose exec dev oncolearn download --tcia --cohorts BRCA --download-images
   ```

4. Start exploring with the Jupyter notebooks in `notebooks/data/`
For detailed setup options and local installation, see Getting Started.
This project supports two installation methods:
Option A: Docker (Recommended)
- Docker Desktop or Docker Engine
- Docker Compose
- VSCode with Dev Containers extension (optional but recommended)
Option B: Local Installation
- Python 3.10+
- R 4.0+
- uv - Fast Python package installer and resolver
Docker provides a consistent development environment and eliminates dependency and compatibility issues.
1. Install Docker Desktop:
   - Download from docker.com
   - Or install Docker Engine on Linux

2. Clone the repository:

   ```bash
   git clone https://github.com/collaborativebioinformatics/OncoLearn.git
   cd OncoLearn
   git submodule update --init --recursive
   ```

3. Start the environment:

   ```bash
   # For NVIDIA GPUs:
   docker compose --profile nvidia up -d
   # For AMD GPUs (native Linux):
   docker compose --profile amd up -d
   # For AMD GPUs (WSL2 on Windows):
   docker compose --profile amd-wsl up -d
   ```

   Note: The Docker setup includes GPU support for both NVIDIA and AMD GPUs. Choose the appropriate profile based on your hardware:
   - `nvidia`: For NVIDIA GPUs
   - `amd`: For AMD GPUs on native Linux
   - `amd-wsl`: For AMD GPUs on Windows Subsystem for Linux 2 (WSL2)
4. Open in VSCode Dev Container (optional):
   - Install the Dev Containers extension
   - Press `F1` → "Dev Containers: Reopen in Container"
   - VSCode will connect to the container with all extensions and tools configured
   - Jupyter notebooks (`.ipynb` files) will work natively in VSCode without a browser
Useful Docker Commands:
```bash
# Stop containers
docker compose down

# Rebuild after dependency changes (use your GPU profile)
docker compose --profile nvidia build   # or --profile amd or --profile amd-wsl
docker compose --profile nvidia up -d   # or --profile amd or --profile amd-wsl

# Execute commands in container
docker compose exec dev bash          # NVIDIA
docker compose exec dev-amd bash      # AMD (native Linux)
docker compose exec dev-amd-wsl bash  # AMD (WSL2)

# Add new Python packages
docker compose exec dev uv add <package-name>          # NVIDIA
docker compose exec dev-amd uv add <package-name>      # AMD (native Linux)
docker compose exec dev-amd-wsl uv add <package-name>  # AMD (WSL2)

# View running containers
docker compose ps
```
1. Install uv (if not already installed).

2. Clone the repository:

   ```bash
   git clone https://github.com/collaborativebioinformatics/OncoLearn.git
   cd OncoLearn
   git submodule update --init --recursive
   ```

3. Install Python dependencies:

   ```bash
   # Install base dependencies
   uv sync

   # Or install with PyTorch extras (choose one based on your hardware):
   uv sync --extra cpu    # CPU-only version
   uv sync --extra cu128  # CUDA 12.8
   uv sync --extra cu130  # CUDA 13.0
   uv sync --extra rocm   # AMD ROCm
   ```

4. Install R dependencies with renv:

   ```r
   # Install renv if not already installed
   install.packages("renv")

   # Restore R package dependencies
   renv::restore()
   ```
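After syncing with one of the PyTorch extras, a quick sanity check can confirm which backend actually got installed. This is a generic snippet, not part of OncoLearn:

```python
def torch_status() -> str:
    """Report which PyTorch build (if any) is importable in this environment."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed; run `uv sync --extra cpu` (or a GPU extra)."
    # Note: ROCm builds of PyTorch also report GPUs through the torch.cuda API.
    backend = "GPU" if torch.cuda.is_available() else "CPU"
    return f"torch {torch.__version__} using {backend}"

if __name__ == "__main__":
    print(torch_status())
```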
For the best development experience, we recommend installing the following VSCode extensions:
- Python (`ms-python.python`) - IntelliSense, debugging, and linting for Python
- Ruff (`charliermarsh.ruff`) - Fast Python linter and formatter
- autopep8 (`ms-python.autopep8`) - Python code formatter following PEP 8 style guide
- R (`REditorSupport.r`) - R language support with syntax highlighting and code execution
- Jupyter (`ms-toolsai.jupyter`) - Interactive Jupyter notebook support
- Dev Containers (`ms-vscode-remote.remote-containers`) - For Docker development (if using Docker)
OncoLearn provides a unified download script for acquiring cancer data from multiple sources:
```bash
# Download a single cohort (all data types)
oncolearn download --xena --cohorts BRCA

# Download and extract gzipped files
oncolearn download --xena --cohorts BRCA --unzip

# Download specific data category
oncolearn download --xena --cohorts BRCA --category mutation

# Download multiple cohorts
oncolearn download --xena --cohorts BRCA,LUAD,ACC

# Download all available cohorts
oncolearn download --xena --all

# List available cohorts
oncolearn download --xena --list
```

Available categories: `clinical`, `mutation`, `cnv`, `mrna`, `mirna`, `protein`, `methylation`

Note: By default, gzipped files are NOT automatically extracted. Use `--unzip` to extract them after download.
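Because downloads stay gzipped by default, they can also be read in Python without extracting them first. A minimal standard-library sketch (the file name below is illustrative — actual Xena file names vary by cohort and data type):

```python
import csv
import gzip

def read_gzipped_tsv(path: str) -> list[list[str]]:
    """Read a gzipped, tab-separated file into rows without unzipping it on disk."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))

# Toy demonstration: write a tiny expression-matrix-like file, then read it back.
demo = "gene\tsample1\tsample2\nTP53\t5.1\t4.8\n"
with gzip.open("demo_matrix.tsv.gz", mode="wt") as handle:
    handle.write(demo)

rows = read_gzipped_tsv("demo_matrix.tsv.gz")
print(rows[0])  # header row: ['gene', 'sample1', 'sample2']
```

Libraries such as pandas can likewise read `.gz` files directly, so `--unzip` is mainly useful when downstream tools require plain text.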
```bash
# Download manifest file only
oncolearn download --tcia --cohorts BRCA

# Download manifest and images (requires nbia-data-retriever)
oncolearn download --tcia --cohorts BRCA --download-images

# Download multiple cohorts with images
oncolearn download --tcia --cohorts BRCA,LUAD --download-images

# List available cohorts
oncolearn download --tcia --list
```

Note: The `--download-images` flag requires the nbia-data-retriever tool to be installed.
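When scripting many downloads (for example, looping over cohorts), it can help to assemble the argument lists programmatically. The helper below is a hypothetical convenience, not part of OncoLearn; the flags mirror the ones documented above:

```python
import subprocess

def build_download_cmd(source: str, cohorts: list[str],
                       unzip: bool = False, download_images: bool = False) -> list[str]:
    """Assemble an `oncolearn download` argument list for subprocess.run()."""
    cmd = ["oncolearn", "download", f"--{source}", "--cohorts", ",".join(cohorts)]
    if unzip:
        cmd.append("--unzip")
    if download_images:
        cmd.append("--download-images")
    return cmd

# Example: download three Xena cohorts one at a time (requires oncolearn on PATH).
for cohort in ["BRCA", "LUAD", "ACC"]:
    cmd = build_download_cmd("xena", [cohort], unzip=True)
    # subprocess.run(cmd, check=True)  # uncomment to actually run the downloads
    print(" ".join(cmd))
```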
When using Docker, prefix commands with `docker compose exec <service>` to run them inside the container:
```bash
# NVIDIA GPU container
docker compose exec dev oncolearn download --xena --cohorts BRCA

# AMD GPU container
docker compose exec dev-amd oncolearn download --tcia --cohorts BRCA --download-images
```

After installation, `oncolearn` is available as a command-line tool:

```bash
# Using uv (development)
uv run oncolearn download --xena --cohorts BRCA

# After pip install (production)
pip install -e .
oncolearn download --xena --cohorts BRCA
```

Comprehensive guides and documentation are available in the docs/ folder:
- TCGA Data Download Guide - Detailed instructions for downloading and managing TCGA datasets
- TCIA Data Download Guide - Guide for downloading imaging data from TCIA
- GitHub Authentication Setup - Configure SSH authentication for GitHub access
- Models Documentation - Overview of machine learning models and architectures
- `data/` - Data storage directory (downloaded TCGA datasets)
- `docs/` - Project documentation and guides
- `notebooks/` - Jupyter notebooks for data exploration and analysis
- `scripts/` - Data download and preprocessing scripts
  - `download.py` - Unified CLI for downloading data from Xena Browser and TCIA
  - `data/download_xena.py` - Xena Browser download utilities
  - `data/download_tcia.py` - TCIA download utilities
- `src/oncolearn/` - Core Python package for cancer genomics analysis
- `src/multimodal/` - Multimodal learning framework for integrating multi-omic data
- `configs/` - Configuration files for training and testing
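After cloning (and running `git submodule update`), a quick check that this layout is in place can catch a missed submodule step. A small stdlib-only sketch, with directory names taken from the list above:

```python
from pathlib import Path

EXPECTED_DIRS = ["data", "docs", "notebooks", "scripts", "src/oncolearn", "src/multimodal"]

def missing_dirs(root: str, expected: list[str] = EXPECTED_DIRS) -> list[str]:
    """Return the expected subdirectories that do not exist under `root`."""
    return [d for d in expected if not (Path(root) / d).is_dir()]

# Example: run from the repository root; an empty list means the layout looks complete.
print(missing_dirs("."))
```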
For more information on downloading and working with TCGA data, see the TCGA Data Download Guide.
This project is licensed under the MIT License - see the LICENSE file for details.
Artificial intelligence tools, including large language models (LLMs), were used during the development of this project to support writing, clarify technical concepts, and assist in generating code snippets. These tools served as an aid for idea refinement, debugging, and improving the readability of explanations and documentation. All AI-generated text and code were thoroughly reviewed, verified for correctness, and understood in full before being incorporated into this work. The responsibility for all final decisions, interpretations, and implementations remains solely with the contributors.
