A comprehensive toolkit for cancer genomics analysis and biomarker discovery using RNA-seq data from The Cancer Genome Atlas (TCGA). OncoLearn leverages machine learning and statistical methods for cancer subtyping and identifying potential diagnostic and prognostic markers.
Aryan Sharan Guda (aryanshg@andrew.cmu.edu), Seungjin Han (seungjih@andrew.cmu.edu), Seohyun Lee (seohyun4@andrew.cmu.edu), Yosen Lin (yosenl@andrew.cmu.edu), Isha Parikh (parikh.i@northeastern.edu), Diya Patidar (dpatidar@andrew.cmu.edu), Arunannamalai Sujatha Bharath Raj (asujatha@andrew.cmu.edu), Andrew Scouten (yzb2@txstate.edu), Jeffrey Wang (jdw2@andrew.cmu.edu), Qiyu (Charlie) Yang (qiyuy@andrew.cmu.edu), Xinru Zhang (mayzxr2203@gmail.com), River Zhu (riverz@andrew.cmu.edu), Zhaoyi (Zoey) You (zhaoyiyou.zoey@gmail.com), Heena Dalal (dalalhina@gmail.com/heena.dalal@kcl.ac.uk)
1. Install Docker Desktop from docker.com

2. Clone and set up:

   ```bash
   git clone https://github.com/collaborativebioinformatics/OncoLearn.git
   cd OncoLearn
   git submodule update --init --recursive

   # For NVIDIA GPUs:
   docker compose --profile nvidia up -d
   # For AMD GPUs (native Linux):
   docker compose --profile amd up -d
   # For AMD GPUs (WSL2):
   docker compose --profile amd-wsl up -d
   ```

3. Download sample data:

   ```bash
   # Download genomics data from Xena Browser
   docker compose exec dev oncolearn download --xena --cohorts BRCA
   # Download imaging data from TCIA (manifest only)
   docker compose exec dev oncolearn download --tcia --cohorts BRCA
   # Download imaging data from TCIA (manifest + images)
   docker compose exec dev oncolearn download --tcia --cohorts BRCA --download-images
   ```

4. Start exploring with the Jupyter notebooks in `notebooks/data/`
For detailed setup options and local installation, see Getting Started.
This project supports two installation methods:
Option A: Docker (Recommended)
- Docker Desktop or Docker Engine
- Docker Compose
- VSCode with Dev Containers extension (optional but recommended)
Option B: Local Installation
- Python 3.10+
- R 4.0+
- uv - Fast Python package installer and resolver
Docker provides a consistent development environment and eliminates dependency and compatibility issues.
1. Install Docker Desktop:
   - Download from docker.com
   - Or install Docker Engine on Linux

2. Clone the repository:

   ```bash
   git clone https://github.com/collaborativebioinformatics/OncoLearn.git
   cd OncoLearn
   git submodule update --init --recursive
   ```

3. Start the environment:

   ```bash
   # For NVIDIA GPUs:
   docker compose --profile nvidia up -d
   # For AMD GPUs (native Linux):
   docker compose --profile amd up -d
   # For AMD GPUs (WSL2 on Windows):
   docker compose --profile amd-wsl up -d
   ```

   Note: The Docker setup includes GPU support for both NVIDIA and AMD GPUs. Choose the appropriate profile based on your hardware:
   - `nvidia`: For NVIDIA GPUs
   - `amd`: For AMD GPUs on native Linux
   - `amd-wsl`: For AMD GPUs on Windows Subsystem for Linux 2 (WSL2)
4. Open in VSCode Dev Container (optional):
   - Install the Dev Containers extension
   - Press `F1` → "Dev Containers: Reopen in Container"
   - VSCode will connect to the container with all extensions and tools configured
   - Jupyter notebooks (`.ipynb` files) will work natively in VSCode without a browser
Useful Docker Commands:
```bash
# Stop containers
docker compose down

# Rebuild after dependency changes (use your GPU profile)
docker compose --profile nvidia build   # or --profile amd or --profile amd-wsl
docker compose --profile nvidia up -d   # or --profile amd or --profile amd-wsl

# Execute commands in container
docker compose exec dev bash          # NVIDIA
docker compose exec dev-amd bash      # AMD (native Linux)
docker compose exec dev-amd-wsl bash  # AMD (WSL2)

# Add new Python packages
docker compose exec dev uv add <package-name>          # NVIDIA
docker compose exec dev-amd uv add <package-name>      # AMD (native Linux)
docker compose exec dev-amd-wsl uv add <package-name>  # AMD (WSL2)

# View running containers
docker compose ps
```
1. Install uv (if not already installed).

2. Clone the repository:

   ```bash
   git clone https://github.com/collaborativebioinformatics/OncoLearn.git
   cd OncoLearn
   git submodule update --init --recursive
   ```

3. Install Python dependencies:

   ```bash
   # Install base dependencies
   uv sync

   # Or install with PyTorch extras (choose one based on your hardware):
   uv sync --extra cpu    # CPU-only version
   uv sync --extra cu128  # CUDA 12.8
   uv sync --extra cu130  # CUDA 13.0
   uv sync --extra rocm   # AMD ROCm
   ```

4. Install R dependencies with renv:

   ```r
   # Install renv if not already installed
   install.packages("renv")

   # Restore R package dependencies
   renv::restore()
   ```
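After syncing with one of the PyTorch extras, a quick sanity check can confirm which backend actually got installed. This is a generic snippet, not part of OncoLearn:

```python
def torch_status() -> str:
    """Report which PyTorch build (if any) is importable in this environment."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed; run `uv sync --extra cpu` (or a GPU extra)."
    # Note: ROCm builds of PyTorch also report GPUs through the torch.cuda API.
    backend = "GPU" if torch.cuda.is_available() else "CPU"
    return f"torch {torch.__version__} using {backend}"

if __name__ == "__main__":
    print(torch_status())
```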
For the best development experience, we recommend installing the following VSCode extensions:
- Python (`ms-python.python`) - IntelliSense, debugging, and linting for Python
- Ruff (`charliermarsh.ruff`) - Fast Python linter and formatter
- autopep8 (`ms-python.autopep8`) - Python code formatter following PEP 8 style guide
- R (`REditorSupport.r`) - R language support with syntax highlighting and code execution
- Jupyter (`ms-toolsai.jupyter`) - Interactive Jupyter notebook support
- Dev Containers (`ms-vscode-remote.remote-containers`) - For Docker development (if using Docker)
OncoLearn provides a unified download script for acquiring cancer data from multiple sources:
```bash
# Download a single cohort (all data types)
oncolearn download --xena --cohorts BRCA

# Download and extract gzipped files
oncolearn download --xena --cohorts BRCA --unzip

# Download specific data category
oncolearn download --xena --cohorts BRCA --category mutation

# Download multiple cohorts
oncolearn download --xena --cohorts BRCA,LUAD,ACC

# Download all available cohorts
oncolearn download --xena --all

# List available cohorts
oncolearn download --xena --list
```

Available categories: `clinical`, `mutation`, `cnv`, `mrna`, `mirna`, `protein`, `methylation`

Note: By default, gzipped files are NOT automatically extracted. Use `--unzip` to extract them after download.
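Because downloads stay gzipped by default, they can also be read in Python without extracting them first. A minimal standard-library sketch (the file name below is illustrative — actual Xena file names vary by cohort and data type):

```python
import csv
import gzip

def read_gzipped_tsv(path: str) -> list[list[str]]:
    """Read a gzipped, tab-separated file into rows without unzipping it on disk."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))

# Toy demonstration: write a tiny expression-matrix-like file, then read it back.
demo = "gene\tsample1\tsample2\nTP53\t5.1\t4.8\n"
with gzip.open("demo_matrix.tsv.gz", mode="wt") as handle:
    handle.write(demo)

rows = read_gzipped_tsv("demo_matrix.tsv.gz")
print(rows[0])  # header row: ['gene', 'sample1', 'sample2']
```

Libraries such as pandas can likewise read `.gz` files directly, so `--unzip` is mainly useful when downstream tools require plain text.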
```bash
# Download manifest file only
oncolearn download --tcia --cohorts BRCA

# Download manifest and images (requires nbia-data-retriever)
oncolearn download --tcia --cohorts BRCA --download-images

# Download multiple cohorts with images
oncolearn download --tcia --cohorts BRCA,LUAD --download-images

# List available cohorts
oncolearn download --tcia --list
```

Note: The `--download-images` flag requires the nbia-data-retriever tool to be installed.
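When scripting many downloads (for example, looping over cohorts), it can help to assemble the argument lists programmatically. The helper below is a hypothetical convenience, not part of OncoLearn; the flags mirror the ones documented above:

```python
import subprocess

def build_download_cmd(source: str, cohorts: list[str],
                       unzip: bool = False, download_images: bool = False) -> list[str]:
    """Assemble an `oncolearn download` argument list for subprocess.run()."""
    cmd = ["oncolearn", "download", f"--{source}", "--cohorts", ",".join(cohorts)]
    if unzip:
        cmd.append("--unzip")
    if download_images:
        cmd.append("--download-images")
    return cmd

# Example: download three Xena cohorts one at a time (requires oncolearn on PATH).
for cohort in ["BRCA", "LUAD", "ACC"]:
    cmd = build_download_cmd("xena", [cohort], unzip=True)
    # subprocess.run(cmd, check=True)  # uncomment to actually run the downloads
    print(" ".join(cmd))
```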
When using Docker, prefix commands with `docker compose exec <service>` to run them inside the container:
```bash
# NVIDIA GPU container
docker compose exec dev oncolearn download --xena --cohorts BRCA

# AMD GPU container
docker compose exec dev-amd oncolearn download --tcia --cohorts BRCA --download-images
```

After installation, `oncolearn` is available as a command-line tool:

```bash
# Using uv (development)
uv run oncolearn download --xena --cohorts BRCA

# After pip install (production)
pip install -e .
oncolearn download --xena --cohorts BRCA
```

Comprehensive guides and documentation are available in the docs/ folder:
- TCGA Data Download Guide - Detailed instructions for downloading and managing TCGA datasets
- TCIA Data Download Guide - Guide for downloading imaging data from TCIA
- GitHub Authentication Setup - Configure SSH authentication for GitHub access
- Models Documentation - Overview of machine learning models and architectures
- `data/` - Data storage directory (downloaded TCGA datasets)
- `docs/` - Project documentation and guides
- `notebooks/` - Jupyter notebooks for data exploration and analysis
- `scripts/` - Data download and preprocessing scripts
  - `download.py` - Unified CLI for downloading data from Xena Browser and TCIA
  - `data/download_xena.py` - Xena Browser download utilities
  - `data/download_tcia.py` - TCIA download utilities
- `src/oncolearn/` - Core Python package for cancer genomics analysis
- `src/multimodal/` - Multimodal learning framework for integrating multi-omic data
- `configs/` - Configuration files for training and testing
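After cloning (and running `git submodule update`), a quick check that this layout is in place can catch a missed submodule step. A small stdlib-only sketch, with directory names taken from the list above:

```python
from pathlib import Path

EXPECTED_DIRS = ["data", "docs", "notebooks", "scripts", "src/oncolearn", "src/multimodal"]

def missing_dirs(root: str, expected: list[str] = EXPECTED_DIRS) -> list[str]:
    """Return the expected subdirectories that do not exist under `root`."""
    return [d for d in expected if not (Path(root) / d).is_dir()]

# Example: run from the repository root; an empty list means the layout looks complete.
print(missing_dirs("."))
```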
For more information on downloading and working with TCGA data, see the TCGA Data Download Guide.
This project is licensed under the MIT License - see the LICENSE file for details.
Artificial intelligence tools, including large language models (LLMs), were used during the development of this project to support writing, clarify technical concepts, and assist in generating code snippets. These tools served as an aid for idea refinement, debugging, and improving the readability of explanations and documentation. All AI-generated text and code were thoroughly reviewed, verified for correctness, and understood in full before being incorporated into this work. The responsibility for all final decisions, interpretations, and implementations remains solely with the contributors.
