Benchmark datasets and evaluation tools for reconstructing cell lineage trees from phenotypic data

kuang-da/CellTreeBench

CellTreeBench

A benchmark dataset for cell lineage tree reconstruction from single-cell data. This repository is the dataset part of CellTreeQM (https://kuang-da.github.io/CellTreeQM-page/).

Overview

CellTreeBench provides standardized datasets and evaluation tools for benchmarking methods that reconstruct cell lineage trees from single-cell RNA sequencing data and other molecular measurements. The package includes multiple real and simulated datasets, preprocessing utilities, evaluation metrics, and ready-to-use train/test splits for supervised learning approaches.

This repository is under active development: we’re enhancing the dataset class, expanding the preprocessing pipeline, and integrating additional modalities.

Features

  • Multiple Dataset Types: Synthetic datasets generated by Brownian motion, C. elegans lineage data, C. briggsae lineage data, CRISPR-barcode data, and species-level DNA methylation data
  • Standardized Splits: Pre-computed train/test splits with biological and technical sampling strategies
  • Evaluation Metrics: Distance metrics and reconstruction evaluation tools
  • Tree Operations: Utilities for tree manipulation, quartet generation, and lineage analysis
  • Easy-to-Use API: Simple functions to load datasets with consistent interfaces
  • Robust Configuration: Automatic data directory detection with environment variable overrides

Installation

From Source (Development)

git clone https://github.com/kuang-da/CellTreeBench.git
# Build the devcontainer
# Then cd into the root directory /workspaces/CellTreeBench
pip install -e .

Requirements

  • Python ≥ 3.8
  • Dependencies will be installed automatically via pip

Configuration (Optional)

# Set custom data directory (optional)
export CELLTREEBENCH_DATA_DIR="/path/to/your/data"

# Set custom package root (rarely needed)
export CELLTREEBENCH_ROOT="/path/to/package/root"

The package automatically detects data directories, but you can override defaults using environment variables for custom installations.
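
As a rough illustration of how an environment-variable override of this kind typically behaves, the sketch below resolves a data directory by checking `CELLTREEBENCH_DATA_DIR` first and falling back to a default. The helper name `resolve_data_dir` and the `"data"` default are assumptions made for this example; the package's own detection logic lives in `celltreebench.config` and may differ.

```python
import os
from pathlib import Path


def resolve_data_dir(default="data"):
    """Return the data directory, preferring CELLTREEBENCH_DATA_DIR when set.

    Hypothetical helper for illustration only; not the package's implementation.
    """
    override = os.environ.get("CELLTREEBENCH_DATA_DIR")
    # An unset or empty variable falls back to the default location.
    return Path(override).expanduser() if override else Path(default)


print(f"Resolved data directory: {resolve_data_dir()}")
```

Treating an empty variable the same as an unset one avoids accidentally resolving to the current directory when the variable is exported without a value.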

Quick Start

Loading C. elegans Dataset

from celltreebench.datasets.celegans import load_celegans_supervised_split

# Load train and test splits (data_dir auto-detected)
train_dataset, test_dataset = load_celegans_supervised_split(
    dataset_name="celegans_small",  # or "celegans_mid", "celegans_large"
    lineage_name="P0",
    out_dir="./output",
    sampling_method="biological",
    seed=42
)

print(f"Train shape: {train_dataset.data_normalized.shape}")
print(f"Test shape: {test_dataset.data_normalized.shape}")

Configuration Management

# Access configuration programmatically
from celltreebench.config import get_data_root, get_dataset_path

print(f"Data directory: {get_data_root()}")
print(f"Dataset path: {get_dataset_path('celegans_small')}")

Available Datasets

| Dataset Type | Description | Variants |
|---|---|---|
| C. elegans | Early embryonic development lineage trees with scRNA-seq | celegans_small, celegans_mid, celegans_large |
| C. briggsae | Early embryonic development lineage trees with scRNA-seq | briggsae_mid |
| Tumor | Cancer phylogenies from single-cell data | Multiple cancer types |
| DNA Methylation | DNA methylation-based lineage reconstruction | Species-specific datasets |
| Mutation | Mutation-based lineage tracking | Various experimental systems |
| ARW | Simulated data using Random Walk models | Configurable simulation parameters |
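
To give a feel for what random-walk simulation on a lineage tree can look like, here is a minimal, generic sketch of Brownian motion along a tree. It is not the repository's simulator: the function name `simulate_brownian`, the nested-tuple tree encoding, and the unit edge lengths are all assumptions made for this example.

```python
import random


def simulate_brownian(tree, n_features=3, sigma=1.0, rng=None, state=None):
    """Simulate leaf feature vectors by Brownian motion down a nested-tuple tree.

    Hypothetical illustration: every edge has unit length, so each child state
    is its parent's state plus independent Gaussian noise with std dev sigma.
    """
    if rng is None:
        rng = random.Random(0)
    if state is None:
        state = [0.0] * n_features  # the root starts at the origin
    if not isinstance(tree, tuple):
        return {tree: state}  # a leaf keeps the state it inherited
    leaves = {}
    for child in tree:
        child_state = [x + rng.gauss(0.0, sigma) for x in state]
        leaves.update(simulate_brownian(child, n_features, sigma, rng, child_state))
    return leaves


# Toy four-leaf tree: leaves A and B are siblings, as are C and D.
features = simulate_brownian((("A", "B"), ("C", "D")), n_features=5)
for name, vector in features.items():
    print(name, [round(x, 2) for x in vector])
```

Leaves that share more of their path from the root share more of the accumulated noise, which is why closely related cells tend to end up with similar simulated features.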

Examples

Basic Usage

See examples/supervised_split_example.ipynb for a complete walkthrough of:

  • Loading different dataset sizes
  • Understanding data formats
  • Working with train/test splits
  • Accessing lineage trees and metadata

Advanced Usage

from celltreebench.datasets.celegans import load_celegans_supervised_split

# Custom quartet sampling
train_dataset, test_dataset = load_celegans_supervised_split(
    dataset_name="celegans_mid",
    sampling_method="technical",
    quartet_sampling_method="stratified",
    subset_leaves=100  # Use subset for faster prototyping
)

# Access tree topology and distances
topology_tree = train_dataset.topology_tree
distance_matrix = train_dataset.ref_dists

Data Format

Each dataset provides:

  • Gene Expression: Normalized count matrices (data_normalized)
  • Lineage Trees: Ground truth tree topologies in various formats
  • Distance Matrices: Precomputed reference distances between cells
  • Metadata: Cell annotations, developmental stages, lineage information
  • Quartets: Precomputed quartet relationships for training/evaluation
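
To make the link between distance matrices and quartets concrete, the sketch below infers the topology of a single quartet from pairwise distances using the four-point condition: on an additive tree metric, the pairing of the four leaves with the smallest sum of within-pair distances is the true split. The function names and the toy matrix are assumptions for illustration; the package's precomputed quartets may be stored and derived differently.

```python
from itertools import combinations


def quartet_topology(dist, a, b, c, d):
    """Return the inferred split of leaves a, b, c, d as a pair of pairs.

    Four-point condition: the pairing with the smallest sum of within-pair
    distances is taken as the split. With noisy distances this is a heuristic.
    """
    pairings = [
        ((a, b), (c, d)),
        ((a, c), (b, d)),
        ((a, d), (b, c)),
    ]
    sums = [dist[p[0]][p[1]] + dist[q[0]][q[1]] for p, q in pairings]
    best = min(range(len(pairings)), key=sums.__getitem__)
    return pairings[best]


def all_quartets(n_leaves):
    """Enumerate every unordered set of four leaf indices."""
    return combinations(range(n_leaves), 4)


# Toy additive distances for the tree ((0, 1), (2, 3)) with unit-length edges.
toy_dist = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
for quartet in all_quartets(4):
    print(quartet, "->", quartet_topology(toy_dist, *quartet))
```

The number of quartets grows as n choose 4, so sampling a subset is usually necessary for trees with many leaves.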

Evaluation Tools

The package includes utilities for evaluating reconstruction methods:

from celltreebench.utils.reconstruction_eval import evaluate_reconstruction
from celltreebench.utils.distance_metrics import compute_tree_distance

# Evaluate a reconstructed tree against ground truth.
# predicted_tree and ground_truth_tree are placeholders for your own tree objects.
metrics = evaluate_reconstruction(predicted_tree, ground_truth_tree)
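
For readers unfamiliar with tree-comparison metrics, the following is a minimal sketch of one common choice, a rooted Robinson-Foulds distance, which counts the clades found in exactly one of two trees. It is not a description of what `evaluate_reconstruction` computes; the nested-tuple tree encoding and the names `clades` and `rooted_rf_distance` are assumptions made for this example.

```python
def clades(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf contributes only itself
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def rooted_rf_distance(tree_a, tree_b):
    """Count clades present in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))


truth = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
print(rooted_rf_distance(truth, truth))
print(rooted_rf_distance(truth, caterpillar))
```

A distance of zero means the two trees contain exactly the same clades; larger values mean more groupings disagree.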

Project Structure

CellTreeBench/
├── src/celltreebench/
│   ├── datasets/          # Dataset loaders and processors
│   ├── utils/             # Utility functions
│   ├── metrics/           # Evaluation metrics
│   ├── config.py          # Centralized configuration management
│   └── curation/          # Data curation tools
├── examples/              # Jupyter notebook examples
├── tests/                 # Comprehensive test suite (30+ tests)
│   ├── test_config.py     # Configuration system tests
│   └── test_datasets_integration.py  # Dataset integration tests
├── notebooks/             # Jupyter notebooks for data curation
└── data/                  # Dataset storage (not included in repo)

Testing

Run the test suite to verify your installation:

cd tests
python -m pytest -v

The test suite includes 30+ unit and integration tests covering configuration, path discovery, dataset loading, and backward compatibility.

License

This project is licensed under the MIT License - see the LICENSE file for details.
