CellTreeBench

A benchmark dataset for cell lineage tree reconstruction from single-cell data. This repository is the dataset part of CellTreeQM (https://kuang-da.github.io/CellTreeQM-page/).

Overview

CellTreeBench provides standardized datasets and evaluation tools for benchmarking methods that reconstruct cell lineage trees from single-cell RNA sequencing data and other molecular measurements. The package includes multiple real and simulated datasets, preprocessing utilities, evaluation metrics, and ready-to-use train/test splits for supervised learning approaches.

This repository is under active development: we’re enhancing the dataset class, expanding the preprocessing pipeline, and integrating additional modalities.

Features

Multiple Dataset Types: Synthetic datasets generated by Brownian motion, C. elegans lineage data, C. briggsae lineage data, CRISPR-barcode data, and species-level DNA methylation data
Standardized Splits: Pre-computed train/test splits with biological and technical sampling strategies
Evaluation Metrics: Distance metrics and reconstruction evaluation tools
Tree Operations: Utilities for tree manipulation, quartet generation, and lineage analysis
Easy-to-Use API: Simple functions to load datasets with consistent interfaces
Robust Configuration: Automatic data directory detection with environment variable overrides

Installation

From Source (Development)

git clone https://github.com/kuang-da/CellTreeBench.git
# Build the devcontainer
# Then cd into the root directory /workspaces/CellTreeBench
pip install -e .

Requirements

Python ≥ 3.8
Dependencies will be installed automatically via pip

Configuration (Optional)

# Set custom data directory (optional)
export CELLTREEBENCH_DATA_DIR="/path/to/your/data"

# Set custom package root (rarely needed)
export CELLTREEBENCH_ROOT="/path/to/package/root"

The package automatically detects data directories, but you can override defaults using environment variables for custom installations.
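
As a rough illustration of the override behaviour described above, the sketch below shows one common way an environment variable can take precedence over a default directory. The helper `resolve_data_dir` is hypothetical and is not part of CellTreeBench; only the variable name `CELLTREEBENCH_DATA_DIR` comes from this README, and the package's actual detection logic may differ.

```python
import os
from pathlib import Path

# Variable name taken from this README; everything else here is illustrative.
ENV_VAR = "CELLTREEBENCH_DATA_DIR"


def resolve_data_dir(default: Path) -> Path:
    """Hypothetical helper: prefer the environment override, else the default."""
    override = os.environ.get(ENV_VAR)
    if override:
        # Expand "~" so values like "~/data" resolve to a usable path.
        return Path(override).expanduser()
    return default


if __name__ == "__main__":
    print(resolve_data_dir(Path("./data")))
```

Setting the variable in the shell before starting Python (as in the export lines above) or via `os.environ` before importing the package are the usual ways to make such an override visible.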

Quick Start

Loading C. elegans Dataset

from celltreebench.datasets.celegans import load_celegans_supervised_split

# Load train and test splits (data_dir auto-detected)
train_dataset, test_dataset = load_celegans_supervised_split(
    dataset_name="celegans_small",  # or "celegans_mid", "celegans_large"
    lineage_name="P0",
    out_dir="./output",
    sampling_method="biological",
    seed=42
)

print(f"Train shape: {train_dataset.data_normalized.shape}")
print(f"Test shape: {test_dataset.data_normalized.shape}")

Configuration Management

# Access configuration programmatically
from celltreebench.config import get_data_root, get_dataset_path

print(f"Data directory: {get_data_root()}")
print(f"Dataset path: {get_dataset_path('celegans_small')}")

Available Datasets

Dataset Type	Description	Variants
C. elegans	Early embryonic development lineage trees with scRNA-seq	`celegans_small`, `celegans_mid`, `celegans_large`
C. briggsae	Early embryonic development lineage trees with scRNA-seq	`briggsae_mid`
Tumor	Cancer phylogenies from single-cell data	Multiple cancer types
DNA Methylation	DNA methylation-based lineage reconstruction	Species-specific datasets
Mutation	Mutation-based lineage tracking	Various experimental systems
ARW	Simulated data using Random Walk models	Configurable simulation parameters
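
To give a feel for what simulated lineage data of this kind can look like, here is a minimal sketch of Brownian motion along a complete binary tree: each daughter cell inherits its parent's feature vector plus independent Gaussian noise. The function `simulate_brownian_tree` and its parameters are assumptions made for illustration; this is not the generator used to build the CellTreeBench simulated datasets.

```python
import random


def simulate_brownian_tree(depth, n_features, sigma=1.0, seed=0):
    """Simulate leaf feature vectors on a complete binary tree.

    Each child receives its parent's vector plus independent Gaussian noise
    (standard deviation `sigma`) per feature. Returns a dict mapping a leaf
    name to its feature vector; a leaf name encodes the path from the root,
    e.g. "010".
    """
    rng = random.Random(seed)
    leaves = {}

    def descend(name, vector, remaining):
        if remaining == 0:
            leaves[name] = vector
            return
        for branch in "01":
            child = [x + rng.gauss(0.0, sigma) for x in vector]
            descend(name + branch, child, remaining - 1)

    descend("", [0.0] * n_features, depth)
    return leaves


leaves = simulate_brownian_tree(depth=3, n_features=5)
print(len(leaves))  # 8 leaves for a depth-3 binary tree
```

Because the leaf names record the branching path, the ground-truth topology can be recovered from shared name prefixes, which is convenient when checking a reconstruction method on toy data.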

Examples

Basic Usage

See examples/supervised_split_example.ipynb for a complete walkthrough of:

Loading different dataset sizes
Understanding data formats
Working with train/test splits
Accessing lineage trees and metadata

Advanced Usage

from celltreebench.datasets.celegans import load_celegans_supervised_split

# Custom quartet sampling
train_dataset, test_dataset = load_celegans_supervised_split(
    dataset_name="celegans_mid",
    sampling_method="technical",
    quartet_sampling_method="stratified",
    subset_leaves=100  # Use subset for faster prototyping
)

# Access tree topology and distances
topology_tree = train_dataset.topology_tree
distance_matrix = train_dataset.ref_dists

Data Format

Each dataset provides:

Gene Expression: Normalized count matrices (data_normalized)
Lineage Trees: Ground truth tree topologies in various formats
Distance Matrices: Precomputed reference distances between cells
Metadata: Cell annotations, developmental stages, lineage information
Quartets: Precomputed quartet relationships for training/evaluation
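
For readers unfamiliar with quartets, the following sketch shows how a quartet topology can be read off a tree distance matrix using the four-point condition: of the three ways to pair four leaves, the pairing with the smallest sum of within-pair distances matches the tree. The functions `quartet_topology` and `all_quartets` are hypothetical and written for this example; the storage format of the precomputed quartets in CellTreeBench is not described here and may differ.

```python
from itertools import combinations


def quartet_topology(dist, a, b, c, d):
    """Return the pairing ((x, y), (z, w)) with the smallest within-pair distance sum."""
    pairings = [
        ((a, b), (c, d)),
        ((a, c), (b, d)),
        ((a, d), (b, c)),
    ]

    def pairing_sum(pairing):
        (x, y), (z, w) = pairing
        return dist[x][y] + dist[z][w]

    return min(pairings, key=pairing_sum)


def all_quartets(dist):
    """Map every 4-leaf subset (by index) to its inferred topology."""
    n = len(dist)
    return {q: quartet_topology(dist, *q) for q in combinations(range(n), 4)}


# Path-length distances for the tree ((0,1),(2,3)) with unit branch lengths.
dist = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
print(quartet_topology(dist, 0, 1, 2, 3))  # ((0, 1), (2, 3))
```

The number of quartets grows as n choose 4, which is why subsampling quartets is common for larger trees.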

Evaluation Tools

The package includes utilities for evaluating reconstruction methods:

from celltreebench.utils.reconstruction_eval import evaluate_reconstruction
from celltreebench.utils.distance_metrics import compute_tree_distance

# Evaluate a reconstructed tree against the ground truth.
# predicted_tree and ground_truth_tree must be defined beforehand.
metrics = evaluate_reconstruction(predicted_tree, ground_truth_tree)
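
As background on what a tree-comparison metric can measure, the sketch below computes a rooted Robinson-Foulds distance between two trees written as nested tuples. This is a generic, self-contained illustration: the helpers `clades` and `rf_distance` are hypothetical, and this README does not state which metrics `evaluate_reconstruction` actually returns.

```python
def clades(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf contributes only itself
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: number of clades found in only one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


truth = (("A", "B"), ("C", "D"))
guess = (("A", "C"), ("B", "D"))
print(rf_distance(truth, truth))  # 0
print(rf_distance(truth, guess))  # 4
```

A distance of 0 means the two trees share every clade; larger values indicate more clades present in one tree but not the other.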

Project Structure

CellTreeBench/
├── src/celltreebench/
│   ├── datasets/          # Dataset loaders and processors
│   ├── utils/             # Utility functions
│   ├── metrics/           # Evaluation metrics
│   ├── config.py          # Centralized configuration management
│   └── curation/          # Data curation tools
├── examples/              # Jupyter notebook examples
├── tests/                 # Comprehensive test suite (30+ tests)
│   ├── test_config.py     # Configuration system tests
│   └── test_datasets_integration.py  # Dataset integration tests
├── notebooks/             # Jupyter notebooks for data curation
└── data/                  # Dataset storage (not included in repo)

Testing

Run the test suite to verify your installation:

cd tests
python -m pytest -v

The test suite includes 30+ unit and integration tests covering configuration, path discovery, dataset loading, and backward compatibility.

License

This project is licensed under the MIT License - see the LICENSE file for details.
