A benchmark dataset for cell lineage tree reconstruction from single-cell data. This repository is the dataset part of CellTreeQM (https://kuang-da.github.io/CellTreeQM-page/).
CellTreeBench provides standardized datasets and evaluation tools for benchmarking methods that reconstruct cell lineage trees from single-cell RNA sequencing data and other molecular measurements. The package includes multiple real and simulated datasets, preprocessing utilities, evaluation metrics, and ready-to-use train/test splits for supervised learning approaches.
This repository is under active development: we’re enhancing the dataset class, expanding the preprocessing pipeline, and integrating additional modalities.
- Multiple Dataset Types: Synthetic datasets generated by Brownian motion, C. elegans and C. briggsae lineage data, CRISPR-barcode data, and species-level DNA methylation data
- Standardized Splits: Pre-computed train/test splits with biological and technical sampling strategies
- Evaluation Metrics: Distance metrics and reconstruction evaluation tools
- Tree Operations: Utilities for tree manipulation, quartet generation, and lineage analysis
- Easy-to-Use API: Simple functions to load datasets with consistent interfaces
- Robust Configuration: Automatic data directory detection with environment variable overrides
git clone https://github.com/kuang-da/CellTreeBench.git
# Build the devcontainer
# Then cd into the root directory /workspaces/CellTreeBench
pip install -e .

- Python ≥ 3.8
- Dependencies will be installed automatically via pip
# Set custom data directory (optional)
export CELLTREEBENCH_DATA_DIR="/path/to/your/data"
# Set custom package root (rarely needed)
export CELLTREEBENCH_ROOT="/path/to/package/root"

The package detects data directories automatically. For custom installations, override the defaults with the environment variables above.
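To illustrate the override behaviour described above, here is a minimal sketch of how a lookup with this precedence could work. It is not the package's actual implementation: the helper name `resolve_data_dir`, its parameters, and the `./data` fallback are assumptions made for illustration. Only the `CELLTREEBENCH_DATA_DIR` variable name comes from this README.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_data_dir(explicit: Optional[str] = None,
                     default: str = "./data") -> Path:
    """Hypothetical helper: choose a data directory by precedence.

    Order: explicit argument, then CELLTREEBENCH_DATA_DIR, then a default.
    """
    if explicit:
        return Path(explicit).expanduser()
    env_value = os.environ.get("CELLTREEBENCH_DATA_DIR")
    if env_value:
        return Path(env_value).expanduser()
    # Assumed fallback; the real package may search other locations.
    return Path(default)


os.environ["CELLTREEBENCH_DATA_DIR"] = "/tmp/celltree_data"
print(resolve_data_dir())            # taken from the environment variable
print(resolve_data_dir("/my/data"))  # an explicit argument wins
```

For the real resolved paths, use `get_data_root()` from `celltreebench.config`, as shown in the usage example below.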
from celltreebench.datasets.celegans import load_celegans_supervised_split
# Load train and test splits (data_dir auto-detected)
train_dataset, test_dataset = load_celegans_supervised_split(
    dataset_name="celegans_small",  # or "celegans_mid", "celegans_large"
    lineage_name="P0",
    out_dir="./output",
    sampling_method="biological",
    seed=42,
)
print(f"Train shape: {train_dataset.data_normalized.shape}")
print(f"Test shape: {test_dataset.data_normalized.shape}")

# Access configuration programmatically
from celltreebench.config import get_data_root, get_dataset_path
print(f"Data directory: {get_data_root()}")
print(f"Dataset path: {get_dataset_path('celegans_small')}")

| Dataset Type | Description | Variants |
|---|---|---|
| C. elegans | Early embryonic development lineage trees with scRNA-seq | celegans_small, celegans_mid, celegans_large |
| C. briggsae | Early embryonic development lineage trees with scRNA-seq | briggsae_mid |
| Tumor | Cancer phylogenies from single-cell data | Multiple cancer types |
| DNA Methylation | DNA methylation-based lineage reconstruction | Species-specific datasets |
| Mutation | Mutation-based lineage tracking | Various experimental systems |
| ARW | Simulated data using Random Walk models | Configurable simulation parameters |
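As a rough picture of what Brownian-motion data on a lineage tree looks like, the sketch below diffuses a feature vector down a balanced binary tree. It is a generic illustration, not the simulator behind the ARW datasets: the function name `simulate_brownian_tree`, the balanced topology, and the unit branch lengths are all assumptions.

```python
import random


def simulate_brownian_tree(depth, n_features, sigma=1.0, seed=0):
    """Simulate features on the leaves of a balanced binary tree.

    Each child copies its parent's state and adds independent Gaussian
    noise per feature, so leaves that share a longer path stay more similar.
    Returns a dict mapping leaf names (root-to-leaf paths) to feature lists.
    """
    rng = random.Random(seed)
    level = {"": [0.0] * n_features}  # root state at the origin
    for _ in range(depth):
        next_level = {}
        for name, state in level.items():
            for branch in "01":
                next_level[name + branch] = [
                    x + rng.gauss(0.0, sigma) for x in state
                ]
        level = next_level
    return level


leaves = simulate_brownian_tree(depth=3, n_features=5)
print(len(leaves))          # 8 leaves for depth 3
print(sorted(leaves)[:2])   # ['000', '001']
```

The leaf names encode the ground-truth topology: two leaves are siblings when their names differ only in the last character.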
See examples/supervised_split_example.ipynb for a complete walkthrough of:
- Loading different dataset sizes
- Understanding data formats
- Working with train/test splits
- Accessing lineage trees and metadata
# Custom quartet sampling
train_dataset, test_dataset = load_celegans_supervised_split(
    dataset_name="celegans_mid",
    sampling_method="technical",
    quartet_sampling_method="stratified",
    subset_leaves=100,  # Use a subset for faster prototyping
)
# Access tree topology and distances
topology_tree = train_dataset.topology_tree
distance_matrix = train_dataset.ref_dists

Each dataset provides:
- Gene Expression: Normalized count matrices (data_normalized)
- Lineage Trees: Ground truth tree topologies in various formats
- Distance Matrices: Precomputed reference distances between cells
- Metadata: Cell annotations, developmental stages, lineage information
- Quartets: Precomputed quartet relationships for training/evaluation
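For readers new to quartets, the sketch below shows one standard way to read a quartet's topology off a distance matrix using the four-point condition. It does not show how CellTreeBench stores or computes its precomputed quartets: the function name `quartet_topology` and the tuple-of-pairs return format are assumptions for illustration.

```python
from itertools import combinations


def quartet_topology(dist, a, b, c, d):
    """Return the pairing with the smallest summed within-pair distance.

    For an additive tree metric, the four-point condition says the true
    split ab|cd satisfies d(a,b)+d(c,d) <= d(a,c)+d(b,d) = d(a,d)+d(b,c).
    """
    options = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    return min(options, key=options.get)


# Toy additive distances for the tree ((0,1),(2,3)) with unit branch lengths.
dist = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
for quartet in combinations(range(4), 4):
    print(quartet, quartet_topology(dist, *quartet))
# (0, 1, 2, 3) ((0, 1), (2, 3))
```

With more than four leaves, `combinations(range(n), 4)` enumerates every quartet. The count grows as n choose 4, which is why subsampling quartets is common for larger trees.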
The package includes utilities for evaluating reconstruction methods:
from celltreebench.utils.reconstruction_eval import evaluate_reconstruction
from celltreebench.utils.distance_metrics import compute_tree_distance
# Evaluate reconstructed tree against ground truth
metrics = evaluate_reconstruction(predicted_tree, ground_truth_tree)

CellTreeBench/
├── src/celltreebench/
│ ├── datasets/ # Dataset loaders and processors
│ ├── utils/ # Utility functions
│ ├── metrics/ # Evaluation metrics
│ ├── config.py # Centralized configuration management
│ └── curation/ # Data curation tools
├── examples/ # Jupyter notebook examples
├── tests/ # Comprehensive test suite (30+ tests)
│ ├── test_config.py # Configuration system tests
│ └── test_datasets_integration.py # Dataset integration tests
├── notebooks/ # Jupyter notebooks for data curation
└── data/ # Dataset storage (not included in repo)
Run the test suite to verify your installation:
cd tests
python -m pytest -v

The test suite includes 30+ unit and integration tests covering configuration, path discovery, dataset loading, and backward compatibility.
This project is licensed under the MIT License - see the LICENSE file for details.