@polinabinder1 polinabinder1 commented Jan 19, 2026

Remote Chunked SCDL: Scalable Data Loading from Cloud Storage

Summary

Introduces remote loading for chunked SCDL datasets, enabling efficient training on datasets larger than local storage by streaming chunks from S3/GCS with LRU caching and background prefetching.

Key Features

  • Remote Loading: Stream chunks from S3, GCS, or HTTP with LRU caching
  • Smart Sampling: ChunkAwareSampler minimizes cache thrashing by iterating chunk-by-chunk
  • Prefetching: Background downloads keep the GPU fed while the current chunks are being processed
  • Configurable Caching: Control memory usage with max_cached_chunks
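To make the caching behavior concrete, here is a minimal sketch of an LRU chunk cache bounded by `max_cached_chunks`. The class name, methods, and the `chunk_id -> local path` mapping are illustrative assumptions, not the actual `remote_chunk_loader.py` API.

```python
from collections import OrderedDict
from typing import Optional


class LRUChunkCache:
    """Illustrative LRU cache for downloaded chunks, keyed by chunk index.

    NOTE: hypothetical sketch; the real loader's interface may differ.
    """

    def __init__(self, max_cached_chunks: int):
        self.max_cached_chunks = max_cached_chunks
        self._cache: "OrderedDict[int, str]" = OrderedDict()  # chunk_id -> local path

    def get(self, chunk_id: int) -> Optional[str]:
        if chunk_id not in self._cache:
            return None
        # Mark as most recently used.
        self._cache.move_to_end(chunk_id)
        return self._cache[chunk_id]

    def put(self, chunk_id: int, local_path: str) -> None:
        self._cache[chunk_id] = local_path
        self._cache.move_to_end(chunk_id)
        # Evict least-recently-used chunks beyond the budget.
        while len(self._cache) > self.max_cached_chunks:
            evicted_id, evicted_path = self._cache.popitem(last=False)
            # A real loader would also delete evicted_path from disk here.
```

With `max_cached_chunks=3` as in the usage example below, at most three chunks occupy local disk at any time; touching a cached chunk via `get` protects it from the next eviction.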

New Files

File                         Purpose
remote_chunk_loader.py       Downloads, caches, and manages remote chunks with LRU eviction
chunk_sampler.py             PyTorch sampler that iterates in chunk-aware windows
chunked_scdl_benchmark.py    Benchmark comparing local vs. chunked vs. remote performance
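The download-while-processing pattern behind the chunk loader can be sketched with a bounded queue and a background thread. This is a generic illustration of the technique, not the code in `remote_chunk_loader.py`; `download_fn` and `max_inflight` are assumed names.

```python
import queue
import threading


def prefetch_chunks(chunk_ids, download_fn, max_inflight=2):
    """Download chunks in a background thread while the consumer processes
    earlier ones. Yields (chunk_id, data) pairs in the requested order.

    NOTE: hypothetical sketch of the prefetching idea, not the actual loader.
    """
    # Bounded queue limits how many downloaded chunks wait in memory.
    results = queue.Queue(maxsize=max_inflight)

    def worker():
        for cid in chunk_ids:
            results.put((cid, download_fn(cid)))  # blocks when queue is full
        results.put(None)  # sentinel: no more chunks

    threading.Thread(target=worker, daemon=True).start()
    while (item := results.get()) is not None:
        yield item
```

Because the queue is bounded, the downloader naturally stays only a couple of chunks ahead of the consumer instead of pulling the whole dataset at once.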

Usage

1. Upload chunked dataset to S3/GCS

First, convert the dataset to chunked format (see the scdl_chunks branch):

aws s3 sync /path/to/chunked_scdl s3://my-bucket/chunked_scdl/

2. Load from remote storage

from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

ds = SingleCellMemMapDataset.from_remote(
    remote_path="s3://my-bucket/chunked_scdl/",
    cache_dir="/tmp/scdl_cache",
    max_cached_chunks=3,  # Keep 3 chunks in local cache
    storage_options={"endpoint_url": "https://s3.amazonaws.com"},  # Optional
)

3. Use with ChunkAwareSampler for efficient iteration

from torch.utils.data import DataLoader
from bionemo.scdl.io.chunk_sampler import ChunkAwareSampler

sampler = ChunkAwareSampler(
    ds,
    chunks_per_window=2,         # Process 2 chunks at a time
    shuffle_chunks=True,         # Randomize chunk order each epoch
    shuffle_within_window=True,  # Shuffle rows within each window
)

loader = DataLoader(ds, batch_size=64, sampler=sampler)

for batch in loader:
    # Chunks are prefetched automatically
    pass
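The index ordering a chunk-aware sampler produces can be sketched as follows: chunks are grouped into windows, chunk order and within-window row order are optionally shuffled, but every index in a window comes from the same few chunks, so at most `chunks_per_window` chunks must be cached at once. This is an illustration of the idea, not the actual `ChunkAwareSampler` implementation.

```python
import random


def chunk_aware_order(num_rows, chunk_size, chunks_per_window,
                      shuffle_chunks=True, shuffle_within_window=True, seed=0):
    """Yield dataset indices grouped into windows of whole chunks.

    NOTE: hypothetical sketch of chunk-aware sampling, not the real sampler.
    """
    rng = random.Random(seed)
    num_chunks = (num_rows + chunk_size - 1) // chunk_size  # ceil division
    chunk_ids = list(range(num_chunks))
    if shuffle_chunks:
        rng.shuffle(chunk_ids)  # randomize chunk order each epoch
    for w in range(0, len(chunk_ids), chunks_per_window):
        window = []
        for cid in chunk_ids[w:w + chunks_per_window]:
            start = cid * chunk_size
            window.extend(range(start, min(start + chunk_size, num_rows)))
        if shuffle_within_window:
            rng.shuffle(window)  # shuffle rows, but only within the window
        yield from window
```

The trade-off versus fully uniform shuffling is weaker global randomization in exchange for far fewer chunk downloads per epoch.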
