@polinabinder1 polinabinder1 commented Jan 19, 2026

Remote Chunked SCDL: Scalable Data Loading from Cloud Storage

Summary

Introduces remote loading for chunked SCDL datasets, enabling efficient training on datasets larger than local storage by streaming chunks from S3/GCS with LRU caching and background prefetching.

Key Features

  • Remote Loading: Stream chunks from S3, GCS, or HTTP with LRU caching
  • Smart Sampling: ChunkAwareSampler minimizes cache thrashing by iterating chunk-by-chunk
  • Prefetching: Background downloads keep the GPU fed while the current chunks are being processed
  • Configurable Caching: Control memory usage with max_cached_chunks
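To make the caching behavior concrete, here is a minimal sketch of an LRU chunk cache bounded by `max_cached_chunks`. The class name, methods, and the `chunk_id -> local path` mapping are illustrative assumptions, not the actual `remote_chunk_loader.py` API.

```python
from collections import OrderedDict
from typing import Optional


class LRUChunkCache:
    """Illustrative LRU cache for downloaded chunks, keyed by chunk index.

    NOTE: hypothetical sketch; the real loader's interface may differ.
    """

    def __init__(self, max_cached_chunks: int):
        self.max_cached_chunks = max_cached_chunks
        self._cache: "OrderedDict[int, str]" = OrderedDict()  # chunk_id -> local path

    def get(self, chunk_id: int) -> Optional[str]:
        if chunk_id not in self._cache:
            return None
        # Mark as most recently used.
        self._cache.move_to_end(chunk_id)
        return self._cache[chunk_id]

    def put(self, chunk_id: int, local_path: str) -> None:
        self._cache[chunk_id] = local_path
        self._cache.move_to_end(chunk_id)
        # Evict least-recently-used chunks beyond the budget.
        while len(self._cache) > self.max_cached_chunks:
            evicted_id, evicted_path = self._cache.popitem(last=False)
            # A real loader would also delete evicted_path from disk here.
```

With `max_cached_chunks=3` as in the usage example below, at most three chunks occupy local disk at any time; touching a cached chunk via `get` protects it from the next eviction.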

New Files

File                         Purpose
remote_chunk_loader.py       Downloads, caches, and manages remote chunks with LRU eviction
chunk_sampler.py             PyTorch sampler that iterates in chunk-aware windows
chunked_scdl_benchmark.py    Benchmark comparing local vs. chunked vs. remote performance
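The download-while-processing pattern behind the chunk loader can be sketched with a bounded queue and a background thread. This is a generic illustration of the technique, not the code in `remote_chunk_loader.py`; `download_fn` and `max_inflight` are assumed names.

```python
import queue
import threading


def prefetch_chunks(chunk_ids, download_fn, max_inflight=2):
    """Download chunks in a background thread while the consumer processes
    earlier ones. Yields (chunk_id, data) pairs in the requested order.

    NOTE: hypothetical sketch of the prefetching idea, not the actual loader.
    """
    # Bounded queue limits how many downloaded chunks wait in memory.
    results = queue.Queue(maxsize=max_inflight)

    def worker():
        for cid in chunk_ids:
            results.put((cid, download_fn(cid)))  # blocks when queue is full
        results.put(None)  # sentinel: no more chunks

    threading.Thread(target=worker, daemon=True).start()
    while (item := results.get()) is not None:
        yield item
```

Because the queue is bounded, the downloader naturally stays only a couple of chunks ahead of the consumer instead of pulling the whole dataset at once.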

Usage

1. Upload chunked dataset to S3/GCS

First, convert the dataset to chunked format (see the scdl_chunks branch):

aws s3 sync /path/to/chunked_scdl s3://my-bucket/chunked_scdl/

2. Load from remote storage

from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

ds = SingleCellMemMapDataset.from_remote(
    remote_path="s3://my-bucket/chunked_scdl/",
    cache_dir="/tmp/scdl_cache",
    max_cached_chunks=3,  # Keep 3 chunks in local cache
    storage_options={"endpoint_url": "https://s3.amazonaws.com"},  # Optional
)

3. Use with ChunkAwareSampler for efficient iteration

from torch.utils.data import DataLoader
from bionemo.scdl.io.chunk_sampler import ChunkAwareSampler

sampler = ChunkAwareSampler(
    ds,
    chunks_per_window=2,         # Process 2 chunks at a time
    shuffle_chunks=True,         # Randomize chunk order each epoch
    shuffle_within_window=True,  # Shuffle rows within each window
)

loader = DataLoader(ds, batch_size=64, sampler=sampler)

for batch in loader:
    # Chunks are prefetched automatically
    pass
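The index ordering a chunk-aware sampler produces can be sketched as follows: chunks are grouped into windows, chunk order and within-window row order are optionally shuffled, but every index in a window comes from the same few chunks, so at most `chunks_per_window` chunks must be cached at once. This is an illustration of the idea, not the actual `ChunkAwareSampler` implementation.

```python
import random


def chunk_aware_order(num_rows, chunk_size, chunks_per_window,
                      shuffle_chunks=True, shuffle_within_window=True, seed=0):
    """Yield dataset indices grouped into windows of whole chunks.

    NOTE: hypothetical sketch of chunk-aware sampling, not the real sampler.
    """
    rng = random.Random(seed)
    num_chunks = (num_rows + chunk_size - 1) // chunk_size  # ceil division
    chunk_ids = list(range(num_chunks))
    if shuffle_chunks:
        rng.shuffle(chunk_ids)  # randomize chunk order each epoch
    for w in range(0, len(chunk_ids), chunks_per_window):
        window = []
        for cid in chunk_ids[w:w + chunks_per_window]:
            start = cid * chunk_size
            window.extend(range(start, min(start + chunk_size, num_rows)))
        if shuffle_within_window:
            rng.shuffle(window)  # shuffle rows, but only within the window
        yield from window
```

The trade-off versus fully uniform shuffling is weaker global randomization in exchange for far fewer chunk downloads per epoch.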
