Polinabinder/scdl chunks #1424
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Summary
Introduces a chunked dataset format for SCDL, splitting large monolithic datasets into smaller chunks for more efficient loading and for future remote storage support.
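The core idea behind chunking can be sketched as a simple index translation: a global row index is split into a chunk id and an offset within that chunk, so a chunked dataset can be addressed exactly like the monolithic one. The helper below is purely illustrative; `locate_row` and its argument names are assumptions for exposition, not part of the SCDL code.

```python
# Illustrative only: shows how a global row index could map onto fixed-size
# chunks. The function and parameter names are assumptions, not the actual
# SCDL implementation.
def locate_row(global_row: int, chunk_size: int) -> tuple[int, int]:
    """Return (chunk_id, local_row) for a global row index."""
    chunk_id = global_row // chunk_size
    local_row = global_row % chunk_size
    return chunk_id, local_row


# Example: with 100_000-row chunks, global row 250_123 lives in chunk 2, row 50_123.
assert locate_row(250_123, 100_000) == (2, 50_123)
```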
Key Features
- Chunked datasets are accessed through the same SingleCellMemMapDataset interface as before

Changes
- partition_scdl.py
- single_cell_memmap_dataset.py: to_chunked() method and chunked loading support

Usage
Convert an existing SCDL dataset to the chunked format:

```python
from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

# Load existing dataset
ds = SingleCellMemMapDataset("/path/to/scdl")

# Convert to chunked (100k rows per chunk)
chunked_ds = ds.to_chunked("/path/to/chunked_scdl", chunk_size=100_000)

# Use normally
print(len(chunked_ds))       # Same row count
row = chunked_ds.get_row(0)  # Works transparently
```
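Since the chunked dataset keeps the same row count and row access, downstream loading should look unchanged. The sketch below is a minimal illustration assuming the chunked dataset behaves like a map-style PyTorch dataset (exposes `__len__` and `__getitem__`); the batch size and identity `collate_fn` are illustrative choices, not part of this PR.

```python
# A minimal sketch, assuming chunked_ds (from the snippet above) is usable as a
# map-style PyTorch dataset. The identity collate_fn keeps rows as a plain list.
from torch.utils.data import DataLoader

loader = DataLoader(
    chunked_ds,                      # the chunked dataset from the snippet above
    batch_size=32,
    shuffle=True,
    num_workers=0,                   # keep the lambda collate picklable everywhere
    collate_fn=lambda batch: batch,  # replace with a task-specific collate as needed
)

for batch in loader:
    print(len(batch))  # 32 rows per batch (except possibly the last)
    break
```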