Polinabinder/scdl chunks #1424
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Summary
Introduces a chunked dataset format for SCDL, splitting large monolithic datasets into smaller chunks for more efficient loading and for future remote storage support.
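The core idea behind chunking can be sketched as a simple index translation: a global row index is split into a chunk id and an offset within that chunk, so a chunked dataset can be addressed exactly like the monolithic one. The helper below is purely illustrative; `locate_row` and its argument names are assumptions for exposition, not part of the SCDL code.

```python
# Illustrative only: shows how a global row index could map onto fixed-size
# chunks. The function and parameter names are assumptions, not the actual
# SCDL implementation.
def locate_row(global_row: int, chunk_size: int) -> tuple[int, int]:
    """Return (chunk_id, local_row) for a global row index."""
    chunk_id = global_row // chunk_size
    local_row = global_row % chunk_size
    return chunk_id, local_row


# Example: with 100_000-row chunks, global row 250_123 lives in chunk 2, row 50_123.
assert locate_row(250_123, 100_000) == (2, 50_123)
```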
Key Features
- Chunked datasets are accessed through the same SingleCellMemMapDataset interface as before

Changes
- partition_scdl.py
- single_cell_memmap_dataset.py: to_chunked() method and chunked loading support

Usage
Convert an existing SCDL dataset to the chunked format:

```python
from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

# Load existing dataset
ds = SingleCellMemMapDataset("/path/to/scdl")

# Convert to chunked (100k rows per chunk)
chunked_ds = ds.to_chunked("/path/to/chunked_scdl", chunk_size=100_000)

# Use normally
print(len(chunked_ds))       # Same row count
row = chunked_ds.get_row(0)  # Works transparently
```
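Since the chunked dataset keeps the same row count and row access, downstream loading should look unchanged. The sketch below is a minimal illustration assuming the chunked dataset behaves like a map-style PyTorch dataset (exposes `__len__` and `__getitem__`); the batch size and identity `collate_fn` are illustrative choices, not part of this PR.

```python
# A minimal sketch, assuming chunked_ds (from the snippet above) is usable as a
# map-style PyTorch dataset. The identity collate_fn keeps rows as a plain list.
from torch.utils.data import DataLoader

loader = DataLoader(
    chunked_ds,                      # the chunked dataset from the snippet above
    batch_size=32,
    shuffle=True,
    num_workers=0,                   # keep the lambda collate picklable everywhere
    collate_fn=lambda batch: batch,  # replace with a task-specific collate as needed
)

for batch in loader:
    print(len(batch))  # 32 rows per batch (except possibly the last)
    break
```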