polinabinder1 (Collaborator) commented Jan 19, 2026

Summary

Introduces a chunked dataset format for SCDL that splits large monolithic datasets into smaller chunks, for more efficient loading and future remote storage support.

Key Features

  • Chunked Format: Converts large SCDL datasets into smaller chunks (100k rows each by default)
  • Lazy Loading: Each chunk is memory-mapped independently
  • Seamless API: Chunked datasets work with the existing SingleCellMemMapDataset interface
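
To illustrate the lazy-loading idea above, here is a minimal sketch, not the code in this PR. It assumes one fixed-size record per row and a hypothetical `chunk_00000.bin` naming scheme; `write_chunks` and `LazyChunkedReader` are illustrative names, and the real SCDL on-disk layout is different. A chunk file is memory-mapped only when a row inside it is first requested.

```python
import mmap
import struct
from pathlib import Path

ROW_FMT = "<d"  # assumption: one little-endian float64 per row, for illustration only
ROW_SIZE = struct.calcsize(ROW_FMT)


def write_chunks(values, out_dir: Path, chunk_size: int) -> int:
    """Write values into chunk_00000.bin, chunk_00001.bin, ... and return the chunk count."""
    n_chunks = 0
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        path = out_dir / f"chunk_{n_chunks:05d}.bin"
        path.write_bytes(b"".join(struct.pack(ROW_FMT, v) for v in part))
        n_chunks += 1
    return n_chunks


class LazyChunkedReader:
    """Hypothetical reader that memory-maps a chunk only on first access."""

    def __init__(self, root: Path, chunk_size: int, n_rows: int):
        self.root = root
        self.chunk_size = chunk_size
        self.n_rows = n_rows
        self._maps = {}  # chunk_id -> mmap object, filled lazily

    def _chunk(self, chunk_id: int) -> mmap.mmap:
        if chunk_id not in self._maps:
            with open(self.root / f"chunk_{chunk_id:05d}.bin", "rb") as fh:
                # The mapping stays valid after the file handle is closed.
                self._maps[chunk_id] = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
        return self._maps[chunk_id]

    def get_row(self, idx: int) -> float:
        if not 0 <= idx < self.n_rows:
            raise IndexError(idx)
        # A global row index splits into (which chunk, which row inside it).
        chunk_id, local = divmod(idx, self.chunk_size)
        (value,) = struct.unpack_from(ROW_FMT, self._chunk(chunk_id), local * ROW_SIZE)
        return value
```

With a chunk size of 100k, row 250,000 would resolve to chunk 2, local row 50,000, and only that chunk would be mapped.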

Changes

File                            Purpose
partition_scdl.py               Utility to convert monolithic SCDL → chunked format
single_cell_memmap_dataset.py   Added to_chunked() method and chunked loading support
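
For readers who want a feel for what partitioning involves, the following is a rough sketch under the assumption that rows are stored in a CSR-style layout (a row-pointer array plus column-index and value arrays). `partition_csr` is a hypothetical helper written for this comment, not the function in partition_scdl.py. The key step is rebasing each chunk's row pointers so they start at zero.

```python
def partition_csr(row_ptr, col_idx, data, chunk_size):
    """Split a CSR-style matrix into row chunks with rebased row pointers.

    row_ptr has n_rows + 1 entries; row i owns data[row_ptr[i]:row_ptr[i + 1]].
    """
    n_rows = len(row_ptr) - 1
    chunks = []
    for start in range(0, n_rows, chunk_size):
        end = min(start + chunk_size, n_rows)  # the last chunk may be shorter
        lo, hi = row_ptr[start], row_ptr[end]
        chunks.append(
            {
                # Subtract the chunk's first offset so pointers index the local arrays.
                "row_ptr": [p - lo for p in row_ptr[start:end + 1]],
                "col_idx": col_idx[lo:hi],
                "data": data[lo:hi],
            }
        )
    return chunks
```

Each returned chunk is self-contained, so it can be written to its own directory and loaded without touching the others.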

Usage

Convert existing SCDL to chunked format

from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

# Load existing dataset
ds = SingleCellMemMapDataset("/path/to/scdl")

# Convert to chunked (100k rows per chunk)
chunked_ds = ds.to_chunked("/path/to/chunked_scdl", chunk_size=100_000)

# Use normally
print(len(chunked_ds))  # Same row count
row = chunked_ds.get_row(0)  # Works transparently
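
One way to sanity-check the "same row count" and "works transparently" claims after a conversion is sketched below. `check_equivalent` is a hypothetical helper, not part of this PR. It relies only on `len()` and `get_row()`, as used above. Because the return type of `get_row()` is not shown in this description, the row comparison is passed in as a parameter; the default `operator.eq` is only suitable for rows that compare to a single boolean.

```python
import operator
from typing import Callable, Iterable


def check_equivalent(
    original,
    chunked,
    indices: Iterable[int],
    rows_equal: Callable = operator.eq,
) -> None:
    """Raise AssertionError if the two datasets differ in length or in any sampled row."""
    if len(original) != len(chunked):
        raise AssertionError(f"row count differs: {len(original)} != {len(chunked)}")
    for idx in indices:
        # Both objects only need to expose get_row(idx).
        if not rows_equal(original.get_row(idx), chunked.get_row(idx)):
            raise AssertionError(f"row {idx} differs")
```

For array-valued rows, a custom `rows_equal` that compares element-wise would be needed. Sampling indices near chunk boundaries (for example 99_999 and 100_000 with the default chunk size) is a reasonable way to exercise the chunk lookup.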

copy-pr-bot (bot) commented Jan 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai (bot, Contributor) commented Jan 19, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

polinabinder1 (Collaborator, Author) commented

/coderabbit

polinabinder1 (Collaborator, Author) commented

@CodeRabbit

coderabbitai (bot, Contributor) commented Jan 19, 2026

Tip

For best results, initiate chat on the files or code changes.

@polinabinder1, hello! How can I assist you with this pull request? Feel free to ask me to:

  • Review specific changes or files
  • Generate code or unit tests
  • Explain certain parts of the code
  • Search for information in the codebase
  • Or anything else you'd like help with!

coderabbitai (bot, Contributor) commented Jan 19, 2026

Tip

For best results, initiate chat on the files or code changes.

@polinabinder1, I'm here! How can I help you with this PR? I can:

  • Review the changes in this pull request
  • Explain specific code or changes
  • Generate unit tests for new functionality
  • Search the codebase for specific patterns or usage
  • Answer questions about the implementation
  • Create issues for follow-up work
  • Or anything else you need assistance with

Just let me know what you'd like me to do! 🐰
