[DeepSpeed-Chat] Fix OOM issue in dataloader #841
Open
youkaichao wants to merge 1 commit into deepspeedai:master from
Conversation
Author
@microsoft-github-policy-service agree

Author
Hi team, any feedback on this? 👀

Contributor
Hi @youkaichao - sorry we didn't get to this until now. Would you want to fix the merge conflicts?

Author
Feel free to take it over if you think this is still useful. I'm not working on this thread anymore. :)
Currently, DeepSpeed-Chat saves tokenized tensors directly to disk, which consumes hundreds of GB of storage. Each string is converted into `input_ids` and `attention_mask` tensors of length `max_seq_len`, stored as int32 or int64.
If we count about 2~3 characters per token, the tokenized tensors take on the order of hundreds of bytes of storage per token of input text (each token position costs two 4- or 8-byte integers, and every sample is padded out to `max_seq_len`). This is very problematic: when the prompt dataset grows (say to 1 GB), the on-disk dataset can reach hundreds of GB.
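A back-of-the-envelope sketch of the blow-up; the numbers below are illustrative assumptions, not measurements from DeepSpeed-Chat itself:

```python
# Illustrative storage math for one short prompt (all numbers are assumptions).
max_seq_len = 512        # padded sequence length
bytes_per_elem = 8       # int64 elements
tensors_per_sample = 2   # input_ids + attention_mask

prompt_chars = 200       # a short prompt, ~80 tokens at ~2.5 chars/token
text_bytes = prompt_chars                                         # raw UTF-8 text, ~1 B/char
tensor_bytes = max_seq_len * bytes_per_elem * tensors_per_sample  # 8192 B

print(f"raw text: {text_bytes} B, saved tensors: {tensor_bytes} B "
      f"({tensor_bytes / text_bytes:.0f}x blow-up)")              # ~41x in this example
```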
What's worse, DeepSpeed-Chat then loads all of this data into memory, which can require hundreds of GB of RAM.
In my experience, a 1.1 GB prompt dataset hits OOM on a 512 GB machine even with `max_seq_len` of just 512. Using 2048 as `max_seq_len` would need four times as much memory, i.e. about 2 TB :(
This PR saves only the raw strings and tokenizes them on the fly. The saved data is about the same size as the input dataset.
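A minimal sketch of the idea, with hypothetical names rather than the exact patch: keep the raw strings and move tokenization into `__getitem__`, so only text is ever stored on disk or held in memory.

```python
from torch.utils.data import Dataset

class OnTheFlyPromptDataset(Dataset):
    """Stores raw prompt strings and tokenizes lazily, one sample at a time."""

    def __init__(self, prompts, tokenizer, max_seq_len):
        self.prompts = prompts          # list[str]; ~same size as the input file
        self.tokenizer = tokenizer      # e.g. a HuggingFace tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        # Tokenization happens here, per sample, instead of ahead of time
        # for the whole dataset, so memory stays proportional to the batch.
        enc = self.tokenizer(
            self.prompts[idx],
            max_length=self.max_seq_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
        }
```

The trade-off is a little CPU work per batch (usually hidden by DataLoader workers) in exchange for on-disk and in-memory footprints that match the raw text size.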