Fix RuntimeError: CUDA error: out of memory on CPU transfer (2/3) by Rypo · Pull Request #150 · VectorSpaceLab/OmniGen

Rypo · 2024-11-28T02:19:01Z

Changes

Removed non_blocking=True from all .to("cpu") calls.
Slightly tweaked .synchronize() calls (saves ~10 s per 50 iterations when offloading)

Some environments, notably WSL, don't fully support memory pinning or concurrent CPU-GPU access. ¹ Removing non_blocking from the .to("cpu") calls resolves the unexpected CUDA OOM errors.

From my (limited) understanding of how non_blocking operates under the hood, this shouldn't negatively impact performance. ²
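As a minimal sketch of the change (not the actual OmniGen diff), a device-to-CPU copy without non_blocking looks like the following. The helper name to_cpu_blocking is hypothetical, and the snippet falls back to a CPU tensor when no GPU is present so it can run anywhere.

```python
import torch

def to_cpu_blocking(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: copy a tensor to host memory with a blocking transfer."""
    # No non_blocking=True here: the call returns only after the data has
    # arrived on the CPU, so the result is safe to read immediately.
    return t.to("cpu")

# Use the GPU when one is available; otherwise the copy is a no-op on CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.arange(8, dtype=torch.float32, device=device)
y = to_cpu_blocking(x)
print(y.device)
```

Because the blocking copy already waits for completion, no extra torch.cuda.synchronize() call is needed before reading y.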

In testing, I found the bf16 timings were actually 10-30s lower than those reported in the Different inference settings table, but other code changes I made beforehand may have influenced that as well.

Example of Error

import torch
device = torch.device('cuda:0')

def print_mem_free(device=None):
    # mem_get_info returns (free_bytes, total_bytes) for the device
    mem_free, mem_total = torch.cuda.mem_get_info(device)
    print(f'Mem Free: {mem_free/(1024**3):0.2f} GB')

print_mem_free(device)
# >>> Mem Free: 22.76 GB

r = torch.rand(1_000_000_000, dtype=torch.float32, device=device) # 4GB
print_mem_free(device)
# >>> Mem Free: 19.03 GB

r = r.to("cpu")
torch.cuda.empty_cache()
print_mem_free(device)
# >>> Mem Free: 22.76 GB

r = r.to(device)
print_mem_free(device)
# >>> Mem Free: 19.03 GB

r = r.to("cpu", non_blocking=True)
# >>>
#     RuntimeError: CUDA error: out of memory
#     CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
#     For debugging consider passing CUDA_LAUNCH_BLOCKING=1
#     Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
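The free-memory readings in the session above line up with the tensor's size. The quick check below is only arithmetic on the numbers already shown: the "4GB" comment is in decimal gigabytes, while print_mem_free divides by 1024**3.

```python
GIB = 1024 ** 3                 # bytes per binary gigabyte, as used by print_mem_free
n_elements = 1_000_000_000      # size passed to torch.rand in the example
bytes_per_float32 = 4

size_bytes = n_elements * bytes_per_float32
size_gib = size_bytes / GIB

# Drop in free memory reported in the example: 22.76 GB -> 19.03 GB
observed_drop = 22.76 - 19.03

print(f"{size_bytes / 1e9:.2f} GB (decimal), {size_gib:.2f} GiB (binary)")
```

So the failing transfer happens with roughly 19 GB of VRAM still free, which fits the description of the error as unrelated to the amount of VRAM available.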

This is the second of 3 PRs I'm issuing to improve performance/fix errors. I've tried to keep each incremental change as small in scope as possible. PRs: 1. #149, 2. This, 3. #151

Update (2024-12-02):

This PR now branches off main. It no longer depends on "Reduce initial pipeline load time by 4-5x (1/3)" #149. See "Adds support for 4bit (nf4) and 8bit bitsandbytes quantization (3/3)" #151 for discussion.

Removes the non_blocking argument from all device-to-CPU transfers. In certain environments (e.g. WSL), large transfers will throw a CUDA memory error regardless of the VRAM available. Adjusts stream synchronization for modest performance gains with cpu_offload. Fixes VectorSpaceLab#90, fixes VectorSpaceLab#117
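A rough sketch of the offload pattern this commit message describes, under stated assumptions: swap_layers and the two nn.Linear layers are hypothetical stand-ins rather than OmniGen code, and keeping non_blocking=True on the host-to-device copy is an assumption, since the PR only states that it was removed from the CPU-bound transfers.

```python
import torch
from torch import nn

def swap_layers(prev_layer: nn.Module, next_layer: nn.Module, device: torch.device) -> None:
    """Hypothetical helper: load next_layer onto `device`, offload prev_layer to CPU."""
    # Host-to-device copy: non_blocking is left on here (an assumption).
    next_layer.to(device, non_blocking=True)
    # Device-to-CPU copy: plain blocking transfer, no non_blocking argument.
    prev_layer.to("cpu")
    if device.type == "cuda":
        # Wait for any queued copies before next_layer is used.
        torch.cuda.synchronize(device)

# Falls back to CPU when no GPU is present so the sketch still runs.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
layer_a = nn.Linear(4, 4).to(device)   # layer currently resident on the device
layer_b = nn.Linear(4, 4)              # layer waiting on the CPU
swap_layers(layer_a, layer_b, device)
```

A single synchronize per swap, rather than one after every transfer, is one plausible reading of the "adjusts stream synchronize" change. The exact placement in the PR may differ.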

This was referenced Nov 28, 2024

Reduce initial pipeline load time by 4-5x (1/3) #149

Open

Adds support for 4bit (nf4) and 8bit bitsandbytes quantization (3/3) #151

Open

Rypo force-pushed the fix_non_blocking branch from 2fd6a5d to 7383566 Compare December 2, 2024 22:46

Rypo added 3 commits December 3, 2024 08:21

fix: revert layer offload iteration

0fa5f5d

Merge branch 'main' into fix_non_blocking

98679fa

Merge branch 'main' into fix_non_blocking

329b876

Rypo wants to merge 4 commits into VectorSpaceLab:main from Rypo:fix_non_blocking
