I am training a ~520M-parameter model, but I have found that the megablocks MoE version uses substantially more memory and takes longer to train than a dense model of corresponding size. I am using a model embedding dimension of 1536. The MoE model has 48 experts with 8 active and an expert size of 128. I set the load-balancing loss (lbl) weight to 0.001.
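For reference, here is a rough sketch of how I think about the per-layer expert parameter footprint with this config. It assumes a SwiGLU-style expert MLP with three weight matrices per expert; the MLP style is an assumption for illustration, not necessarily my exact architecture, and it ignores attention, embeddings, and optimizer state.

```python
# Per-layer expert parameter count for the MoE config described above:
# all expert weights stay resident in memory, while only top_k experts
# are used per token. SwiGLU (3 matrices per expert) is assumed here.

d_model = 1536     # embedding dimension
num_experts = 48   # total experts stored per MoE layer
top_k = 8          # active experts per token
expert_ffn = 128   # hidden width of each expert

def swiglu_params(d, ffn):
    # SwiGLU MLP: gate, up, and down projection matrices.
    return 3 * d * ffn

stored_per_layer = num_experts * swiglu_params(d_model, expert_ffn)
active_per_token = top_k * swiglu_params(d_model, expert_ffn)

print(f"Expert params stored per MoE layer : {stored_per_layer / 1e6:.1f}M")
print(f"Expert params active per token     : {active_per_token / 1e6:.1f}M")
```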