I am training a ~520M-parameter model, but I have found that the megablocks MoE version uses substantially more memory and takes longer to train than a dense model of corresponding size. I am using a model embedding dimension of 1536. The MoE model has 48 experts with 8 active and an expert size of 128. I set the load-balancing loss (lbl) weight to 0.001.
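For reference, here is a rough sketch of how I think about the per-layer expert parameter footprint with this config. It assumes a SwiGLU-style expert MLP with three weight matrices per expert; the MLP style is an assumption for illustration, not necessarily my exact architecture, and it ignores attention, embeddings, and optimizer state.

```python
# Per-layer expert parameter count for the MoE config described above:
# all expert weights stay resident in memory, while only top_k experts
# are used per token. SwiGLU (3 matrices per expert) is assumed here.

d_model = 1536     # embedding dimension
num_experts = 48   # total experts stored per MoE layer
top_k = 8          # active experts per token
expert_ffn = 128   # hidden width of each expert

def swiglu_params(d, ffn):
    # SwiGLU MLP: gate, up, and down projection matrices.
    return 3 * d * ffn

stored_per_layer = num_experts * swiglu_params(d_model, expert_ffn)
active_per_token = top_k * swiglu_params(d_model, expert_ffn)

print(f"Expert params stored per MoE layer : {stored_per_layer / 1e6:.1f}M")
print(f"Expert params active per token     : {active_per_token / 1e6:.1f}M")
```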