Hi there, thanks for the amazing work! I noticed that expert parallelism is asserted to be incompatible with the distributed optimizer in this fork of Megatron-LM:
https://github.com/stanford-futuredata/Megatron-LM/blob/85f95aef3b648075fe6f291c86714fdcbd9cd1f5/megatron/arguments.py#L352-L356
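For reference, the check at that link is roughly of the following form (this is my paraphrase of the linked lines, not a verbatim quote; the exact argument names, e.g. the expert-parallelism flag, may differ):

```python
# Paraphrase of the validation in the fork's megatron/arguments.py (see link above).
# Argument names here are my reading of the linked lines and may not match exactly.
if args.moe_expert_model_parallelism:
    assert not args.use_distributed_optimizer, \
        'Expert parallelism is not currently supported with the distributed optimizer.'
```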
However, there is no such validation in the open PR to upstream Megatron-LM: NVIDIA/Megatron-LM#288
Does that mean the assertion is redundant and the current version of megablocks is compatible with the distributed optimizer under expert parallelism?
Thanks very much.