[advoptm] Spectral Normalization for Muon Variants#1263
Koratahiu wants to merge 9 commits into Nerogar:master
Update 1: I have made the weight decay for this method static and "decoupled" from the learning rate.
Update 2:
I tested this, and it works very well. I used the same hyperparameters for LoRA/finetuning across SDXL, Chroma, and Zib; all of them trained successfully and delivered very solid results. I found the baseline for all of them to be:
From there, you can adjust as needed (e.g., a higher batch size requires a larger LR, BF16 needs a higher LR, etc.). The weight decay is constant and differs from standard implementations: it has the same effect regardless of whether the LR is high or low. The formula for it is:

❗ Note that this method does not work with DoRA, as DoRA has its own scaling, which conflicts with this approach. It also behaves unpredictably with OFT (not sure; it trains at 0.1 LR).

❕ For full finetuning, 1D vectors are trained with AuxAdam, so you should use standard AdamW LR and weight decay settings for those.
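For reference, here is a minimal sketch (my own paraphrase, not the PR's actual code) of the two ideas above: a weight decay that is not multiplied by the LR, and routing 1D tensors to an AdamW-style auxiliary path. The function names and the exact decay form are assumptions.

```python
import torch

def apply_decoupled_weight_decay(param: torch.Tensor, weight_decay: float) -> None:
    # Static, LR-independent decay (assumed form): w <- w - weight_decay * w.
    # Unlike the usual AdamW-style decay, this is NOT multiplied by the LR,
    # so its strength stays the same whether the LR is high or low.
    if weight_decay != 0.0:
        param.data.mul_(1.0 - weight_decay)

def split_param_groups(params):
    # Route 2D+ weight matrices to the Muon-style path and 1D tensors
    # (biases, norm scales) to an AdamW-style auxiliary path ("AuxAdam").
    matrix_params = [p for p in params if p.ndim >= 2]
    vector_params = [p for p in params if p.ndim < 2]
    return matrix_params, vector_params
```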
More Helpful Notes for LoRA Using This Method

1) Rank-Invariant Updates: Interestingly, using this method for LoRA completely cancels out the rank effect (assuming alpha = rank). In this scenario the rank cancels, allowing us to apply the full-finetuning scaling rule:

This achieves the same learning rate as full finetuning and results in rank-invariant updates.

2) Addressing the LoRA A-Matrix: Muon appears to be sub-optimal for LoRA because the A matrix is often extremely "flat" (it has an extreme aspect ratio), leading to unstable or "garbage" orthogonalization. Mathematically, we achieve:
These are my own findings, as the original paper did not experiment with LoRAs. Nonetheless, these results are very promising for LoRA/Muon combinations.
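One way to read the rank-invariance claim (my own worked illustration, not taken from the PR) is that a per-matrix spectral scale of the form sqrt(fan_out / fan_in), applied to both LoRA factors, makes the rank cancel in the product: B is (d_out × r) and A is (r × d_in), so sqrt(d_out/r) · sqrt(r/d_in) = sqrt(d_out/d_in), independent of r. The exact scale used by the PR may differ; this only shows why alpha = rank would give rank-invariant updates.

```python
import math

def spectral_scale(fan_out: int, fan_in: int) -> float:
    # Assumed form of the per-matrix spectral scale; the paper/PR may use
    # a variant such as sqrt(max(1, fan_out / fan_in)).
    return math.sqrt(fan_out / fan_in)

d_out, d_in = 4096, 1024
for rank in (8, 32, 128):
    scale_B = spectral_scale(d_out, rank)   # B: (d_out x rank)
    scale_A = spectral_scale(rank, d_in)    # A: (rank x d_in)
    combined = scale_B * scale_A            # rank cancels out
    print(rank, round(combined, 4), round(math.sqrt(d_out / d_in), 4))
```

For every rank, the combined scale prints the same value as sqrt(d_out / d_in), which is the full-finetuning scale for the underlying (d_out × d_in) weight.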
TL;DR: Tune Once, Train Anywhere
This PR implements the spectral normalization/scaling proposed in Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales (NeurIPS 2025) for Muon_adv and AdaMuon_adv. This method allows you to tune hyperparameters (LR, weight decay) just once, and they will transfer to any model size.
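As a rough sketch of what such spectral scaling looks like for a Muon-style step, assuming the common sqrt(fan_out / fan_in) form (the exact constant and shape convention used by Muon_adv/AdaMuon_adv follow the paper and may differ):

```python
import math
import torch

def spectrally_scaled_step(param: torch.Tensor, ortho_update: torch.Tensor, lr: float) -> None:
    # ortho_update is assumed to already be orthogonalized (e.g. via Newton-Schulz),
    # so its spectral norm is ~1. Scaling it by sqrt(fan_out / fan_in) (assumed form)
    # keeps the layer-wise effect of a given LR roughly constant across widths,
    # which is what lets one set of hyperparameters transfer across model sizes.
    fan_out, fan_in = ortho_update.shape[-2], ortho_update.shape[-1]
    scale = math.sqrt(fan_out / fan_in)
    param.data.add_(ortho_update, alpha=-lr * scale)
```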
Important Notes:
- Set alpha=rank to disable the internal scaling!

Other Notes:
More Info:
Koratahiu/Advanced_Optimizers#14