
[advoptm] Spectral Normalization for Muon Variants #1263

Open

Koratahiu wants to merge 9 commits into Nerogar:master from Koratahiu:SN_MUON

Conversation

@Koratahiu
Contributor

TL;DR: Tune Once, Train Anywhere

This PR implements the spectral normalization/scaling proposed in Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales (NeurIPS 2025) for Muon_adv and AdaMuon_adv.

This method allows you to tune hyperparameters (LR, Weight Decay) just once, and they will transfer to any model size (a minimal sketch of the scaling rule follows the list below).

  • Cross-Model: Train SD1.5, SDXL, and Flux using the exact same LR/WD.
  • Cross-Rank: A Rank 1 LoRA typically uses ~1e-3 LR. With this, you can use that same 1e-3 LR for Rank 1, Rank 128, and beyond.
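
To make the scaling concrete, here is a minimal sketch of the √(fan_out/fan_in) spectral scaling rule applied to matrix updates. The helper name is hypothetical and this is not the PR's actual code; it only illustrates why a tuned LR transfers across shapes.

```python
import math
import torch

def spectrally_scaled_update(update: torch.Tensor, lr: float) -> torch.Tensor:
    """Scale a 2D (matrix) update by sqrt(fan_out / fan_in).

    Because the factor depends only on the layer's shape, an LR tuned on one
    model or rank transfers to other widths. Hypothetical helper, not the
    PR's actual code path.
    """
    fan_out, fan_in = update.shape  # height, width of the weight matrix
    return lr * math.sqrt(fan_out / fan_in) * update

# The same lr=1e-3 yields shape-aware step sizes for differently sized layers.
small = spectrally_scaled_update(torch.randn(320, 320), lr=1e-3)
large = spectrally_scaled_update(torch.randn(1280, 5120), lr=1e-3)
```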

Important Notes:

  • LoRA Alpha: The built-in scaling of LoRA alpha negatively interacts with this method. Set alpha = rank to disable the internal scaling! (A short sketch of why follows this list.)
  • OFT: OFT uses matrices with extremely skewed dimensions (rank vs. total elements). It is still compatible, but the extreme aspect ratio is sub-optimal for matrix-aware optimizers like Muon: it will train, yet likely needs a different LR range than standard methods (0.1 LR worked for me).
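
As a quick illustration of the alpha point above (this is the standard LoRA convention, not this PR's code): the adapter's contribution is multiplied by alpha/rank, so setting alpha = rank makes that factor 1 and leaves only the spectral scaling in play.

```python
def lora_scale(alpha: float, rank: int) -> float:
    # Standard LoRA convention: the adapter's contribution is multiplied by alpha / rank.
    return alpha / rank

print(lora_scale(alpha=16, rank=128))   # 0.125 -> fights the spectral scaling
print(lora_scale(alpha=128, rank=128))  # 1.0   -> internal scaling disabled
```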

Other Notes:

  • Suggested Ranges: Start with 1e-3 LR for standard LoRAs/Finetunes. For OFT, you may need to go up to 0.1.
  • High Robustness: This method is extremely stable. You can often multiply the LR by 10x and still maintain very similar validation loss.
  • Unified Rates: You can typically use the exact same (or very similar) LR for both the UNet/DiT and Text Encoders.
  • Stable & Tested: This method has been tested on SDXL LoRA, OFT, and Full Finetuning with solid results. However, I am leaving this as a draft/dev version to collect more feedback.

More Info:
Koratahiu/Advanced_Optimizers#14

@Koratahiu
Contributor Author

Update 1:
I wouldn't recommend this method or the orthogonal optimizers for OFT; its shape and mechanics do not work well with them.
That said, I did train OFT at 0.1 LR: it trains, but the results are not optimal.

I have made the weight decay for this method static and "decoupled" from the learning rate.
The paper recommends a weight decay value of 0.1; however, this needs more testing.

@Koratahiu
Contributor Author

Update 2:

  • This is incompatible with DoRA (muP/spectral scaling conflicts with DoRA scaling).

  • The optimal weight decay is the value the paper found:
    0.1 (hyperparameter value) * 1/width (scaling rule)
    This worked very well for me for both LoRAs and finetunes.

@Koratahiu
Contributor Author

I tested this, and it works very well.

I used the same hyperparameters for LoRA/finetuning across SDXL, Chroma, and Zib; it trained successfully for all of them and delivered very solid results.

I’ve found the baseline for all of them to be:

  • Learning Rate (LR): 1e-3
  • Weight Decay: 0.1

From there, you can adjust as needed (e.g., a higher Batch Size requires a larger LR, BF16 needs a higher LR, etc.).

The weight decay is constant and differs from standard implementations, meaning it maintains the same effect regardless of whether the LR is high or low. The formula for it is:
Weight_decay (hyperparameter value) * 1/width (scaling rule)
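
A minimal sketch of what that decay rule can look like in code, assuming a plain in-place decay step (the helper is hypothetical, not the PR's exact implementation):

```python
import torch

def static_decoupled_weight_decay(weight: torch.Tensor, weight_decay: float = 0.1) -> None:
    """Weight decay decoupled from the learning rate.

    The hyperparameter (0.1) is scaled by 1/width, and no lr factor is
    applied, so the decay strength is the same whether the LR is high or
    low. Hypothetical sketch, not the PR's exact code.
    """
    width = weight.shape[1]                  # fan_in of the 2D weight matrix
    weight.mul_(1.0 - weight_decay / width)  # shrink in place, independent of lr
```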

❗ Note: this method does not work with DoRA, as DoRA has its own scaling which conflicts with this approach. It also behaves unpredictably with OFT (it does train at 0.1 LR, but I'm not confident in the results).

❕ For full finetuning, 1D vectors are trained using AuxAdam, so you should use standard AdamW LR and weight decay settings for those.
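
For reference, a hedged sketch of how such a split between matrix parameters and 1D parameters could look. The grouping and the example values are illustrative only; Muon_adv/AdaMuon_adv handle the 1D case internally via AuxAdam.

```python
import torch

def split_param_groups(model: torch.nn.Module) -> list[dict]:
    """Route 2D weight matrices to Muon-style updates and 1D tensors
    (biases, norm scales) to Adam-style settings. Illustrative only."""
    matrix_params, vector_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (matrix_params if p.ndim >= 2 else vector_params).append(p)
    return [
        {"params": matrix_params, "lr": 1e-3, "weight_decay": 0.1},   # spectral/Muon rules
        {"params": vector_params, "lr": 1e-4, "weight_decay": 1e-2},  # standard AdamW-style settings
    ]
```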

Koratahiu marked this pull request as ready for review · January 31, 2026 18:39
@Koratahiu
Contributor Author

More Helpful Notes for LoRA Using This Method

1) Rank-Invariant Updates:

It is interesting to note that using this method for LoRA completely cancels out the rank effect (assuming alpha = rank).
Its update rule can be simplified as:
ΔW = [A · √(height/rank)] · [B · √(rank/width)]

In this scenario, rank is cancelled out, allowing us to apply the full finetuning scaling rule:
ΔW = (A · B) · √(height/width)

This achieves the same learning rate as full finetuning and results in rank-invariant updates.
This leads to a universal, shared LR across all ranks, which aligns perfectly with the goal of this method: tuning once for all ranks and adapters.
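
A quick numeric check of the cancellation (plain arithmetic; the layer shape is chosen only for illustration):

```python
import math

height, width = 1280, 2048  # example layer shape (illustrative)

for rank in (1, 8, 32, 128):
    scale_A = math.sqrt(height / rank)   # factor applied to A
    scale_B = math.sqrt(rank / width)    # factor applied to B
    print(rank, round(scale_A * scale_B, 6))
# Every rank prints the same combined factor: sqrt(height / width) ~= 0.7906
```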

2) Addressing the LoRA A-Matrix

Muon appears to be sub-optimal for LoRA because the A matrix is often extremely "flat" (it has extreme dimensions), which leads to unstable or "garbage" orthogonalization.
However, spectral scaling seems to resolve this: the very high eps it calculates heavily dampens the orthogonalization, forcing Muon to behave more like normalized SGD (sketched numerically below).

Mathematically, we achieve:

  • Rank-invariant updates.
  • A stabilized A matrix.

These are my own findings, as the original paper did not experiment with LoRAs. Nonetheless, these results are very promising for LoRA/Muon combinations.
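
To illustrate the dampening intuition numerically, here is a small, self-contained Newton-Schulz sketch. It uses the commonly published Muon coefficients, and the eps magnitudes are exaggerated for illustration; they are not the values this PR computes.

```python
import torch

def newton_schulz(grad: torch.Tensor, eps: float, steps: int = 5) -> torch.Tensor:
    """Muon-style Newton-Schulz orthogonalization with eps added to the
    Frobenius-norm normalization. Illustrative sketch, not this PR's code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = grad / (grad.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# A "flat", LoRA-A-shaped gradient (rank 4 x width 512) with very uneven rows.
G = torch.randn(4, 512) * torch.tensor([[1.0], [0.1], [0.01], [0.001]])

def cosine_to_grad(update: torch.Tensor) -> float:
    return torch.cosine_similarity(update.flatten(), G.flatten(), dim=0).item()

print(cosine_to_grad(newton_schulz(G, eps=1e-7)))  # orthogonalization reshapes the update
print(cosine_to_grad(newton_schulz(G, eps=1e6)))   # huge eps: output stays ~parallel to G,
                                                   # i.e. a normalized-SGD-like step
```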
