Add probabilistic pretrain + GRPO RL pipeline with pluggable rewards and tracking (backward‑compatible) #1246

Draft

hcsolakoglu wants to merge 39 commits into SWivid:main from hcsolakoglu:rl-integration

Conversation

@hcsolakoglu
Contributor

With this PR, I'm integrating the RL workflow of F5R-TTS into F5-TTS while keeping the default deterministic behavior and checkpoint compatibility intact. The goal is to enable a two‑stage pipeline (Gaussian NLL warmup + GRPO RL fine‑tuning) with a modular reward system and opt‑in robustness improvements, without changing the default training or inference paths.

Key changes:

  • Probabilistic output head (proj_out_ln_sig) with gaussian_nll objective and backward‑compatible checkpoint loading (see the sketch after this list).
  • GRPO trainer and RL sampling utilities, with optional steps_plus_one and prompt‑length modes.
  • Pluggable reward system (RewardProvider, registry, combiner) + built‑in FunASR WER and WeSpeaker similarity providers (optional deps, lazy import, caching); a sketch follows the compatibility notes below.
  • Reward logging improvements and optional Trackio support (drop‑in for W&B).
  • Optional stability knobs for GRPO (rl.kl_eps, rl.density_eps) while keeping F5R‑parity defaults.
  • Dynamic batch sampler optimization to avoid materializing repeated batches in memory.
  • Extensive tests covering Gaussian head, checkpoint compatibility, RL training step, reward plugins, device handling, and new opt‑ins.
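
For readers skimming the diff, here is a minimal sketch of what the probabilistic head and the gaussian_nll objective could look like. proj_out_ln_sig is the layer name used in this PR; the remaining names (ProbabilisticHead, gaussian_nll) are illustrative and may not match the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticHead(nn.Module):
    """Illustrative dual projection: a mean head (as in the deterministic path)
    plus a log-sigma head, mirroring the proj_out_ln_sig layer added in this PR."""

    def __init__(self, dim: int, mel_dim: int):
        super().__init__()
        self.proj_out = nn.Linear(dim, mel_dim)         # predicted mean
        self.proj_out_ln_sig = nn.Linear(dim, mel_dim)  # predicted log-sigma (opt-in branch)

    def forward(self, hidden: torch.Tensor):
        return self.proj_out(hidden), self.proj_out_ln_sig(hidden)

def gaussian_nll(mean: torch.Tensor, log_sigma: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Gaussian negative log-likelihood of the target under N(mean, sigma^2)."""
    var = log_sigma.exp().pow(2)
    return F.gaussian_nll_loss(mean, target, var, reduction="mean")
```

With the default objective=mse only the mean path is used, so existing checkpoints presumably just skip or freshly initialize proj_out_ln_sig when loaded; that is what the backward‑compatible checkpoint loading above refers to.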

Notes on compatibility:

  • Defaults remain deterministic (output_dist=deterministic, objective=mse), so existing training/inference and checkpoints work unchanged.

  • All deviations from F5R behavior are opt‑in and documented in README_RL.md.

  • README_RL.md updated with a concise RL runbook, dataset prep, reward model fetch, and recommended opt‑ins.
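
As referenced in the key changes, here is a hedged sketch of how a pluggable reward registry and GRPO-style group-relative advantages typically fit together. The names below (RewardProvider as a Protocol, register_reward, combine_rewards, group_advantages) are illustrative and may not match the actual classes in this PR or README_RL.md.

```python
from typing import Dict, List, Protocol
import torch

class RewardProvider(Protocol):
    """Illustrative interface: score generated audio against references, one reward per sample."""
    def __call__(self, generated: List[torch.Tensor], reference: List[torch.Tensor]) -> torch.Tensor: ...

_REWARD_REGISTRY: Dict[str, RewardProvider] = {}

def register_reward(name: str, provider: RewardProvider) -> None:
    """Register a provider, e.g. a FunASR WER scorer or a WeSpeaker similarity scorer."""
    _REWARD_REGISTRY[name] = provider

def combine_rewards(generated: List[torch.Tensor], reference: List[torch.Tensor],
                    weights: Dict[str, float]) -> torch.Tensor:
    """Weighted sum of per-sample rewards from the registered providers."""
    total = torch.zeros(len(generated))
    for name, weight in weights.items():
        total = total + weight * _REWARD_REGISTRY[name](generated, reference)
    return total

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards within each rollout group so samples
    compete against their siblings instead of a learned value baseline.
    rewards has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

The rl.kl_eps and rl.density_eps knobs listed above presumably act on the GRPO loss itself (KL penalty and policy-density ratios) rather than on the rewards, so they are outside the scope of this sketch.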

@hcsolakoglu
Contributor Author

I have several ideas on how to initialize the probabilistic output head, so I will be implementing and testing multiple approaches. This is still a work in progress, but I have made significant headway. If anyone would like to guide the direction, feel free to run tests and share your feedback. @SWivid
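
For context, one common way to initialize such a head (purely illustrative here, not necessarily one of the approaches being tested) is to zero the log-sigma projection and give it a small constant negative bias, so the freshly added branch barely perturbs the pretrained deterministic behavior at step 0:

```python
import torch.nn as nn

def init_log_sigma_head(head: nn.Linear, initial_log_sigma: float = -5.0) -> None:
    """Illustrative: zero weights plus a constant negative bias make the predicted
    sigma start tiny, so the probabilistic model initially mimics the deterministic checkpoint."""
    nn.init.zeros_(head.weight)
    nn.init.constant_(head.bias, initial_log_sigma)
```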

@yuekaizhang
Contributor

@hcsolakoglu
Looking forward to your update! I tried running the F5R-TTS code, but I couldn't reproduce the same results as the paper.

The main issue is that the multiple samples from the rollout stage lack diversity. There's no difference in prosody or voice tone between them—it just sounds like the same audio with some random noise added.

When you finish the first-stage prob model training, I was wondering if you saw anything like this on your end?

@hcsolakoglu
Contributor Author

> @hcsolakoglu Looking forward to your update! I tried running the F5R-TTS code, but I couldn't reproduce the same results as the paper.
>
> The main issue is that the multiple samples from the rollout stage lack diversity. There's no difference in prosody or voice tone between them—it just sounds like the same audio with some random noise added.
>
> When you finish the first-stage prob model training, I was wondering if you saw anything like this on your end?

Honestly, I was planning to run a hyperparameter search and test different methods for this (head init, stage 1, and stage 2), but I haven't had the chance to do training runs beyond 100–200 steps. I'm actively working on multiple projects at the same time; when I have time, I'll run a proper hyperparameter search and share more detailed information here. In the meantime, everyone is free to test it. The code is largely complete; only detailed testing and ablation studies remain.
