
Conversation


@Bhoy1 commented Jan 25, 2026

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Adds first-class multi-agent support across the library with per-actor rollouts, scoring, and training.

  • Introduces Actor, Protocol, MultiAgentEnv, and MultiAgentRubric for multi-agent turn management, spawning, per-actor trajectory tagging, and per-actor rewards/advantages
  • MultiAgentEnv.generate() flattens game states into per-actor states and computes per-actor GRPO advantages (a rough sketch of this step follows the list); results now include actor_id
  • New MultiAgentOrchestrator drives training via Protocol.generate() and builds microbatches from flattened, per-actor trajectories
  • Exports updated in verifiers/__init__.py; eval_utils.save_rollout_results writes actor_id
  • Example environments: rock_paper_scissors (simultaneous moves with custom rollout) and twenty_questions (alternating turns, asymmetric actors), each with datasets and rubrics for per-actor rewards

Written by Cursor Bugbot for commit 1e5f474. This will update automatically on new commits.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

from .envs.actor import Actor # noqa # isort: skip
from .envs.protocol import EpisodeRequest, GenerateResult, Protocol # noqa # isort: skip
from .envs.multiagent_env import MultiAgentEnv # noqa # isort: skip
from .rubrics.multiagent_rubric import MultiAgentRubric # noqa # isort: skip


New multi-agent classes lack documentation updates

Medium Severity · Bugbot Rules

This PR adds major new user-facing classes (Actor, Protocol, MultiAgentEnv, MultiAgentRubric, MultiAgentOrchestrator) exported in __all__, but no corresponding documentation updates are included. Per the review rules, any PR adding core user-facing functionality needs to update relevant documentation in docs/environments.md, docs/training.md, and docs/reference.md.



env_name = inp.get("task") or self._get_default_env()
env = self.get_env(env_name)
if env.rubric:
    await env.rubric.score_rollout(state, score_sem=score_sem)


spawn() uses wrong scoring method for multi-agent rubrics

Medium Severity

The spawn() method calls score_rollout() for scoring when score=True (the default). However, MultiAgentRubric stores per-actor reward functions in actor_reward_funcs, which score_rollout() (inherited from parent Rubric) does not use—it only processes functions in self.funcs. Additionally, score_rollout() does not compute advantages, which are required for GRPO training. Spawned multi-agent states would have incomplete rewards and missing advantages.


@Bhoy1 marked this pull request as draft January 25, 2026 11:28
