Conversation


@mikasenghaas mikasenghaas commented Jan 23, 2026

Description

Two issues with the way we previously handled the GenerateOutputs type:

  • It first built a Dataset, which often ran into pyarrow issues when saving (e.g. on complex types like info).
  • Because the data was column-ordered, per-row handling was awkward; in particular, if one row failed to save, the entire save operation failed.

This PR refactors GenerateOutputs to contain row-ordered rollout outputs and adds new saving utilities that write the .jsonl outputs line by line, catching exceptions on a per-rollout basis so that as many outputs as possible are saved at all times. The exact output formats of results.jsonl and metadata.json are fully backwards-compatible and still work with prime eval push.
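A minimal sketch of the line-by-line saving described above (the function name and signature here are illustrative, not the PR's actual save_generate_outputs helper):

```python
import json
import logging

logger = logging.getLogger(__name__)


def save_rollouts_jsonl(rollouts: list[dict], path: str) -> tuple[int, int]:
    """Write rollouts to a .jsonl file one line at a time.

    Serialization failures are caught per rollout, so one bad row
    (e.g. an unserializable `info`) does not abort the whole save.
    Returns (saved, failed) counts.
    """
    saved, failed = 0, 0
    with open(path, "w", encoding="utf-8") as f:
        for rollout in rollouts:
            try:
                # Serialize first, then write: a failure in json.dumps
                # never leaves a partial line in the output file.
                line = json.dumps(rollout, ensure_ascii=False)
            except (TypeError, ValueError) as exc:
                logger.warning(
                    "Skipping rollout %s: %s", rollout.get("example_id"), exc
                )
                failed += 1
                continue
            f.write(line + "\n")
            saved += 1
    return saved, failed
```

Serializing with json.dumps before writing (rather than json.dump directly into the file handle) is what guarantees a failed row never corrupts the file with a half-written line.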

class RolloutOutput(TypedDict):
    """TypedDict for generation outputs."""

    example_id: int
    prompt: Messages
    completion: Messages
    answer: str
    task: str
    info: Info
    tools: dict
    reward: float
    metrics: dict[str, float]
    stop_condition: str | None
    is_truncated: bool
    timing: RolloutTiming
    error: Error | None

class GenerateOutputs(TypedDict):
    rollouts: list[RolloutOutput]
    metadata: GenerateMetadata

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

@mikasenghaas

@cursoragent review


cursor bot commented Jan 23, 2026

I pulled the PR diff and reviewed the changes across environment.py, types.py, eval_utils.py, and the new save_utils.py.

High-level

  • The shift to row-ordered rollouts and line-by-line JSONL saving is a solid improvement and aligns with the stated goals (avoids pyarrow issues, better resilience).
  • The new utilities (states_to_rollout_outputs, states_to_generate_metadata, save_generate_outputs) cleanly separate concerns and make the flow clearer.
  • Propagation of the new save_to_hf_hub/hf_hub_dataset_name through generate/evaluate looks good.

Blocking/critical fixes

  • Per-row save resilience is not implemented: save_to_disk writes all rollouts without a try/except per line. One bad rollout (e.g., unserializable info) will still abort the entire save. Please wrap each json.dump in a try/except, log the failure, and continue; ideally return/save counts of saved/failed rows.
  • info is not sanitized. Complex/non-JSON types inside info can still break json.dump. Add a safe serializer (e.g., default= in json.dump) and/or a recursive sanitization that:
    • converts numpy types to Python primitives
    • converts set/Path/datetime/exceptions to strings or structured forms
    • bounds very large blobs if needed
  • Type mismatches in RolloutOutput:
    • tools is declared dict but populated from state["oai_tools"] (a list). Make it list[ChatCompletionToolParam] | None (and default to [] or None) to match reality.
    • You append arbitrary state_columns into rollout_output[col], but RolloutOutput is a fixed TypedDict. Static type-checkers will flag this. Options:
      • Make RolloutOutput total=False and explicitly list optional keys you plan to include, or
      • Add an extras: dict[str, Any] field to hold selected columns, or
      • Change the annotation of rollouts to list[dict[str, Any]] if you prefer flexibility over strict typing.
  • Backward-compatibility caveat: Environment.make_dataset now returns Dataset.from_list(sanitize_rollouts(...)) with nested metrics and timing. Previously, make_dataset flattened metrics and exposed timing columns (generation_ms, etc.). If consumers rely on those columns (including downstream analytics/Hub viewers), consider flattening as before for this method to preserve behavior.
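
For illustration, the recursive sanitizer suggested above could take roughly this shape (function name and exact type coverage are assumptions, not the PR's code):

```python
import datetime
import pathlib
from typing import Any


def sanitize_value(value: Any) -> Any:
    """Recursively convert a value into a JSON-serializable form."""
    if isinstance(value, dict):
        return {str(k): sanitize_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_value(v) for v in value]
    if isinstance(value, set):
        # Sets have no JSON equivalent; emit a sorted list of strings.
        return sorted(str(v) for v in value)
    if isinstance(value, (pathlib.Path, datetime.datetime, datetime.date)):
        return str(value)
    if isinstance(value, BaseException):
        # Keep exceptions in a structured, inspectable form.
        return {"type": type(value).__name__, "message": str(value)}
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    if hasattr(value, "item"):
        # numpy scalars expose .item() to get the Python primitive.
        return value.item()
    return str(value)  # last resort: stringify unknown objects
```

Passing such a function as `default=` to json.dump would also work for the leaf cases, but a pre-pass like this additionally normalizes dictionary keys and nested containers.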

Non-blocking polish

  • sanitize_metadata drops date. If the saved metadata.json previously contained date and you’re claiming full backwards compatibility, consider keeping it (you already omit path_to_save, which makes sense).
  • File writes: specify encoding="utf-8" and ensure_ascii=False for json.dump; optionally flush() after each line.
  • Logging: after save, log counts of saved/failed rows; on per-row failure, include a short context (e.g., example_id) and continue.
  • Minor: tools_key could just build a tuple of names (no need to str() the sorted list).
  • Add brief docstrings to new public helpers in save_utils.py to clarify guarantees (e.g., “sanitized and JSON-serializable”).

Tests to add/update

  • Saving with a non-serializable object in info should save all other rows and log one failure.
  • Environment.make_dataset behavior (flattened vs nested) — whichever you choose, add a test to lock it in.
  • Round-trip: generated results.jsonl and metadata.json are consumable by prime eval push (as claimed).
  • Type-level: a test (and mypy run) for RolloutOutput shape after including state_columns.
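
The first test above could be sketched along these lines (using a stand-in writer defined in the test module, since the PR's actual helper isn't shown here):

```python
import json


def _save_jsonl(rows, path):
    """Stand-in for the PR's save helper: write rows line by line,
    skipping (not aborting on) rows that fail to serialize."""
    saved = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            try:
                line = json.dumps(row, ensure_ascii=False)
            except TypeError:
                continue
            f.write(line + "\n")
            saved += 1
    return saved


def test_unserializable_info_skips_only_that_row(tmp_path):
    rows = [
        {"example_id": 0, "info": {"ok": True}},
        {"example_id": 1, "info": {"bad": object()}},  # not JSON-serializable
        {"example_id": 2, "info": {"ok": True}},
    ]
    path = tmp_path / "results.jsonl"
    assert _save_jsonl(rows, path) == 2
    ids = [json.loads(line)["example_id"] for line in path.read_text().splitlines()]
    assert ids == [0, 2]
```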

Overall, the refactor is a nice simplification and sets you up for more robust persistence. Addressing per-row exception handling and the typing mismatches will make it production-safe and keep compatibility promises.


@mikasenghaas force-pushed the overhaul-results-saving branch from b9f35f5 to d089d2c (January 25, 2026, 12:16).