Conversation


@mikasenghaas mikasenghaas commented Jan 23, 2026

Description

Two issues with the way we previously handled the GenerateOutputs type:

  • It first built a Dataset, which often ran into pyarrow issues when saving (e.g. on complex types like info).
  • Because the data was column-ordered, per-row handling was awkward; in particular, if one row failed to save, the entire save operation failed.

This PR refactors GenerateOutputs to contain row-ordered rollout outputs and adds new saving utilities that write the .jsonl outputs line by line, catching exceptions on a per-rollout basis so that as many outputs as possible are saved at all times. The exact output formats of results.jsonl and metadata.json are fully backwards-compatible and still work with prime eval push.
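A minimal sketch of the line-by-line saving described above (the function name and signature here are illustrative, not the PR's actual save_generate_outputs helper):

```python
import json
import logging

logger = logging.getLogger(__name__)


def save_rollouts_jsonl(rollouts: list[dict], path: str) -> tuple[int, int]:
    """Write rollouts to a .jsonl file one line at a time.

    Serialization failures are caught per rollout, so one bad row
    (e.g. an unserializable `info`) does not abort the whole save.
    Returns (saved, failed) counts.
    """
    saved, failed = 0, 0
    with open(path, "w", encoding="utf-8") as f:
        for rollout in rollouts:
            try:
                # Serialize first, then write: a failure in json.dumps
                # never leaves a partial line in the output file.
                line = json.dumps(rollout, ensure_ascii=False)
            except (TypeError, ValueError) as exc:
                logger.warning(
                    "Skipping rollout %s: %s", rollout.get("example_id"), exc
                )
                failed += 1
                continue
            f.write(line + "\n")
            saved += 1
    return saved, failed
```

Serializing with json.dumps before writing (rather than json.dump directly into the file handle) is what guarantees a failed row never corrupts the file with a half-written line.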

class RolloutOutput(TypedDict):
    """TypedDict for generation outputs."""

    example_id: int
    prompt: Messages
    completion: Messages
    answer: str
    task: str
    info: Info
    tools: dict
    reward: float
    metrics: dict[str, float]
    stop_condition: str | None
    is_truncated: bool
    timing: RolloutTiming
    error: Error | None

class GenerateOutputs(TypedDict):
    rollouts: list[RolloutOutput]
    metadata: GenerateMetadata

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

@mikasenghaas

@cursoragent review


cursor bot commented Jan 23, 2026

I pulled the PR diff and reviewed the changes across environment.py, types.py, eval_utils.py, and the new save_utils.py.

High-level

  • The shift to row-ordered rollouts and line-by-line JSONL saving is a solid improvement and aligns with the stated goals (avoids pyarrow issues, better resilience).
  • The new utilities (states_to_rollout_outputs, states_to_generate_metadata, save_generate_outputs) cleanly separate concerns and make the flow clearer.
  • Propagation of the new save_to_hf_hub/hf_hub_dataset_name through generate/evaluate looks good.

Blocking/critical fixes

  • Per-row save resilience is not implemented: save_to_disk writes all rollouts without a try/except per line. One bad rollout (e.g., unserializable info) will still abort the entire save. Please wrap each json.dump in a try/except, log the failure, and continue; ideally return/save counts of saved/failed rows.
  • info is not sanitized. Complex/non-JSON types inside info can still break json.dump. Add a safe serializer (e.g., default= in json.dump) and/or a recursive sanitization that:
    • converts numpy types to Python primitives
    • converts set/Path/datetime/exceptions to strings or structured forms
    • bounds very large blobs if needed
  • Type mismatches in RolloutOutput:
    • tools is declared dict but populated from state["oai_tools"] (a list). Make it list[ChatCompletionToolParam] | None (and default to [] or None) to match reality.
    • You append arbitrary state_columns into rollout_output[col], but RolloutOutput is a fixed TypedDict. Static type-checkers will flag this. Options:
      • Make RolloutOutput total=False and explicitly list optional keys you plan to include, or
      • Add an extras: dict[str, Any] field to hold selected columns, or
      • Change the annotation of rollouts to list[dict[str, Any]] if you prefer flexibility over strict typing.
  • Backward-compatibility caveat: Environment.make_dataset now returns Dataset.from_list(sanitize_rollouts(...)) with nested metrics and timing. Previously, make_dataset flattened metrics and exposed timing columns (generation_ms, etc.). If consumers rely on those columns (including downstream analytics/Hub viewers), consider flattening as before for this method to preserve behavior.
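
For illustration, the recursive sanitizer suggested above could take roughly this shape (function name and exact type coverage are assumptions, not the PR's code):

```python
import datetime
import pathlib
from typing import Any


def sanitize_value(value: Any) -> Any:
    """Recursively convert a value into a JSON-serializable form."""
    if isinstance(value, dict):
        return {str(k): sanitize_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_value(v) for v in value]
    if isinstance(value, set):
        # Sets have no JSON equivalent; emit a sorted list of strings.
        return sorted(str(v) for v in value)
    if isinstance(value, (pathlib.Path, datetime.datetime, datetime.date)):
        return str(value)
    if isinstance(value, BaseException):
        # Keep exceptions in a structured, inspectable form.
        return {"type": type(value).__name__, "message": str(value)}
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    if hasattr(value, "item"):
        # numpy scalars expose .item() to get the Python primitive.
        return value.item()
    return str(value)  # last resort: stringify unknown objects
```

Passing such a function as `default=` to json.dump would also work for the leaf cases, but a pre-pass like this additionally normalizes dictionary keys and nested containers.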

Non-blocking polish

  • sanitize_metadata drops date. If the saved metadata.json previously contained date and you’re claiming full backwards compatibility, consider keeping it (you already omit path_to_save, which makes sense).
  • File writes: specify encoding="utf-8" and ensure_ascii=False for json.dump; optionally flush() after each line.
  • Logging: after save, log counts of saved/failed rows; on per-row failure, include a short context (e.g., example_id) and continue.
  • Minor: tools_key could just build a tuple of names (no need to str() the sorted list).
  • Add brief docstrings to new public helpers in save_utils.py to clarify guarantees (e.g., “sanitized and JSON-serializable”).

Tests to add/update

  • Saving with a non-serializable object in info should save all other rows and log one failure.
  • Environment.make_dataset behavior (flattened vs nested) — whichever you choose, add a test to lock it in.
  • Round-trip: generated results.jsonl and metadata.json are consumable by prime eval push (as claimed).
  • Type-level: a test (and mypy run) for RolloutOutput shape after including state_columns.
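
The first test above could be sketched along these lines (using a stand-in writer defined in the test module, since the PR's actual helper isn't shown here):

```python
import json


def _save_jsonl(rows, path):
    """Stand-in for the PR's save helper: write rows line by line,
    skipping (not aborting on) rows that fail to serialize."""
    saved = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            try:
                line = json.dumps(row, ensure_ascii=False)
            except TypeError:
                continue
            f.write(line + "\n")
            saved += 1
    return saved


def test_unserializable_info_skips_only_that_row(tmp_path):
    rows = [
        {"example_id": 0, "info": {"ok": True}},
        {"example_id": 1, "info": {"bad": object()}},  # not JSON-serializable
        {"example_id": 2, "info": {"ok": True}},
    ]
    path = tmp_path / "results.jsonl"
    assert _save_jsonl(rows, path) == 2
    ids = [json.loads(line)["example_id"] for line in path.read_text().splitlines()]
    assert ids == [0, 2]
```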

Overall, the refactor is a nice simplification and sets you up for more robust persistence. Addressing per-row exception handling and the typing mismatches will make it production-safe and keep compatibility promises.


@mikasenghaas force-pushed the overhaul-results-saving branch from b9f35f5 to d089d2c (January 25, 2026, 12:16).