diff --git a/plugins/compound-engineering/.claude-plugin/plugin.json b/plugins/compound-engineering/.claude-plugin/plugin.json
index 842ed85c..f6636a99 100644
--- a/plugins/compound-engineering/.claude-plugin/plugin.json
+++ b/plugins/compound-engineering/.claude-plugin/plugin.json
@@ -1,6 +1,6 @@
{
"name": "compound-engineering",
- "version": "2.30.0",
+ "version": "2.31.0",
"description": "AI-powered development tools. 29 agents, 25 commands, 16 skills, 1 MCP server for code review, research, design, and workflow automation.",
"author": {
"name": "Kieran Klaassen",
diff --git a/plugins/compound-engineering/CHANGELOG.md b/plugins/compound-engineering/CHANGELOG.md
index e55f5c76..02035b91 100644
--- a/plugins/compound-engineering/CHANGELOG.md
+++ b/plugins/compound-engineering/CHANGELOG.md
@@ -5,6 +5,30 @@ All notable changes to the compound-engineering plugin will be documented in thi
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [2.31.0] - 2026-02-12
+
+### Changed
+
+- **`/workflows:review` command** — Rewritten as v2 with context-managed map-reduce architecture
+ - **Architecture**: Phased File-Based Map-Reduce — sub-agents write full analysis to `.review/`, return ~100 tokens each to parent. Parent context stays under ~12k tokens vs ~30-50k in v1
+ - **6 phases**: Intent → Map (parallel specialists) → Validate → Judge (dedup + evidence-check) → Deep Analysis (stakeholder impact) → Reduce (todo creation)
+ - **Smart agent selection**: 5 always-run core agents + conditional agents based on PR file types (vs v1's "run ALL agents")
+ - **Cross-platform**: Uses `.review/` scratchpad (not `/tmp/`), Node.js validation (not Python3)
+ - **Judge phase**: Deduplicates findings, discards hallucinated file references, ranks by severity × confidence, caps at 15 findings
+ - **Deep Analysis phase**: Replaces v1's inline Ultra-Thinking with isolated sub-agent for stakeholder perspectives and scenario exploration on P1/P2 findings
+ - **Protected artifacts**: Excluded paths (docs/plans/, docs/solutions/) enforced in both agent prompts and judge phase
+ - **0-findings short-circuit**: Clean PRs get a summary without running deep analysis or todo creation
+
+### Fixed
+
+- Removed 4 ghost agent references (`rails-turbo-expert`, `dependency-detective`, `devops-harmony-analyst`, `code-philosopher`) that don't exist in the plugin
+- Added 5 missing agents to selection matrix: `kieran-typescript-reviewer`, `kieran-python-reviewer`, `julik-frontend-races-reviewer`, `schema-drift-detector`, `learnings-researcher`
+- Replaced Python3 JSON validation with Node.js (cross-platform — Python3 not guaranteed on all systems)
+- Fixed `.gitignore` append duplication on repeated runs (now uses `grep -q` guard)
+- Added excluded paths check to judge prompt (v1 only had it in agent prompts, not synthesis)
+
+---
+
## [2.30.0] - 2026-02-05
### Added
diff --git a/plugins/compound-engineering/commands/workflows/review.md b/plugins/compound-engineering/commands/workflows/review.md
index e6593034..62db4ce0 100644
--- a/plugins/compound-engineering/commands/workflows/review.md
+++ b/plugins/compound-engineering/commands/workflows/review.md
@@ -1,24 +1,36 @@
---
name: workflows:review
-description: Perform exhaustive code reviews using multi-agent analysis, ultra-thinking, and worktrees
+description: Perform exhaustive code reviews using multi-agent analysis with file-based synthesis to prevent context overflow
argument-hint: "[PR number, GitHub URL, branch name, or latest]"
---
-# Review Command
+# Review Command (v2 — Context-Managed)
- Perform exhaustive code reviews using multi-agent analysis, ultra-thinking, and Git worktrees for deep local inspection.
+Perform exhaustive code reviews using multi-agent analysis with file-based synthesis to prevent context overflow.
## Introduction
Senior Code Review Architect with expertise in security, performance, architecture, and quality assurance
+## Architecture: Why This Command Works
+
+This command uses the **Phased File-Based Map-Reduce** pattern — optimized for review quality while staying within context limits:
+
+1. **Intent Phase** (sequential) — A single agent analyzes the PR and produces a shared intent summary + architectural context. All specialists receive this so they start from the same understanding.
+2. **Map Phase** (parallel) — Specialist sub-agents write full analysis to `.review/`, return only a single sentence to the parent (~100 tokens each vs 2k-4k)
+3. **Validate** — Verifies all expected agent files exist (re-runs missing agents), then checks JSON schema compliance
+4. **Judge Phase** — A judge agent deduplicates, evidence-checks (discards hallucinated findings), ranks, and selects top-N
+5. **Deep Analysis Phase** — A sub-agent performs stakeholder impact and scenario analysis on P1/P2 findings, enriching the judged report
+6. **Reduce Phase** — The parent reads only `ENRICHED_FINDINGS.json` (~8k tokens max), then spawns todo-creation agents that read the actual code to produce high-quality, actionable todos
+
+This keeps the parent context under ~12k tokens of agent output regardless of how many agents run, while maximizing analytical depth at every stage.
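+
+For orientation, a sketch of what the `.review/` scratchpad might contain after a run with the five core specialists (specialist file names follow the `{agent_name}.json` convention; exact contents depend on which agents you launch):
+
+```
+.review/
+├── pr_meta.json                      # PR title, body, file list (gh pr view)
+├── pr_diff.txt                       # full diff (gh pr diff)
+├── PR_INTENT.md                      # Phase 0 shared context
+├── architecture-strategist.json      # one findings file per specialist (Map phase)
+├── security-sentinel.json
+├── performance-oracle.json
+├── code-simplicity-reviewer.json
+├── pattern-recognition-specialist.json
+├── JUDGED_FINDINGS.json              # Judge phase output
+└── ENRICHED_FINDINGS.json            # Deep Analysis output (the only file the parent reads)
+```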
+
## Prerequisites
- Git repository with GitHub CLI (`gh`) installed and authenticated
- Clean main/master branch
- Proper permissions to create worktrees and access the repository
-- For document reviews: Path to a markdown file or document
## Main Tasks
@@ -27,500 +39,545 @@ argument-hint: "[PR number, GitHub URL, branch name, or latest]"
#$ARGUMENTS
-
-First, I need to determine the review target type and set up the code for analysis.
-
-
-#### Immediate Actions:
-
-
- [ ] Determine review type: PR number (numeric), GitHub URL, file path (.md), or empty (current branch)
- [ ] Check current git branch
-- [ ] If ALREADY on the target branch (PR branch, requested branch name, or the branch already checked out for review) → proceed with analysis on current branch
-- [ ] If DIFFERENT branch than the review target → offer to use worktree: "Use git-worktree skill for isolated Call `skill: git-worktree` with branch name
-- [ ] Fetch PR metadata using `gh pr view --json` for title, body, files, linked issues
-- [ ] Set up language-specific analysis tools
-- [ ] Prepare security scanning environment
-- [ ] Make sure we are on the branch we are reviewing. Use gh pr checkout to switch to the branch or manually checkout the branch.
-
-Ensure that the code is ready for analysis (either in worktree or on current branch). ONLY then proceed to the next step.
-
+- [ ] If ALREADY on the target branch → proceed with analysis on current branch
+- [ ] If DIFFERENT branch → use git-worktree skill: `skill: git-worktree` with branch name
+- [ ] Fetch PR metadata using `gh pr view --json title,body,files,labels,milestone`
+- [ ] Get the diff: `gh pr diff ` — save to `.review/pr_diff.txt`
+- [ ] Make sure we are on the branch we are reviewing
-#### Protected Artifacts
+### 2. Prepare the Scratchpad Directory
-
-The following paths are compound-engineering pipeline artifacts and must never be flagged for deletion, removal, or gitignore by any review agent:
+
+Use a project-relative path, NOT /tmp/. The /tmp/ path causes two problems:
+1. Claude Code's Read tool and MCP filesystem tools cannot access /tmp/ (outside allowed directories)
+2. On Windows, /tmp/ resolves to different locations depending on the subprocess (MSYS2 vs literal C:\tmp), splitting agent files across directories
-- `docs/plans/*.md` — Plan files created by `/workflows:plan`. These are living documents that track implementation progress (checkboxes are checked off by `/workflows:work`).
-- `docs/solutions/*.md` — Solution documents created during the pipeline.
+Using .review/ inside the project avoids both issues. All tools (Read, Write, Bash, MCP) can access project-relative paths reliably.
+
-If a review agent flags any file in these directories for cleanup or removal, discard that finding during synthesis. Do not create a todo for it.
-
+```bash
+# Create the review session directory (project-relative, cross-platform safe)
+REVIEW_DIR=".review"
+rm -rf "$REVIEW_DIR"
+mkdir -p "$REVIEW_DIR"
+grep -qF "$REVIEW_DIR/" .gitignore 2>/dev/null || echo "$REVIEW_DIR/" >> .gitignore # Ensure it's gitignored (idempotent, fixed-string match)
+
+# Save PR metadata for agents to reference
+gh pr view --json title,body,files > "$REVIEW_DIR/pr_meta.json"
+gh pr diff > "$REVIEW_DIR/pr_diff.txt"
+```
-#### Parallel Agents to review the PR:
+Each sub-agent will READ from `$REVIEW_DIR/pr_diff.txt` and WRITE its findings to `$REVIEW_DIR/{agent_name}.json`.
-
+All references below use `.review/` — this is the ONLY path agents should use. Do not use `/tmp/` anywhere.
-Run ALL or most of these agents at the same time:
-
-1. Task kieran-rails-reviewer(PR content)
-2. Task dhh-rails-reviewer(PR title)
-3. If turbo is used: Task rails-turbo-expert(PR content)
-4. Task git-history-analyzer(PR content)
-5. Task dependency-detective(PR content)
-6. Task pattern-recognition-specialist(PR content)
-7. Task architecture-strategist(PR content)
-8. Task code-philosopher(PR content)
-9. Task security-sentinel(PR content)
-10. Task performance-oracle(PR content)
-11. Task devops-harmony-analyst(PR content)
-12. Task data-integrity-guardian(PR content)
-13. Task agent-native-reviewer(PR content) - Verify new features are agent-accessible
+### 3. PR Intent Analysis (Phase 0 — Sequential)
-
+
+Run this BEFORE launching specialist agents. This produces the shared context that all specialists receive, ensuring they start from the same understanding and spend their reasoning on domain-specific analysis rather than independently re-deriving PR intent.
+
-#### Conditional Agents (Run if applicable):
-
-
+```
+Task pr-intent-analyzer("
+You are a PR Intent Analyzer. Your job is to produce a concise, high-signal summary that specialist review agents will use as shared context.
+
+## Instructions:
+1. Read .review/pr_diff.txt and .review/pr_meta.json
+2. Write your analysis to .review/PR_INTENT.md (max 500 words)
+3. Your analysis MUST cover:
+ - **Intent**: What is this PR trying to accomplish? (2-3 sentences)
+ - **Approach**: How does it achieve this? Key architectural decisions made. (3-5 sentences)
+ - **Scope**: Files changed, components affected, boundaries of the change
+ - **Risk Surface**: Which areas carry the most risk — new code, refactored paths, changed interfaces, data model changes
+ - **Testing Gap**: What's tested vs what's not, based on test files in the diff
+4. Return to parent: 'Intent analysis complete. Written to .review/PR_INTENT.md'
+")
+```
-These agents are run ONLY when the PR matches specific criteria. Check the PR files list to determine if they apply:
+Wait for this to complete before launching specialists. The ~30 second cost pays for itself in higher-quality, less-redundant specialist findings.
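+
+The resulting `.review/PR_INTENT.md` should follow the five sections required above; a skeleton (placeholder text in angle brackets):
+
+```markdown
+# PR Intent: #<PR number> <PR title>
+
+**Intent**: <2-3 sentences on what the PR accomplishes>
+**Approach**: <3-5 sentences on how it does it and the key architectural decisions>
+**Scope**: <files changed, components affected, boundaries of the change>
+**Risk Surface**: <highest-risk areas: new code, refactored paths, changed interfaces, data model changes>
+**Testing Gap**: <what is tested vs. not, based on test files in the diff>
+```
+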
-**If PR contains database migrations (db/migrate/*.rb files) or data backfills:**
+### 4. Launch Review Agents (Map Phase)
-14. Task data-migration-expert(PR content) - Validates ID mappings match production, checks for swapped values, verifies rollback safety
-15. Task deployment-verification-agent(PR content) - Creates Go/No-Go deployment checklist with SQL verification queries
+
+EVERY sub-agent prompt MUST include these output constraints AND the shared intent context. This is what prevents context overflow while maximizing quality.
-**When to run migration agents:**
-- PR includes files matching `db/migrate/*.rb`
-- PR modifies columns that store IDs, enums, or mappings
-- PR includes data backfill scripts or rake tasks
-- PR changes how data is read/written (e.g., changing from FK to string column)
-- PR title/body mentions: migration, backfill, data transformation, ID mapping
+Append this to EVERY agent spawn prompt:
-**What these agents check:**
-- `data-migration-expert`: Verifies hard-coded mappings match production reality (prevents swapped IDs), checks for orphaned associations, validates dual-write patterns
-- `deployment-verification-agent`: Produces executable pre/post-deploy checklists with SQL queries, rollback procedures, and monitoring plans
+```
+## SHARED CONTEXT
+Read .review/PR_INTENT.md first for PR intent, approach, and risk surface. Do not re-derive what the PR does — focus your full reasoning on your specialist domain.
+
+## OUTPUT RULES (MANDATORY)
+1. Read the diff from .review/pr_diff.txt
+2. Write your FULL analysis as JSON to .review/{your_agent_name}.json
+3. Use this EXACT schema for findings (hard caps enforced):
+ {
+ "agent_type": "security|performance|architecture|rails|patterns|simplicity|...",
+ "summary": "<500 chars max — your key takeaway>",
+ "findings_count": ,
+ "critical_count": ,
+ "findings": [
+ {
+ "severity": "p1|p2|p3",
+ "category": "<50 chars>",
+ "title": "<100 chars>",
+ "location": "path/to/file.ext:line_number",
+ "description": "<300 chars>",
+ "recommendation": "<200 chars>",
+ "confidence": 0.0-1.0
+ }
+ ]
+ }
+4. Max 10 findings per agent. Prioritize by severity. No prose analysis fields.
+5. Only report findings with confidence >= 0.6. Quality over quantity.
+6. EXCLUDED PATHS — do not analyze or report findings for:
+ - docs/plans/*.md
+ - docs/solutions/*.md
+7. Every finding MUST reference a real file path and line number from the diff. Do not report findings without specific location evidence.
+8. Return ONLY this to the parent (do NOT return the full analysis):
+ "Review complete. Wrote findings ( critical) to .review/{agent_name}.json. Key takeaway: <1 sentence>"
+```
+
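+An agent output conforming to this schema might look like the following (a hypothetical example; the file path, finding, and scores are invented for illustration):
+
+```json
+{
+  "agent_type": "security",
+  "summary": "One high-confidence injection risk in the new query builder; input handling is otherwise solid.",
+  "findings_count": 1,
+  "critical_count": 1,
+  "findings": [
+    {
+      "severity": "p1",
+      "category": "injection",
+      "title": "User-supplied sort column interpolated into SQL",
+      "location": "app/models/report_query.rb:42",
+      "description": "The sort parameter is interpolated directly into the ORDER BY clause without allowlisting.",
+      "recommendation": "Map the sort parameter to an allowlisted set of column names before building the query.",
+      "confidence": 0.85
+    }
+  ]
+}
+```
+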
-
+#### Agent Selection — Choose Based on PR Content
-### 4. Ultra-Thinking Deep Dive Phases
+
+Do NOT run all agents on every PR. Examine the PR file list and select relevant agents:
- For each phase below, spend maximum cognitive effort. Think step by step. Consider all angles. Question assumptions. And bring all reviews in a synthesis to the user.
+**Always Run (Core — 5 agents):**
+1. `architecture-strategist` — structural patterns, coupling, abstractions, component boundaries
+2. `security-sentinel` — vulnerabilities, auth, input validation, injection
+3. `performance-oracle` — N+1 queries, memory leaks, scaling concerns, resource exhaustion
+4. `code-simplicity-reviewer` — unnecessary complexity, dead code, simplification opportunities
+5. `pattern-recognition-specialist` — anti-patterns, inconsistencies, convention violations
-
-Complete system context map with component interactions
-
+**Run If Applicable (language-specific):**
+- `kieran-rails-reviewer` — only if PR has `.rb` files
+- `dhh-rails-reviewer` — only if PR has `.rb` files
+- `kieran-typescript-reviewer` — only if PR has `.ts`/`.tsx` files
+- `kieran-python-reviewer` — only if PR has `.py` files
+- `julik-frontend-races-reviewer` — only if PR has Stimulus controllers or frontend async code
-#### Phase 3: Stakeholder Perspective Analysis
+**Run If Applicable (domain-specific):**
+- `agent-native-reviewer` — only if PR adds user-facing features
+- `data-integrity-guardian` — only if PR modifies database queries or data flows
+- `git-history-analyzer` — if PR touches files with complex change history or frequent churn
+- `schema-drift-detector` — if PR includes `schema.rb` changes, cross-reference against included migrations
+- `learnings-researcher` — if PR touches areas where `docs/solutions/` may have relevant prior art
- ULTRA-THINK: Put yourself in each stakeholder's shoes. What matters to them? What are their pain points?
+**Migration-Specific (only for db/migrate/*.rb files):**
+- `data-migration-expert` — validates ID mappings, rollback safety
+- `deployment-verification-agent` — Go/No-Go deployment checklist
+
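+A minimal sketch of deriving this selection from the PR file list, assuming `jq` is available (the checks mirror the matrix above; extend with the remaining conditional agents as needed):
+
+```bash
+# Core agents always run; conditional agents are added based on file types in the PR
+FILES=$(jq -r '.files[].path' .review/pr_meta.json)
+AGENTS="architecture-strategist security-sentinel performance-oracle code-simplicity-reviewer pattern-recognition-specialist"
+
+echo "$FILES" | grep -qE '\.rb$'               && AGENTS="$AGENTS kieran-rails-reviewer dhh-rails-reviewer"
+echo "$FILES" | grep -qE '\.tsx?$'             && AGENTS="$AGENTS kieran-typescript-reviewer"
+echo "$FILES" | grep -qE '\.py$'               && AGENTS="$AGENTS kieran-python-reviewer"
+echo "$FILES" | grep -qE '^db/migrate/.*\.rb$' && AGENTS="$AGENTS data-migration-expert deployment-verification-agent"
+echo "$FILES" | grep -qE '(^|/)schema\.rb$'    && AGENTS="$AGENTS schema-drift-detector"
+
+echo "Selected agents: $AGENTS"
+```
+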
-
+#### Launch Pattern
-1. **Developer Perspective**
+
+Launch selected agents in parallel. Each agent reads the diff from the scratchpad and writes its findings there.
- - How easy is this to understand and modify?
- - Are the APIs intuitive?
- - Is debugging straightforward?
- - Can I test this easily?
+Example for a TypeScript PR (no Rails, no migrations):
-2. **Operations Perspective**
+```
+Task architecture-strategist("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task security-sentinel("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task performance-oracle("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task code-simplicity-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task pattern-recognition-specialist("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task kieran-typescript-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+```
- - How do I deploy this safely?
- - What metrics and logs are available?
- - How do I troubleshoot issues?
- - What are the resource requirements?
+Example for a Rails PR with migrations:
-3. **End User Perspective**
+```
+Task kieran-rails-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task dhh-rails-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task security-sentinel("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task performance-oracle("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task architecture-strategist("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task data-migration-expert("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+Task schema-drift-detector("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES)
+```
- - Is the feature intuitive?
- - Are error messages helpful?
- - Is performance acceptable?
- - Does it solve my problem?
+Wait for ALL agents to complete.
+
-4. **Security Team Perspective**
+### 5. Verify and Validate Agent Outputs
- - What's the attack surface?
- - Are there compliance requirements?
- - How is data protected?
- - What are the audit capabilities?
+
+Two checks run before the judge: (1) verify every launched agent actually produced a file, (2) validate the JSON schema. This catches the silent-failure bug where an agent reports completion but its file is missing.
+
-5. **Business Perspective**
- - What's the ROI?
- - Are there legal/compliance risks?
- - How does this affect time-to-market?
- - What's the total cost of ownership?
+#### Step 5a: Verify All Expected Files Exist
-#### Phase 4: Scenario Exploration
+```bash
+# List expected agent files based on which agents were launched
+# (adjust this list to match the agents you actually spawned in Step 4)
+EXPECTED_AGENTS="architecture-strategist security-sentinel performance-oracle code-simplicity-reviewer pattern-recognition-specialist" # or whatever you launched
+
+MISSING=""
+for agent in $EXPECTED_AGENTS; do
+ if ! ls .review/${agent}*.json 1>/dev/null 2>&1; then
+ MISSING="$MISSING $agent"
+ echo "MISSING: $agent — no output file found in .review/"
+ fi
+done
+
+if [ -n "$MISSING" ]; then
+ echo "⚠️ Missing agent files:$MISSING"
+ echo "Re-run these agents before proceeding to judge."
+fi
+```
- ULTRA-THINK: Explore edge cases and failure scenarios. What could go wrong? How does the system behave under stress?
+**If any agent file is missing:** Re-launch that specific agent before proceeding. Do not skip to the judge — silent agent failures caused 10 findings (including 2 P1s) to be lost in testing.
-
+#### Step 5b: Validate JSON Schema
-- [ ] **Happy Path**: Normal operation with valid inputs
-- [ ] **Invalid Inputs**: Null, empty, malformed data
-- [ ] **Boundary Conditions**: Min/max values, empty collections
-- [ ] **Concurrent Access**: Race conditions, deadlocks
-- [ ] **Scale Testing**: 10x, 100x, 1000x normal load
-- [ ] **Network Issues**: Timeouts, partial failures
-- [ ] **Resource Exhaustion**: Memory, disk, connections
-- [ ] **Security Attacks**: Injection, overflow, DoS
-- [ ] **Data Corruption**: Partial writes, inconsistency
-- [ ] **Cascading Failures**: Downstream service issues
+```bash
+# Validate all agent output files (uses Node.js — guaranteed available since Claude Code requires it)
+for f in .review/*.json; do
+ [[ "$(basename "$f")" == "pr_meta.json" ]] && continue
+ node -e "
+const fs = require('fs');
+const f = '$f';
+const name = require('path').basename(f);
+try {
+ const data = JSON.parse(fs.readFileSync(f, 'utf8'));
+ if (!Array.isArray(data.findings)) throw new Error('findings not an array');
+ if (data.findings.length > 10) throw new Error('too many findings: ' + data.findings.length);
+ data.findings.forEach((finding, i) => {
+ if (!['p1','p2','p3'].includes(finding.severity)) throw new Error('finding ' + i + ': bad severity');
+ if (!finding.location) throw new Error('finding ' + i + ': missing location');
+ });
+ console.log('VALID:', name, '-', data.findings.length, 'findings');
+} catch (e) {
+ console.log('INVALID:', name, '-', e.message, '— removing from review');
+ fs.unlinkSync(f);
+}
+" 2>&1
+done
+```
-### 6. Multi-Angle Review Perspectives
+If an agent produced invalid output, it's removed before judging. The review continues with the valid agent outputs.
-#### Technical Excellence Angle
+**Checkpoint:** At this point, every launched agent should have a valid JSON file in `.review/`. If not, re-run the missing/invalid agents before proceeding.
-- Code craftsmanship evaluation
-- Engineering best practices
-- Technical documentation quality
-- Tooling and automation assessment
+### 6. Judge Phase — Deduplicate and Rank
-#### Business Value Angle
+
+Do NOT read individual agent JSON files into your context. Launch a JUDGE agent that reads them in its own context window.
+
-- Feature completeness validation
-- Performance impact on users
-- Cost-benefit analysis
-- Time-to-market considerations
+```
+Task judge-findings("
+You are a Code Review Judge. Consolidate findings from multiple review agents into a single, deduplicated, ranked report.
+
+## Instructions:
+1. Read ALL JSON files in .review/*.json (skip pr_meta.json and pr_diff.txt)
+2. Read .review/pr_meta.json to get the list of files actually in this PR
+3. Collect all findings across agents
+4. EVIDENCE CHECK: Discard any finding whose 'location' field does not reference a file path present in pr_meta.json. These are hallucinated findings.
+4b. EXCLUDED PATHS: Also discard any finding whose location is under docs/plans/ or docs/solutions/ — these are compound-engineering pipeline artifacts and must never be flagged.
+5. DEDUPLICATE: Remove semantically similar findings (keep the higher-confidence one)
+6. RESOLVE CONFLICTS: If agents contradict each other, note the conflict and pick the more evidence-based finding
+7. RANK by: severity (p1 > p2 > p3), then confidence score
+8. SELECT top 15 findings maximum. Only include findings with confidence >= 0.6.
+9. For each P1 finding, add a brief 'impact' field (<200 chars) assessing: deployment risk, user impact, and attack surface.
+10. Write the consolidated report to .review/JUDGED_FINDINGS.json using this schema:
+
+{
+ \"pr_number\": ,
+ \"total_raw_findings\": ,
+ \"duplicates_removed\": ,
+ \"hallucinated_removed\": ,
+ \"conflicts_resolved\": ,
+ \"final_findings\": [
+ {
+ \"id\": 1,
+ \"severity\": \"p1|p2|p3\",
+ \"category\": \"<50 chars>\",
+ \"title\": \"<100 chars>\",
+ \"location\": \"path/to/file.ext:line\",
+ \"description\": \"<300 chars>\",
+ \"recommendation\": \"<200 chars>\",
+ \"confidence\": 0.0-1.0,
+ \"source_agents\": [\"agent1\", \"agent2\"],
+ \"effort\": \"small|medium|large\",
+ \"impact\": \"<200 chars, P1 only, null for P2/P3>\"
+ }
+ ],
+ \"agent_summaries\": [
+ {\"agent\": \"name\", \"summary\": \"\"}
+ ]
+}
+
+11. Return to parent: 'Judging complete. <raw> raw findings → <final> after dedup (<N> hallucinated removed). P1: <count>, P2: <count>, P3: <count>. Report at .review/JUDGED_FINDINGS.json'
+")
+```
-#### Risk Management Angle
+**Short-circuit: If the judge returns 0 final findings**, skip Deep Analysis and Todo Creation. Present a clean summary:
-- Security risk assessment
-- Operational risk evaluation
-- Compliance risk verification
-- Technical debt accumulation
+```markdown
+## ✅ Code Review Complete — No Findings
-#### Team Dynamics Angle
+**Review Target:** PR #XXXX - [PR Title]
+**Branch:** [branch-name]
-- Code review etiquette
-- Knowledge sharing effectiveness
-- Collaboration patterns
-- Mentoring opportunities
+All [N] agents reviewed the PR. [X] raw findings were raised, but all were either hallucinated (wrong file paths), duplicates, or below confidence threshold after judging. No action items.
+```
-### 4. Simplification and Minimalism Review
+### 7. Deep Analysis Phase (P1/P2 Enrichment)
-Run the Task code-simplicity-reviewer() to see if we can simplify the code.
+
+After judging, spawn a deep-analysis agent to enrich findings. This runs in its own context window so it doesn't cost the parent anything, but it produces significantly richer findings for todo creation.
-### 5. Findings Synthesis and Todo Creation Using file-todos Skill
+This agent performs the Ultra-Thinking deep dive from the original review command — stakeholder perspectives, scenario exploration, and multi-angle analysis — but scoped to the judged findings rather than the raw diff, and in an isolated context window rather than the parent.
+
- ALL findings MUST be stored in the todos/ directory using the file-todos skill. Create todo files immediately after synthesis - do NOT present findings for user approval first. Use the skill for structured todo management.
+```
+Task deep-analysis("
+You are a Deep Analysis Reviewer. Your job is to enrich judged code review findings with deep stakeholder analysis, scenario exploration, and concrete fix suggestions.
-#### Step 1: Synthesize All Findings
+## Instructions:
+1. Read .review/JUDGED_FINDINGS.json
+2. Read .review/pr_diff.txt for code context
+3. Read .review/PR_INTENT.md for PR context
-
-Consolidate all agent reports into a categorized list of findings.
-Remove duplicates, prioritize by severity and impact.
-
+## For each P1 finding, perform ALL of the following:
-
+### Stakeholder Impact Analysis
+Put yourself in each stakeholder's shoes. Assess impact from each perspective:
-- [ ] Collect findings from all parallel agents
-- [ ] Discard any findings that recommend deleting or gitignoring files in `docs/plans/` or `docs/solutions/` (see Protected Artifacts above)
-- [ ] Categorize by type: security, performance, architecture, quality, etc.
-- [ ] Assign severity levels: 🔴 CRITICAL (P1), 🟡 IMPORTANT (P2), 🔵 NICE-TO-HAVE (P3)
-- [ ] Remove duplicate or overlapping findings
-- [ ] Estimate effort for each finding (Small/Medium/Large)
+- **Developer**: How easy is the affected code to understand and modify? Are the APIs intuitive? Is debugging straightforward? Can this be tested easily?
+- **Operations**: How do I deploy this safely? What metrics and logs are available? How do I troubleshoot if this goes wrong? What are the resource requirements?
+- **End User**: Is the feature intuitive? Are error messages helpful? Is performance acceptable? Does it solve their problem?
+- **Security**: What's the attack surface? Are there compliance requirements? How is data protected? What are the audit capabilities?
+- **Business**: What's the ROI impact? Are there legal/compliance risks? How does this affect time-to-market?
-
+Add a 'stakeholder_impact' object with a field for each perspective (each 1-3 sentences).
-#### Step 2: Create Todo Files Using file-todos Skill
+### Scenario Exploration
+Think through edge cases and failure scenarios. For the code involved in this finding, consider:
- Use the file-todos skill to create todo files for ALL findings immediately. Do NOT present findings one-by-one asking for user approval. Create all todo files in parallel using the skill, then summarize results to user.
+- Invalid/malformed inputs (null, empty, unexpected types)
+- Boundary conditions (min/max values, empty collections)
+- Concurrent access (race conditions, deadlocks)
+- Scale pressure (10x, 100x normal load)
+- Network issues (timeouts, partial failures)
+- Resource exhaustion (memory, disk, connections)
+- Security vectors (injection, overflow, DoS)
+- Data corruption (partial writes, inconsistency)
+- Cascading failures (downstream service issues)
-**Implementation Options:**
+Add a 'scenarios' array listing the 3-5 most relevant failure scenarios with specific trigger conditions and expected impact.
-**Option A: Direct File Creation (Fast)**
+### Suggested Fix
+Add a 'suggested_fix' field with a concrete code-level suggestion for resolution (<500 chars).
-- Create todo files directly using Write tool
-- All findings in parallel for speed
-- Use standard template from `.claude/skills/file-todos/assets/todo-template.md`
-- Follow naming convention: `{issue_id}-pending-{priority}-{description}.md`
+## For each P2 finding:
+- Add 'stakeholder_impact' with Developer and Operations perspectives (1-2 sentences each)
+- Add 'scenarios' with 1-2 most relevant failure scenarios
+- Add 'suggested_fix' (<300 chars)
-**Option B: Sub-Agents in Parallel (Recommended for Scale)** For large PRs with 15+ findings, use sub-agents to create finding files in parallel:
+## For P3 findings:
+- Pass through unchanged
-```bash
-# Launch multiple finding-creator agents in parallel
-Task() - Create todos for first finding
-Task() - Create todos for second finding
-Task() - Create todos for third finding
-etc. for each finding.
+## Output:
+Write the enriched report to .review/ENRICHED_FINDINGS.json (same schema as JUDGED_FINDINGS.json, plus the new fields).
+Return to parent: 'Deep analysis complete. Enriched P1 and P2 findings. Written to .review/ENRICHED_FINDINGS.json'
+")
```
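+
+For illustration, an enriched P1 finding might gain fields like these (values and sub-field names are invented; other judged-schema fields are omitted for brevity):
+
+```json
+{
+  "id": 1,
+  "severity": "p1",
+  "title": "User-supplied sort column interpolated into SQL",
+  "location": "app/models/report_query.rb:42",
+  "stakeholder_impact": {
+    "developer": "Column names are assembled from params in several places, making the query builder hard to audit.",
+    "operations": "Exploit attempts would surface as slow ORDER BY queries; no current alert covers this path.",
+    "end_user": "Report pages could be degraded or taken offline during an attack.",
+    "security": "Direct SQL injection vector on an authenticated endpoint.",
+    "business": "A data-exposure incident here carries compliance and reputational risk."
+  },
+  "scenarios": [
+    "Attacker passes a crafted sort parameter to read arbitrary columns",
+    "An unexpected but legitimate column name raises a 500 on the reports page"
+  ],
+  "suggested_fix": "Replace interpolation with a lookup against an allowlisted map of sort keys to column names."
+}
+```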
-Sub-agents can:
-
-- Process multiple findings simultaneously
-- Write detailed todo files with all sections filled
-- Organize findings by severity
-- Create comprehensive Proposed Solutions
-- Add acceptance criteria and work logs
-- Complete much faster than sequential processing
-
-**Execution Strategy:**
-
-1. Synthesize all findings into categories (P1/P2/P3)
-2. Group findings by severity
-3. Launch 3 parallel sub-agents (one per severity level)
-4. Each sub-agent creates its batch of todos using the file-todos skill
-5. Consolidate results and present summary
-
-**Process (Using file-todos Skill):**
+### 8. Synthesis — Read ONLY the Enriched Findings
-1. For each finding:
+
+NOW read `.review/ENRICHED_FINDINGS.json` — this is the ONLY agent output file you read into your context. This file is pre-deduplicated, evidence-checked, ranked, enriched with impact analysis, and capped at ~15 findings.
+
- - Determine severity (P1/P2/P3)
- - Write detailed Problem Statement and Findings
- - Create 2-3 Proposed Solutions with pros/cons/effort/risk
- - Estimate effort (Small/Medium/Large)
- - Add acceptance criteria and work log
+Read the enriched findings file and proceed to create todos.
-2. Use file-todos skill for structured todo management:
+### 9. Create Todo Files
- ```bash
- skill: file-todos
- ```
+
+ALL findings from ENRICHED_FINDINGS.json MUST be stored in the todos/ directory. Create todo files immediately — do NOT present findings for user approval first.
+
- The skill provides:
+#### Execution: LLM-Powered Todo Creation (Sub-Agents)
- - Template location: `.claude/skills/file-todos/assets/todo-template.md`
- - Naming convention: `{issue_id}-{status}-{priority}-{description}.md`
- - YAML frontmatter structure: status, priority, issue_id, tags, dependencies
- - All required sections: Problem Statement, Findings, Solutions, etc.
+Todo creation benefits from LLM reasoning — agents can read the actual code at each finding location to write meaningful Problem Statements, generate real Proposed Solutions with pros/cons, and craft acceptance criteria specific to the codebase.
-3. Create todo files in parallel:
+Launch parallel sub-agents grouped by severity:
- ```bash
- {next_id}-pending-{priority}-{description}.md
- ```
-
-4. Examples:
+```
+Task create-todos-p1("
+Create todo files in todos/ for these P1 findings from the code review.
+
+## Instructions:
+1. Read .review/ENRICHED_FINDINGS.json — process only P1 findings
+2. For each P1 finding, read the actual source code at the finding's location to understand full context
+3. Create a todo file using the file-todos skill template: .claude/skills/file-todos/assets/todo-template.md
+4. File naming: {id}-pending-p1-{slug}.md (id zero-padded to 3 digits, slug from title)
+5. Fill in ALL sections with substantive content:
+ - Problem Statement: explain the issue in context of the actual code, not just the finding description
+ - Proposed Solutions: 2-3 real options with pros/cons/effort/risk based on the codebase
+ - Acceptance Criteria: specific, testable checklist items
+ - Work Log: initial entry with today's date
+6. Tag all with 'code-review' plus relevant domain tags
+")
+
+Task create-todos-p2("
+Create todo files in todos/ for these P2 findings from the code review.
+
+## Instructions:
+1. Read .review/ENRICHED_FINDINGS.json — process only P2 findings
+2. For each P2 finding, read the actual source code at the finding's location
+3. Create todo files using the file-todos skill template
+4. File naming: {id}-pending-p2-{slug}.md
+5. Fill in all sections with substantive content
+6. Tag all with 'code-review' plus relevant domain tags
+")
+
+Task create-todos-p3("
+Create todo files in todos/ for these P3 findings from the code review.
+
+## Instructions:
+1. Read .review/ENRICHED_FINDINGS.json — process only P3 findings
+2. For each P3 finding, create a lighter-weight todo file
+3. File naming: {id}-pending-p3-{slug}.md
+4. Problem Statement and Recommendation are sufficient — Proposed Solutions can be brief
+5. Tag all with 'code-review' plus relevant domain tags
+")
+```
- ```
- 001-pending-p1-path-traversal-vulnerability.md
- 002-pending-p1-api-response-validation.md
- 003-pending-p2-concurrency-limit.md
- 004-pending-p3-unused-parameter.md
- ```
+**File naming convention:**
+```
+{issue_id}-pending-{priority}-{description}.md
-5. Follow template structure from file-todos skill: `.claude/skills/file-todos/assets/todo-template.md`
+Examples:
+- 001-pending-p1-path-traversal-vulnerability.md
+- 002-pending-p1-api-response-validation.md
+- 003-pending-p2-concurrency-limit.md
+- 004-pending-p3-unused-parameter.md
+```
-**Todo File Structure (from template):**
+**Todo file structure (from file-todos skill template):**
Each todo must include:
-
- **YAML frontmatter**: status, priority, issue_id, tags, dependencies
- **Problem Statement**: What's broken/missing, why it matters
- **Findings**: Discoveries from agents with evidence/location
- **Proposed Solutions**: 2-3 options, each with pros/cons/effort/risk
-- **Recommended Action**: (Filled during triage, leave blank initially)
-- **Technical Details**: Affected files, components, database changes
+- **Recommended Action**: Leave blank initially
+- **Technical Details**: Affected files, components
- **Acceptance Criteria**: Testable checklist items
-- **Work Log**: Dated record with actions and learnings
-- **Resources**: Links to PR, issues, documentation, similar patterns
+- **Work Log**: Dated record
+- **Resources**: Links to PR, issues, documentation
-**File naming convention:**
+**Status values:** `pending` → `ready` → `complete`
+**Priority values:** `p1` (blocks merge) | `p2` (should fix) | `p3` (nice-to-have)
+**Tagging:** Always add `code-review` tag, plus relevant domain tags
-```
-{issue_id}-{status}-{priority}-{description}.md
+### 10. Summary Report
-Examples:
-- 001-pending-p1-security-vulnerability.md
-- 002-pending-p2-performance-optimization.md
-- 003-pending-p3-code-cleanup.md
-```
+After creating all todo files, present:
-**Status values:**
-
-- `pending` - New findings, needs triage/decision
-- `ready` - Approved by manager, ready to work
-- `complete` - Work finished
-
-**Priority values:**
-
-- `p1` - Critical (blocks merge, security/data issues)
-- `p2` - Important (should fix, architectural/performance)
-- `p3` - Nice-to-have (enhancements, cleanup)
-
-**Tagging:** Always add `code-review` tag, plus: `security`, `performance`, `architecture`, `rails`, `quality`, etc.
-
-#### Step 3: Summary Report
-
-After creating all todo files, present comprehensive summary:
-
-````markdown
+```markdown
## ✅ Code Review Complete
-**Review Target:** PR #XXXX - [PR Title] **Branch:** [branch-name]
+**Review Target:** PR #XXXX - [PR Title]
+**Branch:** [branch-name]
### Findings Summary:
-
-- **Total Findings:** [X]
-- **🔴 CRITICAL (P1):** [count] - BLOCKS MERGE
-- **🟡 IMPORTANT (P2):** [count] - Should Fix
-- **🔵 NICE-TO-HAVE (P3):** [count] - Enhancements
+- **Total Raw Findings:** [X] (from [N] agents)
+- **Hallucinated/Invalid Removed:** [Z]
+- **After Dedup/Judging:** [Y]
+- **🔴 P1 (BLOCKS MERGE):** [count]
+- **🟡 P2 (Should Fix):** [count]
+- **🔵 P3 (Nice-to-Have):** [count]
### Created Todo Files:
-**P1 - Critical (BLOCKS MERGE):**
-
-- `001-pending-p1-{finding}.md` - {description}
-- `002-pending-p1-{finding}.md` - {description}
-
-**P2 - Important:**
-
-- `003-pending-p2-{finding}.md` - {description}
-- `004-pending-p2-{finding}.md` - {description}
-
-**P3 - Nice-to-Have:**
+**P1 — Critical (BLOCKS MERGE):**
+- `001-pending-p1-{finding}.md` — {description}
-- `005-pending-p3-{finding}.md` - {description}
+**P2 — Important:**
+- `003-pending-p2-{finding}.md` — {description}
-### Review Agents Used:
+**P3 — Nice-to-Have:**
+- `005-pending-p3-{finding}.md` — {description}
-- kieran-rails-reviewer
-- security-sentinel
-- performance-oracle
-- architecture-strategist
-- agent-native-reviewer
-- [other agents]
+### Agents Used:
+- [list of agents that ran]
### Next Steps:
-
-1. **Address P1 Findings**: CRITICAL - must be fixed before merge
-
- - Review each P1 todo in detail
- - Implement fixes or request exemption
- - Verify fixes before merging PR
-
-2. **Triage All Todos**:
- ```bash
- ls todos/*-pending-*.md # View all pending todos
- /triage # Use slash command for interactive triage
- ```
-````
-
-3. **Work on Approved Todos**:
-
- ```bash
- /resolve_todo_parallel # Fix all approved items efficiently
- ```
-
-4. **Track Progress**:
- - Rename file when status changes: pending → ready → complete
- - Update Work Log as you work
- - Commit todos: `git add todos/ && git commit -m "refactor: add code review findings"`
-
-### Severity Breakdown:
-
-**🔴 P1 (Critical - Blocks Merge):**
-
-- Security vulnerabilities
-- Data corruption risks
-- Breaking changes
-- Critical architectural issues
-
-**🟡 P2 (Important - Should Fix):**
-
-- Performance issues
-- Significant architectural concerns
-- Major code quality problems
-- Reliability issues
-
-**🔵 P3 (Nice-to-Have):**
-
-- Minor improvements
-- Code cleanup
-- Optimization opportunities
-- Documentation updates
-
+1. **Address P1 Findings** — must fix before merge
+2. **Triage todos:** `ls todos/*-pending-*.md`
+3. **Work on approved items:** `/resolve_todo_parallel`
```
-### 7. End-to-End Testing (Optional)
+### 11. End-to-End Testing (Optional)
-**First, detect the project type from PR files:**
-
| Indicator | Project Type |
|-----------|--------------|
-| `*.xcodeproj`, `*.xcworkspace`, `Package.swift` (iOS) | iOS/macOS |
-| `Gemfile`, `package.json`, `app/views/*`, `*.html.*` | Web |
-| Both iOS files AND web files | Hybrid (test both) |
+| `*.xcodeproj`, `*.xcworkspace`, `Package.swift` | iOS/macOS |
+| `Gemfile`, `package.json`, `app/views/*` | Web |
+| Both | Hybrid |
-
-
-After presenting the Summary Report, offer appropriate testing based on project type:
-
-**For Web Projects:**
-```markdown
-**"Want to run browser tests on the affected pages?"**
-1. Yes - run `/test-browser`
-2. No - skip
-```
-
-**For iOS Projects:**
-```markdown
-**"Want to run Xcode simulator tests on the app?"**
-1. Yes - run `/xcode-test`
-2. No - skip
-```
-
-**For Hybrid Projects (e.g., Rails + Hotwire Native):**
-```markdown
-**"Want to run end-to-end tests?"**
-1. Web only - run `/test-browser`
-2. iOS only - run `/xcode-test`
-3. Both - run both commands
-4. No - skip
-```
-
-
+After presenting the Summary Report, offer testing based on project type:
-#### If User Accepts Web Testing:
+**Web:** "Run browser tests? → `/test-browser`"
+**iOS:** "Run simulator tests? → `/xcode-test`"
+**Hybrid:** "Web only / iOS only / Both / Skip"
-Spawn a subagent to run browser tests (preserves main context):
+If accepted, spawn as sub-agent to preserve main context:
```
-Task general-purpose("Run /test-browser for PR #[number]. Test all affected pages, check for console errors, handle failures by creating todos and fixing.")
+Task general-purpose("Run /test-browser for PR #[number]. Test affected pages, check for errors, create P1 todos for failures.")
```
-The subagent will:
-1. Identify pages affected by the PR
-2. Navigate to each page and capture snapshots (using Playwright MCP or agent-browser CLI)
-3. Check for console errors
-4. Test critical interactions
-5. Pause for human verification on OAuth/email/payment flows
-6. Create P1 todos for any failures
-7. Fix and retry until all tests pass
-
-**Standalone:** `/test-browser [PR number]`
-
-#### If User Accepts iOS Testing:
-
-Spawn a subagent to run Xcode tests (preserves main context):
-
```
-Task general-purpose("Run /xcode-test for scheme [name]. Build for simulator, install, launch, take screenshots, check for crashes.")
+Task general-purpose("Run /xcode-test for scheme [name]. Build, install, screenshot, check for crashes.")
```
-The subagent will:
-1. Verify XcodeBuildMCP is installed
-2. Discover project and schemes
-3. Build for iOS Simulator
-4. Install and launch app
-5. Take screenshots of key screens
-6. Capture console logs for errors
-7. Pause for human verification (Sign in with Apple, push, IAP)
-8. Create P1 todos for any failures
-9. Fix and retry until all tests pass
-
-**Standalone:** `/xcode-test [scheme]`
-
### Important: P1 Findings Block Merge
Any **🔴 P1 (CRITICAL)** findings must be addressed before merging the PR. Present these prominently and ensure they're resolved before accepting the PR.
-```
+
+---
+
+## Appendix: Token Budget Reference
+
+**Parent context (what matters for avoiding overflow):**
+
+| Component | Token Budget | Notes |
+|-----------|-------------|-------|
+| PR metadata | ~500 | Title, body, file list |
+| Intent analyzer return | ~100 | One sentence confirmation |
+| Per-specialist summary returned to parent | ~100-150 | One sentence + counts (5-7 agents) |
+| Validation script | ~0 | Bash, no LLM tokens |
+| Judge agent return | ~100 | One sentence + counts |
+| Deep analysis return | ~100 | One sentence confirmation |
+| Enriched findings (ENRICHED_FINDINGS.json) | ~5,000-8,000 | ≤15 findings with impact/scenarios |
+| Todo creation returns | ~300-450 | 3 agents × ~100-150 tokens each |
+| Parent orchestrator overhead | ~5,000 | Instructions, synthesis, report |
+| **Total parent context from agents** | **~11,000-14,000** | **vs ~30,000-50,000 in v1** |
+
+**Sub-agent spawns (quality-maximized):**
+
+| Agent | Context Cost | Purpose |
+|-------|-------------|---------|
+| Intent analyzer | 1 window | Shared context for all specialists |
+| 5-7 specialist reviewers | 5-7 windows | Deep domain-specific analysis |
+| Judge | 1 window | Dedup, evidence-check, rank |
+| Deep analysis | 1 window | Stakeholder impact, scenarios, fix suggestions |
+| 3 todo creators | 3 windows | Code-aware, substantive todo files |
+| **Total** | **11-13 windows** | **Each isolated, parent stays lean** |
+
+The key insight: sub-agent context windows are independent and disposable. Only what they *return* to the parent matters for overflow. Every sub-agent returns ~100 tokens. The parent reads one file (ENRICHED_FINDINGS.json) at ~5-8k tokens. Everything else lives on disk.