diff --git a/plugins/compound-engineering/.claude-plugin/plugin.json b/plugins/compound-engineering/.claude-plugin/plugin.json index 842ed85c..f6636a99 100644 --- a/plugins/compound-engineering/.claude-plugin/plugin.json +++ b/plugins/compound-engineering/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "compound-engineering", - "version": "2.30.0", + "version": "2.31.0", "description": "AI-powered development tools. 29 agents, 25 commands, 16 skills, 1 MCP server for code review, research, design, and workflow automation.", "author": { "name": "Kieran Klaassen", diff --git a/plugins/compound-engineering/CHANGELOG.md b/plugins/compound-engineering/CHANGELOG.md index e55f5c76..02035b91 100644 --- a/plugins/compound-engineering/CHANGELOG.md +++ b/plugins/compound-engineering/CHANGELOG.md @@ -5,6 +5,30 @@ All notable changes to the compound-engineering plugin will be documented in thi The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [2.31.0] - 2026-02-12 + +### Changed + +- **`/workflows:review` command** — Rewritten as v2 with context-managed map-reduce architecture + - **Architecture**: Phased File-Based Map-Reduce — sub-agents write full analysis to `.review/`, return ~100 tokens each to parent. Parent context stays under ~12k tokens vs ~30-50k in v1 + - **6 phases**: Intent → Map (parallel specialists) → Validate → Judge (dedup + evidence-check) → Deep Analysis (stakeholder impact) → Reduce (todo creation) + - **Smart agent selection**: 5 always-run core agents + conditional agents based on PR file types (vs v1's "run ALL agents") + - **Cross-platform**: Uses `.review/` scratchpad (not `/tmp/`), Node.js validation (not Python3) + - **Judge phase**: Deduplicates findings, discards hallucinated file references, ranks by severity × confidence, caps at 15 findings + - **Deep Analysis phase**: Replaces v1's inline Ultra-Thinking with isolated sub-agent for stakeholder perspectives and scenario exploration on P1/P2 findings + - **Protected artifacts**: Excluded paths (docs/plans/, docs/solutions/) enforced in both agent prompts and judge phase + - **0-findings short-circuit**: Clean PRs get a summary without running deep analysis or todo creation + +### Fixed + +- Removed 4 ghost agent references (`rails-turbo-expert`, `dependency-detective`, `devops-harmony-analyst`, `code-philosopher`) that don't exist in the plugin +- Added 5 missing agents to selection matrix: `kieran-typescript-reviewer`, `kieran-python-reviewer`, `julik-frontend-races-reviewer`, `schema-drift-detector`, `learnings-researcher` +- Replaced Python3 JSON validation with Node.js (cross-platform — Python3 not guaranteed on all systems) +- Fixed `.gitignore` append duplication on repeated runs (now uses `grep -q` guard) +- Added excluded paths check to judge prompt (v1 only had it in agent prompts, not synthesis) + +--- + ## [2.30.0] - 2026-02-05 ### Added diff --git a/plugins/compound-engineering/commands/workflows/review.md b/plugins/compound-engineering/commands/workflows/review.md index e6593034..62db4ce0 100644 --- a/plugins/compound-engineering/commands/workflows/review.md +++ b/plugins/compound-engineering/commands/workflows/review.md @@ -1,24 +1,36 @@ --- name: workflows:review -description: Perform exhaustive code reviews using multi-agent analysis, ultra-thinking, and worktrees +description: Perform exhaustive code reviews using multi-agent analysis with file-based synthesis to prevent context 
overflow argument-hint: "[PR number, GitHub URL, branch name, or latest]" --- -# Review Command +# Review Command (v2 — Context-Managed) - Perform exhaustive code reviews using multi-agent analysis, ultra-thinking, and Git worktrees for deep local inspection. +Perform exhaustive code reviews using multi-agent analysis with file-based synthesis to prevent context overflow. ## Introduction Senior Code Review Architect with expertise in security, performance, architecture, and quality assurance +## Architecture: Why This Command Works + +This command uses the **Phased File-Based Map-Reduce** pattern — optimized for review quality while staying within context limits: + +1. **Intent Phase** (sequential) — A single agent analyzes the PR and produces a shared intent summary + architectural context. All specialists receive this so they start from the same understanding. +2. **Map Phase** (parallel) — Specialist sub-agents write full analysis to `.review/`, return only a single sentence to the parent (~100 tokens each vs 2k-4k) +3. **Validate** — Verifies all expected agent files exist (re-runs missing agents), then checks JSON schema compliance +4. **Judge Phase** — A judge agent deduplicates, evidence-checks (discards hallucinated findings), ranks, and selects top-N +5. **Deep Analysis Phase** — A sub-agent performs stakeholder impact and scenario analysis on P1/P2 findings, enriching the judged report +6. **Reduce Phase** — The parent reads only `ENRICHED_FINDINGS.json` (~8k tokens max), then spawns todo-creation agents that read the actual code to produce high-quality, actionable todos + +This keeps the parent context under ~12k tokens of agent output regardless of how many agents run, while maximizing analytical depth at every stage. + ## Prerequisites - Git repository with GitHub CLI (`gh`) installed and authenticated - Clean main/master branch - Proper permissions to create worktrees and access the repository -- For document reviews: Path to a markdown file or document ## Main Tasks @@ -27,500 +39,545 @@ argument-hint: "[PR number, GitHub URL, branch name, or latest]" #$ARGUMENTS - -First, I need to determine the review target type and set up the code for analysis. - - -#### Immediate Actions: - - - [ ] Determine review type: PR number (numeric), GitHub URL, file path (.md), or empty (current branch) - [ ] Check current git branch -- [ ] If ALREADY on the target branch (PR branch, requested branch name, or the branch already checked out for review) → proceed with analysis on current branch -- [ ] If DIFFERENT branch than the review target → offer to use worktree: "Use git-worktree skill for isolated Call `skill: git-worktree` with branch name -- [ ] Fetch PR metadata using `gh pr view --json` for title, body, files, linked issues -- [ ] Set up language-specific analysis tools -- [ ] Prepare security scanning environment -- [ ] Make sure we are on the branch we are reviewing. Use gh pr checkout to switch to the branch or manually checkout the branch. - -Ensure that the code is ready for analysis (either in worktree or on current branch). ONLY then proceed to the next step. - +- [ ] If ALREADY on the target branch → proceed with analysis on current branch +- [ ] If DIFFERENT branch → use git-worktree skill: `skill: git-worktree` with branch name +- [ ] Fetch PR metadata using `gh pr view --json title,body,files,labels,milestone` +- [ ] Get the diff: `gh pr diff ` — save to `.review/pr_diff.txt` +- [ ] Make sure we are on the branch we are reviewing -#### Protected Artifacts +### 2. 
Prepare the Scratchpad Directory - -The following paths are compound-engineering pipeline artifacts and must never be flagged for deletion, removal, or gitignore by any review agent: + +Use a project-relative path, NOT /tmp/. The /tmp/ path causes two problems: +1. Claude Code's Read tool and MCP filesystem tools cannot access /tmp/ (outside allowed directories) +2. On Windows, /tmp/ resolves to different locations depending on the subprocess (MSYS2 vs literal C:\tmp), splitting agent files across directories -- `docs/plans/*.md` — Plan files created by `/workflows:plan`. These are living documents that track implementation progress (checkboxes are checked off by `/workflows:work`). -- `docs/solutions/*.md` — Solution documents created during the pipeline. +Using .review/ inside the project avoids both issues. All tools (Read, Write, Bash, MCP) can access project-relative paths reliably. + -If a review agent flags any file in these directories for cleanup or removal, discard that finding during synthesis. Do not create a todo for it. - +```bash +# Create the review session directory (project-relative, cross-platform safe) +REVIEW_DIR=".review" +rm -rf "$REVIEW_DIR" +mkdir -p "$REVIEW_DIR" +grep -q '.review/' .gitignore 2>/dev/null || echo "$REVIEW_DIR/" >> .gitignore # Ensure it's gitignored (idempotent) + +# Save PR metadata for agents to reference +gh pr view --json title,body,files > "$REVIEW_DIR/pr_meta.json" +gh pr diff > "$REVIEW_DIR/pr_diff.txt" +``` -#### Parallel Agents to review the PR: +Each sub-agent will READ from `$REVIEW_DIR/pr_diff.txt` and WRITE its findings to `$REVIEW_DIR/{agent_name}.json`. - +All references below use `.review/` — this is the ONLY path agents should use. Do not use `/tmp/` anywhere. -Run ALL or most of these agents at the same time: - -1. Task kieran-rails-reviewer(PR content) -2. Task dhh-rails-reviewer(PR title) -3. If turbo is used: Task rails-turbo-expert(PR content) -4. Task git-history-analyzer(PR content) -5. Task dependency-detective(PR content) -6. Task pattern-recognition-specialist(PR content) -7. Task architecture-strategist(PR content) -8. Task code-philosopher(PR content) -9. Task security-sentinel(PR content) -10. Task performance-oracle(PR content) -11. Task devops-harmony-analyst(PR content) -12. Task data-integrity-guardian(PR content) -13. Task agent-native-reviewer(PR content) - Verify new features are agent-accessible +### 3. PR Intent Analysis (Phase 0 — Sequential) - + +Run this BEFORE launching specialist agents. This produces the shared context that all specialists receive, ensuring they start from the same understanding and spend their reasoning on domain-specific analysis rather than independently re-deriving PR intent. + -#### Conditional Agents (Run if applicable): - - +``` +Task pr-intent-analyzer(" +You are a PR Intent Analyzer. Your job is to produce a concise, high-signal summary that specialist review agents will use as shared context. + +## Instructions: +1. Read .review/pr_diff.txt and .review/pr_meta.json +2. Write your analysis to .review/PR_INTENT.md (max 500 words) +3. Your analysis MUST cover: + - **Intent**: What is this PR trying to accomplish? (2-3 sentences) + - **Approach**: How does it achieve this? Key architectural decisions made. 
(3-5 sentences) + - **Scope**: Files changed, components affected, boundaries of the change + - **Risk Surface**: Which areas carry the most risk — new code, refactored paths, changed interfaces, data model changes + - **Testing Gap**: What's tested vs what's not, based on test files in the diff +4. Return to parent: 'Intent analysis complete. Written to .review/PR_INTENT.md' +") +``` -These agents are run ONLY when the PR matches specific criteria. Check the PR files list to determine if they apply: +Wait for this to complete before launching specialists. The ~30 second cost pays for itself in higher-quality, less-redundant specialist findings. -**If PR contains database migrations (db/migrate/*.rb files) or data backfills:** +### 4. Launch Review Agents (Map Phase) -14. Task data-migration-expert(PR content) - Validates ID mappings match production, checks for swapped values, verifies rollback safety -15. Task deployment-verification-agent(PR content) - Creates Go/No-Go deployment checklist with SQL verification queries + +EVERY sub-agent prompt MUST include these output constraints AND the shared intent context. This is what prevents context overflow while maximizing quality. -**When to run migration agents:** -- PR includes files matching `db/migrate/*.rb` -- PR modifies columns that store IDs, enums, or mappings -- PR includes data backfill scripts or rake tasks -- PR changes how data is read/written (e.g., changing from FK to string column) -- PR title/body mentions: migration, backfill, data transformation, ID mapping +Append this to EVERY agent spawn prompt: -**What these agents check:** -- `data-migration-expert`: Verifies hard-coded mappings match production reality (prevents swapped IDs), checks for orphaned associations, validates dual-write patterns -- `deployment-verification-agent`: Produces executable pre/post-deploy checklists with SQL queries, rollback procedures, and monitoring plans +``` +## SHARED CONTEXT +Read .review/PR_INTENT.md first for PR intent, approach, and risk surface. Do not re-derive what the PR does — focus your full reasoning on your specialist domain. + +## OUTPUT RULES (MANDATORY) +1. Read the diff from .review/pr_diff.txt +2. Write your FULL analysis as JSON to .review/{your_agent_name}.json +3. Use this EXACT schema for findings (hard caps enforced): + { + "agent_type": "security|performance|architecture|rails|patterns|simplicity|...", + "summary": "<500 chars max — your key takeaway>", + "findings_count": , + "critical_count": , + "findings": [ + { + "severity": "p1|p2|p3", + "category": "<50 chars>", + "title": "<100 chars>", + "location": "path/to/file.ext:line_number", + "description": "<300 chars>", + "recommendation": "<200 chars>", + "confidence": 0.0-1.0 + } + ] + } +4. Max 10 findings per agent. Prioritize by severity. No prose analysis fields. +5. Only report findings with confidence >= 0.6. Quality over quantity. +6. EXCLUDED PATHS — do not analyze or report findings for: + - docs/plans/*.md + - docs/solutions/*.md +7. Every finding MUST reference a real file path and line number from the diff. Do not report findings without specific location evidence. +8. Return ONLY this to the parent (do NOT return the full analysis): + "Review complete. Wrote findings ( critical) to .review/{agent_name}.json. Key takeaway: <1 sentence>" +``` + - +#### Agent Selection — Choose Based on PR Content -### 4. Ultra-Thinking Deep Dive Phases + +Do NOT run all agents on every PR. 
Examine the PR file list and select relevant agents: - For each phase below, spend maximum cognitive effort. Think step by step. Consider all angles. Question assumptions. And bring all reviews in a synthesis to the user. +**Always Run (Core — 5 agents):** +1. `architecture-strategist` — structural patterns, coupling, abstractions, component boundaries +2. `security-sentinel` — vulnerabilities, auth, input validation, injection +3. `performance-oracle` — N+1 queries, memory leaks, scaling concerns, resource exhaustion +4. `code-simplicity-reviewer` — unnecessary complexity, dead code, simplification opportunities +5. `pattern-recognition-specialist` — anti-patterns, inconsistencies, convention violations - -Complete system context map with component interactions - +**Run If Applicable (language-specific):** +- `kieran-rails-reviewer` — only if PR has `.rb` files +- `dhh-rails-reviewer` — only if PR has `.rb` files +- `kieran-typescript-reviewer` — only if PR has `.ts`/`.tsx` files +- `kieran-python-reviewer` — only if PR has `.py` files +- `julik-frontend-races-reviewer` — only if PR has Stimulus controllers or frontend async code -#### Phase 3: Stakeholder Perspective Analysis +**Run If Applicable (domain-specific):** +- `agent-native-reviewer` — only if PR adds user-facing features +- `data-integrity-guardian` — only if PR modifies database queries or data flows +- `git-history-analyzer` — if PR touches files with complex change history or frequent churn +- `schema-drift-detector` — if PR includes `schema.rb` changes, cross-reference against included migrations +- `learnings-researcher` — if PR touches areas where `docs/solutions/` may have relevant prior art - ULTRA-THINK: Put yourself in each stakeholder's shoes. What matters to them? What are their pain points? +**Migration-Specific (only for db/migrate/*.rb files):** +- `data-migration-expert` — validates ID mappings, rollback safety +- `deployment-verification-agent` — Go/No-Go deployment checklist + - +#### Launch Pattern -1. **Developer Perspective** + +Launch selected agents in parallel. Each agent reads the diff from the scratchpad and writes its findings there. - - How easy is this to understand and modify? - - Are the APIs intuitive? - - Is debugging straightforward? - - Can I test this easily? +Example for a TypeScript PR (no Rails, no migrations): -2. **Operations Perspective** +``` +Task architecture-strategist("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task security-sentinel("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task performance-oracle("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task code-simplicity-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task pattern-recognition-specialist("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task kieran-typescript-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +``` - - How do I deploy this safely? - - What metrics and logs are available? - - How do I troubleshoot issues? - - What are the resource requirements? +Example for a Rails PR with migrations: -3. **End User Perspective** +``` +Task kieran-rails-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task dhh-rails-reviewer("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task security-sentinel("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task performance-oracle("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task architecture-strategist("Review PR #49. . 
" + SHARED_CONTEXT + OUTPUT_RULES) +Task data-migration-expert("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +Task schema-drift-detector("Review PR #49. . " + SHARED_CONTEXT + OUTPUT_RULES) +``` - - Is the feature intuitive? - - Are error messages helpful? - - Is performance acceptable? - - Does it solve my problem? +Wait for ALL agents to complete. + -4. **Security Team Perspective** +### 5. Verify and Validate Agent Outputs - - What's the attack surface? - - Are there compliance requirements? - - How is data protected? - - What are the audit capabilities? + +Two checks run before the judge: (1) verify every launched agent actually produced a file, (2) validate the JSON schema. This catches the silent-failure bug where an agent reports completion but its file is missing. + -5. **Business Perspective** - - What's the ROI? - - Are there legal/compliance risks? - - How does this affect time-to-market? - - What's the total cost of ownership? +#### Step 5a: Verify All Expected Files Exist -#### Phase 4: Scenario Exploration +```bash +# List expected agent files based on which agents were launched +# (adjust this list to match the agents you actually spawned in Step 4) +EXPECTED_AGENTS="architecture security performance simplicity patterns" # or whatever you launched + +MISSING="" +for agent in $EXPECTED_AGENTS; do + if ! ls .review/${agent}*.json 1>/dev/null 2>&1; then + MISSING="$MISSING $agent" + echo "MISSING: $agent — no output file found in .review/" + fi +done + +if [ -n "$MISSING" ]; then + echo "⚠️ Missing agent files:$MISSING" + echo "Re-run these agents before proceeding to judge." +fi +``` - ULTRA-THINK: Explore edge cases and failure scenarios. What could go wrong? How does the system behave under stress? +**If any agent file is missing:** Re-launch that specific agent before proceeding. Do not skip to the judge — silent agent failures caused 10 findings (including 2 P1s) to be lost in testing. - +#### Step 5b: Validate JSON Schema -- [ ] **Happy Path**: Normal operation with valid inputs -- [ ] **Invalid Inputs**: Null, empty, malformed data -- [ ] **Boundary Conditions**: Min/max values, empty collections -- [ ] **Concurrent Access**: Race conditions, deadlocks -- [ ] **Scale Testing**: 10x, 100x, 1000x normal load -- [ ] **Network Issues**: Timeouts, partial failures -- [ ] **Resource Exhaustion**: Memory, disk, connections -- [ ] **Security Attacks**: Injection, overflow, DoS -- [ ] **Data Corruption**: Partial writes, inconsistency -- [ ] **Cascading Failures**: Downstream service issues +```bash +# Validate all agent output files (uses Node.js — guaranteed available since Claude Code requires it) +for f in .review/*.json; do + [[ "$(basename "$f")" == "pr_meta.json" ]] && continue + node -e " +const fs = require('fs'); +const f = '$f'; +const name = require('path').basename(f); +try { + const data = JSON.parse(fs.readFileSync(f, 'utf8')); + if (!Array.isArray(data.findings)) throw new Error('findings not an array'); + if (data.findings.length > 10) throw new Error('too many findings: ' + data.findings.length); + data.findings.forEach((finding, i) => { + if (!['p1','p2','p3'].includes(finding.severity)) throw new Error('finding ' + i + ': bad severity'); + if (!finding.location) throw new Error('finding ' + i + ': missing location'); + }); + console.log('VALID:', name, '-', data.findings.length, 'findings'); +} catch (e) { + console.log('INVALID:', name, '-', e.message, '— removing from review'); + fs.unlinkSync(f); +} +" 2>&1 +done +``` -### 6. 
Multi-Angle Review Perspectives +If an agent produced invalid output, it's removed before judging. The review continues with the valid agent outputs. -#### Technical Excellence Angle +**Checkpoint:** At this point, every launched agent should have a valid JSON file in `.review/`. If not, re-run the missing/invalid agents before proceeding. -- Code craftsmanship evaluation -- Engineering best practices -- Technical documentation quality -- Tooling and automation assessment +### 6. Judge Phase — Deduplicate and Rank -#### Business Value Angle + +Do NOT read individual agent JSON files into your context. Launch a JUDGE agent that reads them in its own context window. + -- Feature completeness validation -- Performance impact on users -- Cost-benefit analysis -- Time-to-market considerations +``` +Task judge-findings(" +You are a Code Review Judge. Consolidate findings from multiple review agents into a single, deduplicated, ranked report. + +## Instructions: +1. Read ALL JSON files in .review/*.json (skip pr_meta.json and pr_diff.txt) +2. Read .review/pr_meta.json to get the list of files actually in this PR +3. Collect all findings across agents +4. EVIDENCE CHECK: Discard any finding whose 'location' field does not reference a file path present in pr_meta.json. These are hallucinated findings. +4b. EXCLUDED PATHS: Also discard any finding whose location is under docs/plans/ or docs/solutions/ — these are compound-engineering pipeline artifacts and must never be flagged. +5. DEDUPLICATE: Remove semantically similar findings (keep the higher-confidence one) +6. RESOLVE CONFLICTS: If agents contradict each other, note the conflict and pick the more evidence-based finding +7. RANK by: severity (p1 > p2 > p3), then confidence score +8. SELECT top 15 findings maximum. Only include findings with confidence >= 0.6. +9. For each P1 finding, add a brief 'impact' field (<200 chars) assessing: deployment risk, user impact, and attack surface. +10. Write the consolidated report to .review/JUDGED_FINDINGS.json using this schema: + +{ + \"pr_number\": , + \"total_raw_findings\": , + \"duplicates_removed\": , + \"hallucinated_removed\": , + \"conflicts_resolved\": , + \"final_findings\": [ + { + \"id\": 1, + \"severity\": \"p1|p2|p3\", + \"category\": \"<50 chars>\", + \"title\": \"<100 chars>\", + \"location\": \"path/to/file.ext:line\", + \"description\": \"<300 chars>\", + \"recommendation\": \"<200 chars>\", + \"confidence\": 0.0-1.0, + \"source_agents\": [\"agent1\", \"agent2\"], + \"effort\": \"small|medium|large\", + \"impact\": \"<200 chars, P1 only, null for P2/P3>\" + } + ], + \"agent_summaries\": [ + {\"agent\": \"name\", \"summary\": \"\"} + ] +} + +11. Return to parent: 'Judging complete. raw findings → after dedup ( hallucinated removed). P1: , P2: , P3: . Report at .review/JUDGED_FINDINGS.json' +") +``` -#### Risk Management Angle +**Short-circuit: If the judge returns 0 final findings**, skip Deep Analysis and Todo Creation. Present a clean summary: -- Security risk assessment -- Operational risk evaluation -- Compliance risk verification -- Technical debt accumulation +```markdown +## ✅ Code Review Complete — No Findings -#### Team Dynamics Angle +**Review Target:** PR #XXXX - [PR Title] +**Branch:** [branch-name] -- Code review etiquette -- Knowledge sharing effectiveness -- Collaboration patterns -- Mentoring opportunities +All [N] agents reviewed the PR. 
[X] raw findings were raised, but all were either hallucinated (wrong file paths), duplicates, or below confidence threshold after judging. No action items. +``` -### 4. Simplification and Minimalism Review +### 7. Deep Analysis Phase (P1/P2 Enrichment) -Run the Task code-simplicity-reviewer() to see if we can simplify the code. + +After judging, spawn a deep-analysis agent to enrich findings. This runs in its own context window so it doesn't cost the parent anything, but it produces significantly richer findings for todo creation. -### 5. Findings Synthesis and Todo Creation Using file-todos Skill +This agent performs the Ultra-Thinking deep dive from the original review command — stakeholder perspectives, scenario exploration, and multi-angle analysis — but scoped to the judged findings rather than the raw diff, and in an isolated context window rather than the parent. + - ALL findings MUST be stored in the todos/ directory using the file-todos skill. Create todo files immediately after synthesis - do NOT present findings for user approval first. Use the skill for structured todo management. +``` +Task deep-analysis(" +You are a Deep Analysis Reviewer. Your job is to enrich judged code review findings with deep stakeholder analysis, scenario exploration, and concrete fix suggestions. -#### Step 1: Synthesize All Findings +## Instructions: +1. Read .review/JUDGED_FINDINGS.json +2. Read .review/pr_diff.txt for code context +3. Read .review/PR_INTENT.md for PR context - -Consolidate all agent reports into a categorized list of findings. -Remove duplicates, prioritize by severity and impact. - +## For each P1 finding, perform ALL of the following: - +### Stakeholder Impact Analysis +Put yourself in each stakeholder's shoes. Assess impact from each perspective: -- [ ] Collect findings from all parallel agents -- [ ] Discard any findings that recommend deleting or gitignoring files in `docs/plans/` or `docs/solutions/` (see Protected Artifacts above) -- [ ] Categorize by type: security, performance, architecture, quality, etc. -- [ ] Assign severity levels: 🔴 CRITICAL (P1), 🟡 IMPORTANT (P2), 🔵 NICE-TO-HAVE (P3) -- [ ] Remove duplicate or overlapping findings -- [ ] Estimate effort for each finding (Small/Medium/Large) +- **Developer**: How easy is the affected code to understand and modify? Are the APIs intuitive? Is debugging straightforward? Can this be tested easily? +- **Operations**: How do I deploy this safely? What metrics and logs are available? How do I troubleshoot if this goes wrong? What are the resource requirements? +- **End User**: Is the feature intuitive? Are error messages helpful? Is performance acceptable? Does it solve their problem? +- **Security**: What's the attack surface? Are there compliance requirements? How is data protected? What are the audit capabilities? +- **Business**: What's the ROI impact? Are there legal/compliance risks? How does this affect time-to-market? - +Add a 'stakeholder_impact' object with a field for each perspective (each 1-3 sentences). -#### Step 2: Create Todo Files Using file-todos Skill +### Scenario Exploration +Think through edge cases and failure scenarios. For the code involved in this finding, consider: - Use the file-todos skill to create todo files for ALL findings immediately. Do NOT present findings one-by-one asking for user approval. Create all todo files in parallel using the skill, then summarize results to user. 
+- Invalid/malformed inputs (null, empty, unexpected types) +- Boundary conditions (min/max values, empty collections) +- Concurrent access (race conditions, deadlocks) +- Scale pressure (10x, 100x normal load) +- Network issues (timeouts, partial failures) +- Resource exhaustion (memory, disk, connections) +- Security vectors (injection, overflow, DoS) +- Data corruption (partial writes, inconsistency) +- Cascading failures (downstream service issues) -**Implementation Options:** +Add a 'scenarios' array listing the 3-5 most relevant failure scenarios with specific trigger conditions and expected impact. -**Option A: Direct File Creation (Fast)** +### Suggested Fix +Add a 'suggested_fix' field with a concrete code-level suggestion for resolution (<500 chars). -- Create todo files directly using Write tool -- All findings in parallel for speed -- Use standard template from `.claude/skills/file-todos/assets/todo-template.md` -- Follow naming convention: `{issue_id}-pending-{priority}-{description}.md` +## For each P2 finding: +- Add 'stakeholder_impact' with Developer and Operations perspectives (1-2 sentences each) +- Add 'scenarios' with 1-2 most relevant failure scenarios +- Add 'suggested_fix' (<300 chars) -**Option B: Sub-Agents in Parallel (Recommended for Scale)** For large PRs with 15+ findings, use sub-agents to create finding files in parallel: +## For P3 findings: +- Pass through unchanged -```bash -# Launch multiple finding-creator agents in parallel -Task() - Create todos for first finding -Task() - Create todos for second finding -Task() - Create todos for third finding -etc. for each finding. +## Output: +Write the enriched report to .review/ENRICHED_FINDINGS.json (same schema as JUDGED_FINDINGS.json, plus the new fields). +Return to parent: 'Deep analysis complete. Enriched P1 and P2 findings. Written to .review/ENRICHED_FINDINGS.json' +") ``` -Sub-agents can: - -- Process multiple findings simultaneously -- Write detailed todo files with all sections filled -- Organize findings by severity -- Create comprehensive Proposed Solutions -- Add acceptance criteria and work logs -- Complete much faster than sequential processing - -**Execution Strategy:** - -1. Synthesize all findings into categories (P1/P2/P3) -2. Group findings by severity -3. Launch 3 parallel sub-agents (one per severity level) -4. Each sub-agent creates its batch of todos using the file-todos skill -5. Consolidate results and present summary - -**Process (Using file-todos Skill):** +### 8. Synthesis — Read ONLY the Enriched Findings -1. For each finding: + +NOW read `.review/ENRICHED_FINDINGS.json` — this is the ONLY agent output file you read into your context. This file is pre-deduplicated, evidence-checked, ranked, enriched with impact analysis, and capped at ~15 findings. + - - Determine severity (P1/P2/P3) - - Write detailed Problem Statement and Findings - - Create 2-3 Proposed Solutions with pros/cons/effort/risk - - Estimate effort (Small/Medium/Large) - - Add acceptance criteria and work log +Read the enriched findings file and proceed to create todos. -2. Use file-todos skill for structured todo management: +### 9. Create Todo Files - ```bash - skill: file-todos - ``` + +ALL findings from ENRICHED_FINDINGS.json MUST be stored in the todos/ directory. Create todo files immediately — do NOT present findings for user approval first. 
+ - The skill provides: +#### Execution: LLM-Powered Todo Creation (Sub-Agents) - - Template location: `.claude/skills/file-todos/assets/todo-template.md` - - Naming convention: `{issue_id}-{status}-{priority}-{description}.md` - - YAML frontmatter structure: status, priority, issue_id, tags, dependencies - - All required sections: Problem Statement, Findings, Solutions, etc. +Todo creation benefits from LLM reasoning — agents can read the actual code at each finding location to write meaningful Problem Statements, generate real Proposed Solutions with pros/cons, and craft acceptance criteria specific to the codebase. -3. Create todo files in parallel: +Launch parallel sub-agents grouped by severity: - ```bash - {next_id}-pending-{priority}-{description}.md - ``` - -4. Examples: +``` +Task create-todos-p1(" +Create todo files in todos/ for these P1 findings from the code review. + +## Instructions: +1. Read .review/ENRICHED_FINDINGS.json — process only P1 findings +2. For each P1 finding, read the actual source code at the finding's location to understand full context +3. Create a todo file using the file-todos skill template: .claude/skills/file-todos/assets/todo-template.md +4. File naming: {id}-pending-p1-{slug}.md (id zero-padded to 3 digits, slug from title) +5. Fill in ALL sections with substantive content: + - Problem Statement: explain the issue in context of the actual code, not just the finding description + - Proposed Solutions: 2-3 real options with pros/cons/effort/risk based on the codebase + - Acceptance Criteria: specific, testable checklist items + - Work Log: initial entry with today's date +6. Tag all with 'code-review' plus relevant domain tags +") + +Task create-todos-p2(" +Create todo files in todos/ for these P2 findings from the code review. + +## Instructions: +1. Read .review/ENRICHED_FINDINGS.json — process only P2 findings +2. For each P2 finding, read the actual source code at the finding's location +3. Create todo files using the file-todos skill template +4. File naming: {id}-pending-p2-{slug}.md +5. Fill in all sections with substantive content +6. Tag all with 'code-review' plus relevant domain tags +") + +Task create-todos-p3(" +Create todo files in todos/ for these P3 findings from the code review. + +## Instructions: +1. Read .review/ENRICHED_FINDINGS.json — process only P3 findings +2. For each P3 finding, create a lighter-weight todo file +3. File naming: {id}-pending-p3-{slug}.md +4. Problem Statement and Recommendation are sufficient — Proposed Solutions can be brief +5. Tag all with 'code-review' plus relevant domain tags +") +``` - ``` - 001-pending-p1-path-traversal-vulnerability.md - 002-pending-p1-api-response-validation.md - 003-pending-p2-concurrency-limit.md - 004-pending-p3-unused-parameter.md - ``` +**File naming convention:** +``` +{issue_id}-pending-{priority}-{description}.md -5. 
Follow template structure from file-todos skill: `.claude/skills/file-todos/assets/todo-template.md` +Examples: +- 001-pending-p1-path-traversal-vulnerability.md +- 002-pending-p1-api-response-validation.md +- 003-pending-p2-concurrency-limit.md +- 004-pending-p3-unused-parameter.md +``` -**Todo File Structure (from template):** +**Todo file structure (from file-todos skill template):** Each todo must include: - - **YAML frontmatter**: status, priority, issue_id, tags, dependencies - **Problem Statement**: What's broken/missing, why it matters - **Findings**: Discoveries from agents with evidence/location - **Proposed Solutions**: 2-3 options, each with pros/cons/effort/risk -- **Recommended Action**: (Filled during triage, leave blank initially) -- **Technical Details**: Affected files, components, database changes +- **Recommended Action**: Leave blank initially +- **Technical Details**: Affected files, components - **Acceptance Criteria**: Testable checklist items -- **Work Log**: Dated record with actions and learnings -- **Resources**: Links to PR, issues, documentation, similar patterns +- **Work Log**: Dated record +- **Resources**: Links to PR, issues, documentation -**File naming convention:** +**Status values:** `pending` → `ready` → `complete` +**Priority values:** `p1` (blocks merge) | `p2` (should fix) | `p3` (nice-to-have) +**Tagging:** Always add `code-review` tag, plus relevant domain tags -``` -{issue_id}-{status}-{priority}-{description}.md +### 10. Summary Report -Examples: -- 001-pending-p1-security-vulnerability.md -- 002-pending-p2-performance-optimization.md -- 003-pending-p3-code-cleanup.md -``` +After creating all todo files, present: -**Status values:** - -- `pending` - New findings, needs triage/decision -- `ready` - Approved by manager, ready to work -- `complete` - Work finished - -**Priority values:** - -- `p1` - Critical (blocks merge, security/data issues) -- `p2` - Important (should fix, architectural/performance) -- `p3` - Nice-to-have (enhancements, cleanup) - -**Tagging:** Always add `code-review` tag, plus: `security`, `performance`, `architecture`, `rails`, `quality`, etc. 
- -#### Step 3: Summary Report - -After creating all todo files, present comprehensive summary: - -````markdown +```markdown ## ✅ Code Review Complete -**Review Target:** PR #XXXX - [PR Title] **Branch:** [branch-name] +**Review Target:** PR #XXXX - [PR Title] +**Branch:** [branch-name] ### Findings Summary: - -- **Total Findings:** [X] -- **🔴 CRITICAL (P1):** [count] - BLOCKS MERGE -- **🟡 IMPORTANT (P2):** [count] - Should Fix -- **🔵 NICE-TO-HAVE (P3):** [count] - Enhancements +- **Total Raw Findings:** [X] (from [N] agents) +- **Hallucinated/Invalid Removed:** [Z] +- **After Dedup/Judging:** [Y] +- **🔴 P1 (BLOCKS MERGE):** [count] +- **🟡 P2 (Should Fix):** [count] +- **🔵 P3 (Nice-to-Have):** [count] ### Created Todo Files: -**P1 - Critical (BLOCKS MERGE):** - -- `001-pending-p1-{finding}.md` - {description} -- `002-pending-p1-{finding}.md` - {description} - -**P2 - Important:** - -- `003-pending-p2-{finding}.md` - {description} -- `004-pending-p2-{finding}.md` - {description} - -**P3 - Nice-to-Have:** +**P1 — Critical (BLOCKS MERGE):** +- `001-pending-p1-{finding}.md` — {description} -- `005-pending-p3-{finding}.md` - {description} +**P2 — Important:** +- `003-pending-p2-{finding}.md` — {description} -### Review Agents Used: +**P3 — Nice-to-Have:** +- `005-pending-p3-{finding}.md` — {description} -- kieran-rails-reviewer -- security-sentinel -- performance-oracle -- architecture-strategist -- agent-native-reviewer -- [other agents] +### Agents Used: +- [list of agents that ran] ### Next Steps: - -1. **Address P1 Findings**: CRITICAL - must be fixed before merge - - - Review each P1 todo in detail - - Implement fixes or request exemption - - Verify fixes before merging PR - -2. **Triage All Todos**: - ```bash - ls todos/*-pending-*.md # View all pending todos - /triage # Use slash command for interactive triage - ``` -```` - -3. **Work on Approved Todos**: - - ```bash - /resolve_todo_parallel # Fix all approved items efficiently - ``` - -4. **Track Progress**: - - Rename file when status changes: pending → ready → complete - - Update Work Log as you work - - Commit todos: `git add todos/ && git commit -m "refactor: add code review findings"` - -### Severity Breakdown: - -**🔴 P1 (Critical - Blocks Merge):** - -- Security vulnerabilities -- Data corruption risks -- Breaking changes -- Critical architectural issues - -**🟡 P2 (Important - Should Fix):** - -- Performance issues -- Significant architectural concerns -- Major code quality problems -- Reliability issues - -**🔵 P3 (Nice-to-Have):** - -- Minor improvements -- Code cleanup -- Optimization opportunities -- Documentation updates - +1. **Address P1 Findings** — must fix before merge +2. **Triage todos:** `ls todos/*-pending-*.md` +3. **Work on approved items:** `/resolve_todo_parallel` ``` -### 7. End-to-End Testing (Optional) +### 11. End-to-End Testing (Optional) -**First, detect the project type from PR files:** - | Indicator | Project Type | |-----------|--------------| -| `*.xcodeproj`, `*.xcworkspace`, `Package.swift` (iOS) | iOS/macOS | -| `Gemfile`, `package.json`, `app/views/*`, `*.html.*` | Web | -| Both iOS files AND web files | Hybrid (test both) | +| `*.xcodeproj`, `*.xcworkspace`, `Package.swift` | iOS/macOS | +| `Gemfile`, `package.json`, `app/views/*` | Web | +| Both | Hybrid | - - -After presenting the Summary Report, offer appropriate testing based on project type: - -**For Web Projects:** -```markdown -**"Want to run browser tests on the affected pages?"** -1. Yes - run `/test-browser` -2. 
No - skip -``` - -**For iOS Projects:** -```markdown -**"Want to run Xcode simulator tests on the app?"** -1. Yes - run `/xcode-test` -2. No - skip -``` - -**For Hybrid Projects (e.g., Rails + Hotwire Native):** -```markdown -**"Want to run end-to-end tests?"** -1. Web only - run `/test-browser` -2. iOS only - run `/xcode-test` -3. Both - run both commands -4. No - skip -``` - - +After presenting the Summary Report, offer testing based on project type: -#### If User Accepts Web Testing: +**Web:** "Run browser tests? → `/test-browser`" +**iOS:** "Run simulator tests? → `/xcode-test`" +**Hybrid:** "Web only / iOS only / Both / Skip" -Spawn a subagent to run browser tests (preserves main context): +If accepted, spawn as sub-agent to preserve main context: ``` -Task general-purpose("Run /test-browser for PR #[number]. Test all affected pages, check for console errors, handle failures by creating todos and fixing.") +Task general-purpose("Run /test-browser for PR #[number]. Test affected pages, check for errors, create P1 todos for failures.") ``` -The subagent will: -1. Identify pages affected by the PR -2. Navigate to each page and capture snapshots (using Playwright MCP or agent-browser CLI) -3. Check for console errors -4. Test critical interactions -5. Pause for human verification on OAuth/email/payment flows -6. Create P1 todos for any failures -7. Fix and retry until all tests pass - -**Standalone:** `/test-browser [PR number]` - -#### If User Accepts iOS Testing: - -Spawn a subagent to run Xcode tests (preserves main context): - ``` -Task general-purpose("Run /xcode-test for scheme [name]. Build for simulator, install, launch, take screenshots, check for crashes.") +Task general-purpose("Run /xcode-test for scheme [name]. Build, install, screenshot, check for crashes.") ``` -The subagent will: -1. Verify XcodeBuildMCP is installed -2. Discover project and schemes -3. Build for iOS Simulator -4. Install and launch app -5. Take screenshots of key screens -6. Capture console logs for errors -7. Pause for human verification (Sign in with Apple, push, IAP) -8. Create P1 todos for any failures -9. Fix and retry until all tests pass - -**Standalone:** `/xcode-test [scheme]` - ### Important: P1 Findings Block Merge Any **🔴 P1 (CRITICAL)** findings must be addressed before merging the PR. Present these prominently and ensure they're resolved before accepting the PR. 
-```
+
+---
+
+## Appendix: Token Budget Reference
+
+**Parent context (what matters for avoiding overflow):**
+
+| Component | Token Budget | Notes |
+|-----------|-------------|-------|
+| PR metadata | ~500 | Title, body, file list |
+| Intent analyzer return | ~100 | One sentence confirmation |
+| Per-specialist summary returned to parent | ~100-150 | One sentence + counts (5-7 agents) |
+| Validation script | ~0 | Bash, no LLM tokens |
+| Judge agent return | ~100 | One sentence + counts |
+| Deep analysis return | ~100 | One sentence confirmation |
+| Enriched findings (ENRICHED_FINDINGS.json) | ~5,000-8,000 | ≤15 findings with impact/scenarios |
+| Todo creation returns | ~300-450 | 3 agents × ~100-150 tokens each |
+| Parent orchestrator overhead | ~5,000 | Instructions, synthesis, report |
+| **Total parent context from agents** | **~11,000-14,000** | **vs ~30,000-50,000 in v1** |
+
+**Sub-agent spawns (quality-maximized):**
+
+| Agent | Context Cost | Purpose |
+|-------|-------------|---------|
+| Intent analyzer | 1 window | Shared context for all specialists |
+| 5-7 specialist reviewers | 5-7 windows | Deep domain-specific analysis |
+| Judge | 1 window | Dedup, evidence-check, rank |
+| Deep analysis | 1 window | Stakeholder impact, scenarios, fix suggestions |
+| 3 todo creators | 3 windows | Code-aware, substantive todo files |
+| **Total** | **11-13 windows** | **Each isolated, parent stays lean** |
+
+The key insight: sub-agent context windows are independent and disposable. Only what they *return* to the parent matters for overflow. Every sub-agent returns ~100 tokens. The parent reads one file (ENRICHED_FINDINGS.json) at ~5-8k tokens. Everything else lives on disk.
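+
+As a concrete illustration of that boundary, the snippet below is a minimal, optional sketch (not part of the required flow) showing how the parent could derive the Section 10 severity tallies from the single enriched file, using the `final_findings` schema defined in the Judge phase. No per-agent files ever enter the parent context:
+
+```bash
+# Illustrative only: tally severities for the Summary Report from the one file the parent reads.
+node -e "
+const fs = require('fs');
+const report = JSON.parse(fs.readFileSync('.review/ENRICHED_FINDINGS.json', 'utf8'));
+const counts = { p1: 0, p2: 0, p3: 0 };
+(report.final_findings || []).forEach(f => { counts[f.severity] = (counts[f.severity] || 0) + 1; });
+console.log('Raw findings:', report.total_raw_findings);
+console.log('Hallucinated removed:', report.hallucinated_removed);
+console.log('After judging:', (report.final_findings || []).length);
+console.log('P1:', counts.p1, '| P2:', counts.p2, '| P3:', counts.p3);
+"
+```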