
fix: prevent duplicate features in bulk creation #148

Open
exbald wants to merge 1 commit into AutoForgeAI:master from exbald:fix/dedup-bulk-features

Conversation

@exbald commented Jan 31, 2026

Problem

When feature_create_bulk is called multiple times (e.g., from expand sessions or if the initializer runs again), it creates duplicate features with the same (category, name) pairs. This can lead to hundreds of duplicate features in the database.

Solution

Added deduplication logic to feature_create_bulk:

  1. Query existing features - Check (category, name) pairs already in the database
  2. Filter input batch - Skip features that already exist or appear multiple times in the same batch
  3. Report skipped count - Return skipped_duplicates in the response so callers know how many were filtered
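The three steps above can be sketched in a few lines. This is a simplified illustration, not the actual implementation: the real code in mcp_server/feature_mcp.py reads existing (category, name) pairs via SQLAlchemy, and the function and variable names here are hypothetical.

```python
# Hedged sketch of the deduplication pass; names are illustrative, and the
# real code queries existing pairs from the database via SQLAlchemy.
def dedupe_batch(features, existing_pairs):
    """Drop features whose (category, name) already exists in the database
    or appeared earlier in the same batch; return survivors and skip count."""
    seen = set(existing_pairs)
    unique, skipped = [], 0
    for f in features:
        key = (f["category"], f["name"])
        if key in seen:
            skipped += 1          # already in DB or duplicated within the batch
        else:
            seen.add(key)         # mark as will-exist for in-batch dedup
            unique.append(f)
    return unique, skipped


batch = [
    {"category": "ui", "name": "login"},
    {"category": "ui", "name": "login"},   # in-batch duplicate
    {"category": "api", "name": "auth"},   # already in the database
]
unique, skipped = dedupe_batch(batch, {("api", "auth")})
# unique keeps only the first ("ui", "login"); skipped == 2
```

The skipped count flows back to the caller as skipped_duplicates, so repeated bulk calls become visible instead of silent.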

Changes

  • mcp_server/feature_mcp.py: Added deduplication before feature creation

Testing

Tested on a project that had 607 features where roughly 98 were expected. The duplicates were traced to repeated bulk-creation calls, and this fix prevents the issue from recurring.

Summary by CodeRabbit

Release Notes

  • New Features
    • Bulk feature creation now automatically detects and filters duplicate entries based on category and name.
    • Response includes a count of skipped duplicates for transparency.
    • Improved data integrity by preventing duplicate features within a single bulk creation request.


When feature_create_bulk is called multiple times (e.g., from expand sessions),
it now checks for existing (category, name) pairs and skips duplicates instead
of creating them again.

Changes:
- Query existing features before creation
- Filter duplicates from input batch
- Report skipped_duplicates count in response
coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

The feature_create_bulk function now includes a pre-creation deduplication pass to identify and filter out duplicate (category, name) pairs from the input batch. The function now follows a four-pass process: validation, duplicate filtering, creation, and dependency resolution. The response includes a new skipped_duplicates field reporting filtered duplicates.

Changes

Cohort / File(s) | Summary
Deduplication Logic (mcp_server/feature_mcp.py) | Added a pre-creation pass to collect existing (category, name) pairs and filter duplicates from the input batch. The second and third passes now operate on the deduplicated list. The response includes a new skipped_duplicates field. Control-flow comments were updated to reflect the four-pass process.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Whiskers twitching with glee,
Duplicate features fade away,
One by one, clean and neat,
Deduplication saves the day!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title 'fix: prevent duplicate features in bulk creation' directly and clearly summarizes the main change: adding deduplication logic to the bulk feature creation function to prevent duplicate (category, name) pairs.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.



coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mcp_server/feature_mcp.py (1)

549-591: ⚠️ Potential issue | 🔴 Critical

Critical bug: Dependency indices break when features are skipped.

The depends_on_indices values refer to positions in the original features list, but after filtering into unique_features, the index mapping is invalidated. This causes:

  1. IndexError if idx >= len(unique_features) after filtering
  2. Wrong dependencies if the dependency target was skipped (exists in DB) — the code will reference the wrong feature from created_features instead of looking up the existing feature's ID
  3. Self-dependency in edge cases where a feature incorrectly depends on itself

Example: If features [A, B] where B has depends_on_indices: [0], and A is skipped (already exists with id=5), then unique_features = [B], created_features = [B], and B's dependency resolves to created_features[0].id (B's own ID) instead of 5.
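The stale-index behavior in that example can be reproduced in a few lines. The data here is hypothetical and independent of the real module; it only demonstrates why naive filtering invalidates depends_on_indices.

```python
# Minimal reproduction of the stale-index hazard (hypothetical data).
features = [
    {"category": "auth", "name": "A"},                             # already in DB (id=5)
    {"category": "auth", "name": "B", "depends_on_indices": [0]},  # depends on A
]
existing = {("auth", "A")}

# Naive filtering drops A but leaves B's stored index untouched...
unique_features = [
    f for f in features if (f["category"], f["name"]) not in existing
]
# ...so index 0 now points at B itself in the filtered list, not at the
# skipped feature A, which is exactly the self-dependency case above.
```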

🔧 Proposed fix: maintain index-to-ID mapping for both new and existing features
             # Second pass: check for existing features to avoid duplicates
-            existing = set()
-            for f in session.query(Feature.category, Feature.name).all():
-                existing.add((f.category, f.name))
+            existing_map = {}  # (category, name) -> feature_id
+            for f in session.query(Feature.id, Feature.category, Feature.name).all():
+                existing_map[(f.category, f.name)] = f.id

             # Filter out duplicates from input batch
             unique_features = []
+            original_idx_to_id = {}  # Maps original index -> feature ID (existing or to-be-created)
             skipped = 0
-            for feature_data in features:
+            for orig_idx, feature_data in enumerate(features):
                 key = (feature_data["category"], feature_data["name"])
-                if key in existing:
+                if key in existing_map:
+                    # Track existing feature's ID for dependency resolution
+                    original_idx_to_id[orig_idx] = existing_map[key]
                     skipped += 1
                 else:
-                    unique_features.append(feature_data)
-                    existing.add(key)  # Mark as will-exist for batch dedup
+                    unique_features.append((orig_idx, feature_data))
+                    existing_map[key] = None  # Placeholder; ID assigned after flush

             # Third pass: create unique features only
             created_features: list[Feature] = []
-            for i, feature_data in enumerate(unique_features):
+            for orig_idx, feature_data in unique_features:
                 db_feature = Feature(
-                    priority=start_priority + i,
+                    priority=start_priority + len(created_features),
                     category=feature_data["category"],
                     name=feature_data["name"],
                     description=feature_data["description"],
                     steps=feature_data["steps"],
                     passes=False,
                     in_progress=False,
                 )
                 session.add(db_feature)
                 created_features.append(db_feature)

             # Flush to get IDs assigned
             session.flush()

+            # Build complete index-to-ID mapping after flush
+            for i, (orig_idx, _) in enumerate(unique_features):
+                original_idx_to_id[orig_idx] = created_features[i].id
+
             # Fourth pass: resolve index-based dependencies to actual IDs
             deps_count = 0
-            for i, feature_data in enumerate(unique_features):
+            for i, (orig_idx, feature_data) in enumerate(unique_features):
                 indices = feature_data.get("depends_on_indices", [])
                 if indices:
-                    # Convert indices to actual feature IDs
-                    dep_ids = [created_features[idx].id for idx in indices]
+                    # Convert original indices to actual feature IDs (new or existing)
+                    dep_ids = [original_idx_to_id[idx] for idx in indices]
                     created_features[i].dependencies = sorted(dep_ids)
                     deps_count += 1
🧹 Nitpick comments (1)
mcp_server/feature_mcp.py (1)

549-552: Consider selecting only necessary columns for the existence check.

The query now retrieves all (category, name) pairs into memory. This is fine for moderate datasets, but if the feature table grows large, consider adding a database index on (category, name) or using an EXISTS subquery per feature. For now, this approach is acceptable given the described scale (~600 features).
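Taking the reviewer's index suggestion one step further, a UNIQUE constraint on (category, name) gives the database both an index for fast existence checks and a hard guarantee against duplicates even if the application-side check is bypassed. A minimal sqlite3 sketch follows; this is an assumption on my part, since the project accesses the database through SQLAlchemy, and the table and column names here are illustrative.

```python
import sqlite3

# Database-level guard: UNIQUE (category, name) creates an index and rejects
# duplicates outright. Illustrative schema, not the project's real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE features (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL,
        name TEXT NOT NULL,
        UNIQUE (category, name)
    )
    """
)
conn.execute("INSERT INTO features (category, name) VALUES ('ui', 'login')")
try:
    # Second insert of the same (category, name) pair violates the constraint.
    conn.execute("INSERT INTO features (category, name) VALUES ('ui', 'login')")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
row_count = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]
# blocked is True and row_count is 1: the duplicate never reached the table.
```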

rudiheydra added a commit to rudiheydra/AutoBuildr that referenced this pull request Feb 2, 2026
…AutoForgeAI#148)

- Updated display_derivation.py icon values to match rest of codebase
  (coding->code, testing->test-tube, refactoring->wrench)
- feature_compiler.py now imports TASK_TYPE_ICONS and DEFAULT_ICON from
  display_derivation instead of defining its own duplicate constants
- spec_builder.py already delegates to display_derivation (prior session)
- Updated all test assertions to match consolidated icon values
- Added 34 tests in test_feature_148_display_consolidation.py verifying:
  - Single source of truth (no duplicate definitions)
  - Consistent icon values across all code paths
  - Import chain correctness (feature_compiler -> display_derivation)
  - AgentSpecs created via any path get correct display values

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
rudiheydra added a commit to rudiheydra/AutoBuildr that referenced this pull request Feb 2, 2026
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>