@surengunturuamazon

Description of changes:
This repository contains an AWS Strands SDK implementation of a multi-agent solution evaluated on the taubench dataset, a benchmark for standardized performance testing and analysis. The demo runs taubench examples with both single-agent and multi-agent setups. It also includes quick evaluation methods that use Ragas metrics alongside agent-specific metrics such as tool calling percentage. We plan further edits in the new year to add the AgentCore evaluation that was recently announced at AWS re:Invent. This repository will be referenced in an AWS blog post that we plan to submit in the new year.
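
For readers unfamiliar with the tool calling percentage mentioned above, here is a minimal sketch. It assumes the metric is the share of expected (ground-truth) tool calls that the agent actually issued, matched by tool name only. The function name `tool_call_percentage` and the tool names are hypothetical and are not taken from the notebooks, which may define the metric differently.

```python
from collections import Counter


def tool_call_percentage(expected: list[str], actual: list[str]) -> float:
    """Percentage of expected tool calls that the agent actually made.

    Assumed definition (not taken from the notebooks): calls are matched by
    tool name only, and each actual call can satisfy at most one expected call.
    """
    if not expected:
        return 100.0  # nothing was required, so nothing was missed
    remaining = Counter(actual)
    matched = 0
    for name in expected:
        if remaining[name] > 0:
            remaining[name] -= 1
            matched += 1
    return 100.0 * matched / len(expected)


# Hypothetical tool names for an airline task.
expected_calls = ["get_user_details", "search_direct_flight", "book_reservation"]
actual_calls = ["get_user_details", "book_reservation", "get_user_details"]
score = tool_call_percentage(expected_calls, actual_calls)
```

In this example two of the three expected calls are matched; the repeated `get_user_details` call does not count twice.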

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

github-actions bot commented Jan 4, 2026

Latest scan for commit: 35944cc | Updated: 2026-01-04 06:12:09 UTC

✅ Security Scan Report (PR Files Only)

Scanned Files

  • 02-samples/17-multi-agent+evaluation-airline-assistant/ReadMe.md
  • 02-samples/17-multi-agent+evaluation-airline-assistant/requirements.txt
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/Evaluation.ipynb
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/awsStrands-multiAgent_multiTurn.ipynb
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/awsStrands-multiAgent_singleTurn.ipynb
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/awsStrands-multiAgent_swarm_multiTurn.ipynb
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/awsStrands-multiAgent_swarm_singleTurn.ipynb
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/awsStrands-singleAgent_multiTurn.ipynb
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/awsStrands-singleAgent_singleTurn.ipynb
  • 02-samples/17-multi-agent+evaluation-airline-assistant/taubench/src/createGT.py
  • ... and 326 more files

Security Scan Results

Critical: 0 | High: 0 | Medium: 0 | Low: 0 | Info: 0

Threshold: High

No security issues detected in your changes. Great job!

This scan only covers files changed in this PR.

@surengunturuamazon surengunturuamazon marked this pull request as draft January 9, 2026 20:56
@surengunturuamazon surengunturuamazon marked this pull request as ready for review January 9, 2026 20:56
@surengunturuamazon surengunturuamazon marked this pull request as draft January 9, 2026 20:57
Member

@cagataycali cagataycali left a comment

📋 Review Summary

This is a substantial contribution showcasing multi-agent evaluation using Taubench with the Strands SDK. The PR provides comprehensive examples of single-agent and multi-agent approaches with evaluation using Ragas metrics.

✅ Highlights

  1. Comprehensive Coverage: 7 Jupyter notebooks covering:

    • Single Agent (single/multi turn)
    • Multi Agent (single/multi turn)
    • Swarm Multi Agent (single/multi turn)
    • Evaluation notebook
  2. Real Benchmarking: Using Taubench provides standardized, reproducible performance testing.

  3. Ragas Integration: Including evaluation with Ragas metrics adds scientific rigor to the examples.

  4. Blog Reference: The mention of an upcoming AWS blog post suggests this is production-quality content.

💡 Suggestions

  1. CI Status: I notice CI isn't running - this may need maintainer approval to trigger the workflow.

  2. Notebook Outputs: Consider clearing notebook outputs before merge to reduce diff size (the notebooks add 9,000+ lines, most of them likely execution outputs).

  3. Requirements File: The requirements.txt with 4 dependencies is good - consider pinning versions for reproducibility.

  4. AgentCore Evaluation: The mention of future AgentCore evaluation integration is exciting - consider opening a follow-up issue to track this.
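
To illustrate the pinning suggestion, here is a rough sketch that lists the entries in a requirements file that are not pinned with `==`. The helper name `unpinned_requirements` and the package names are placeholders, not the sample's actual dependencies, and the regular expression only covers the common `name==version` form.

```python
import re

# Matches "name==version" with optional extras, e.g. "pkg[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Placeholder package names, not the sample's actual dependencies.
example = """\
package-a==1.2.3
package-b>=2.0   # lower bound only
package-c
package-d[extra]==0.4.1
"""
result = unpinned_requirements(example)
```

Running the same function on the text of the sample's `requirements.txt` would show which of the four dependencies still need an exact version.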

🎯 Recommendation

This is valuable content for the samples repository. Once CI passes, this should be merged. The multi-agent + evaluation combination fills an important gap in the examples.


Review by strands-coder autonomous agent 🤖

Member

@cagataycali cagataycali left a comment

🔍 Review by strands-coder

Thank you @surengunturuamazon for this comprehensive multi-agent evaluation sample!

📋 Summary

This PR adds a complete Taubench airline example demonstrating:

  • Single-agent and multi-agent implementations
  • Swarm-based multi-agent orchestration
  • Single-turn and multi-turn conversation patterns
  • RAGAS-based evaluation framework

📊 Scope

  • ~9,300 lines added across 30 files
  • 7 Jupyter notebooks covering different agent configurations
  • Evaluation framework with RAGAS integration
  • Comprehensive documentation

✅ Strengths

  1. Comprehensive coverage - Single/Multi-agent, Single/Multi-turn, Swarm patterns
  2. Evaluation focus - Integrates RAGAS metrics and tool calling percentage
  3. Blog-ready - Designed to support AWS blog post
  4. Future-proofed - Mentions AgentCore evaluation integration coming soon

💡 Suggestions (Non-blocking)

  1. Consider adding a setup.py or making it pip-installable for easier dependency management
  2. The RAGAS evaluation subfolder has its own LICENSE (Apache 2.0) - ensure this aligns with samples repo conventions
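  3. Large notebook files - consider adding .gitattributes with *.ipynb linguist-generated=true so that GitHub collapses notebook diffs by default and the PR is easier to read (linguist-detectable=false only removes notebooks from the repository language statistics)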
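
On the large notebook files, clearing execution outputs before committing is another option. The sketch below is one possible way to do it with only the standard library, assuming the notebooks follow the nbformat v4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`). The helper name `clear_outputs` is hypothetical, and `jupyter nbconvert --clear-output --inplace` achieves the same result from the command line.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat v4 notebook dict with outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny in-memory notebook used only for illustration.
demo = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print('hi')"],
            "execution_count": 3,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
        },
    ],
}
stripped = clear_outputs(demo)
```

To apply it to a file, load the `.ipynb` with `json.load`, pass the dict through `clear_outputs`, and write it back with `json.dump`.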

📝 Questions

  1. Are there specific Strands SDK version requirements for this sample?
  2. Will the AgentCore evaluation additions be in a follow-up PR?

Status: Looking good! Waiting for CI to run before final approval.


Review by strands-coder autonomous agent 🤖
