Running and Evaluating Multi-Agents Using AWS Strands - Taubench Airline Example #208
Conversation
Latest scan for commit: ✅ Security Scan Report (PR Files Only)
Security Scan Results
Threshold: High. No security issues detected in your changes. Great job! This scan only covers files changed in this PR.
cagataycali left a comment
📋 Review Summary
This is a substantial contribution showcasing multi-agent evaluation using Taubench with the Strands SDK. The PR provides comprehensive examples of single-agent and multi-agent approaches with evaluation using Ragas metrics.
✅ Highlights
- Comprehensive Coverage: 7 Jupyter notebooks covering:
  - Single Agent (single/multi turn)
  - Multi Agent (single/multi turn)
  - Swarm Multi Agent (single/multi turn)
  - Evaluation notebook
- Real Benchmarking: Using Taubench provides standardized, reproducible performance testing.
- Ragas Integration: Including evaluation with Ragas metrics adds scientific rigor to the examples.
- Blog Reference: The mention of an upcoming AWS blog post suggests this is production-quality content.
💡 Suggestions
- CI Status: I notice CI isn't running - this may need maintainer approval to trigger the workflow.
- Notebook Outputs: Consider clearing notebook outputs before merge to reduce diff size (the notebooks have ~9000+ lines of additions, likely from execution outputs); see the sketch after this list.
- Requirements File: The `requirements.txt` with 4 dependencies is good - consider pinning versions for reproducibility.
- AgentCore Evaluation: The mention of future AgentCore evaluation integration is exciting - consider opening a follow-up issue to track this.
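For the notebook-output suggestion above, here is a minimal sketch of one way to clear outputs before committing, assuming `nbformat` is installed (`jupyter nbconvert --clear-output --inplace` or `nbstripout` would do the same job); the recursive glob over the current directory is an assumption about where this sample's notebooks live, not part of the PR.

```python
# Minimal sketch: strip stored execution outputs from the sample notebooks
# before committing so the PR diff stays reviewable.
# Assumes nbformat is installed; the glob below is a placeholder for
# wherever this sample's notebooks actually live.
from pathlib import Path

import nbformat


def clear_outputs(notebook_path: Path) -> None:
    nb = nbformat.read(str(notebook_path), as_version=4)
    for cell in nb.cells:
        if cell.cell_type == "code":
            cell.outputs = []            # drop saved execution outputs
            cell.execution_count = None  # reset the execution counter
    nbformat.write(nb, str(notebook_path))


if __name__ == "__main__":
    for path in Path(".").rglob("*.ipynb"):
        clear_outputs(path)
        print(f"Cleared outputs in {path}")
```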
🎯 Recommendation
This is valuable content for the samples repository. Once CI passes, this should be merged. The multi-agent + evaluation combination fills an important gap in the examples.
Review by strands-coder autonomous agent 🤖
cagataycali left a comment
🔍 Review by strands-coder
Thank you @surengunturuamazon for this comprehensive multi-agent evaluation sample!
📋 Summary
This PR adds a complete Taubench airline example demonstrating:
- Single-agent and multi-agent implementations
- Swarm-based multi-agent orchestration
- Single-turn and multi-turn conversation patterns
- RAGAS-based evaluation framework
📊 Scope
- ~9,300 lines added across 30 files
- 7 Jupyter notebooks covering different agent configurations
- Evaluation framework with RAGAS integration
- Comprehensive documentation
✅ Strengths
- Comprehensive coverage - Single/Multi-agent, Single/Multi-turn, Swarm patterns
- Evaluation focus - Integrates RAGAS metrics and tool calling percentage
- Blog-ready - Designed to support AWS blog post
- Future-proofed - Mentions AgentCore evaluation integration coming soon
💡 Suggestions (Non-blocking)
- Consider adding a `setup.py` or making it pip-installable for easier dependency management
- The RAGAS evaluation subfolder has its own LICENSE (Apache 2.0) - ensure this aligns with samples repo conventions
- Large notebook files - consider adding `.gitattributes` with `*.ipynb linguist-detectable=false` to improve PR readability
📝 Questions
- Are there specific Strands SDK version requirements for this sample?
- Will the AgentCore evaluation additions be in a follow-up PR?
Status: Looking good! Waiting for CI to run before final approval.
Review by strands-coder autonomous agent 🤖
Description of changes:
This repository contains an AWS Strands SDK implementation of a multi-agent solution on the Taubench dataset. Taubench is a benchmarking dataset that allows for standardized performance testing and analysis. This demo uses single-agent and multi-agent setups to run Taubench examples, and it also includes quick methods for evaluation using Ragas metrics as well as agent-specific evaluation methods such as tool calling percentage. We will make further edits in the new year to include the AgentCore evaluation capability that was recently announced at AWS re:Invent. This repository will be referenced in an AWS blog post we plan to submit in the new year.
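To make the agent-specific evaluation idea concrete, here is a hedged, illustrative sketch of how a tool-calling-percentage metric can be computed; the function name, inputs, and example tool names are hypothetical and are not taken from the notebooks in this PR.

```python
# Hypothetical sketch of an agent-specific metric: the percentage of
# expected tool calls that the agent actually made during a Taubench task.
# The inputs below are illustrative, not the sample's real schema.
from typing import Iterable


def tool_calling_percentage(expected_tools: Iterable[str],
                            actual_tool_calls: Iterable[str]) -> float:
    """Return the share of expected tools the agent invoked at least once."""
    expected = set(expected_tools)
    if not expected:
        return 100.0  # nothing was required, so the agent cannot fall short
    called = set(actual_tool_calls)
    return 100.0 * len(expected & called) / len(expected)


# Example: the task expects three airline tools, the agent called two of them.
score = tool_calling_percentage(
    expected_tools=["search_flights", "update_reservation", "send_confirmation"],
    actual_tool_calls=["search_flights", "update_reservation"],
)
print(f"Tool calling percentage: {score:.1f}%")  # -> 66.7%
```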
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.