Running and Evaluating Multi-Agents Using AWS Strands - Taubench Airline Example #208
Conversation
Latest scan for commit: ✅ Security Scan Report (PR Files Only)
Security Scan Results
Threshold: High. No security issues detected in your changes. Great job! This scan only covers files changed in this PR.
cagataycali left a comment
📋 Review Summary
This is a substantial contribution showcasing multi-agent evaluation using Taubench with the Strands SDK. The PR provides comprehensive examples of single-agent and multi-agent approaches with evaluation using Ragas metrics.
✅ Highlights
- Comprehensive Coverage: 7 Jupyter notebooks covering:
  - Single Agent (single/multi turn)
  - Multi Agent (single/multi turn)
  - Swarm Multi Agent (single/multi turn)
  - Evaluation notebook
- Real Benchmarking: Using Taubench provides standardized, reproducible performance testing.
- Ragas Integration: Including evaluation with Ragas metrics adds scientific rigor to the examples.
- Blog Reference: The mention of an upcoming AWS blog post suggests this is production-quality content.
💡 Suggestions
- CI Status: I notice CI isn't running - this may need maintainer approval to trigger the workflow.
- Notebook Outputs: Consider clearing notebook outputs before merge to reduce diff size (the notebooks have ~9000+ lines of additions, likely from execution outputs); see the sketch after this list.
- Requirements File: The `requirements.txt` with 4 dependencies is good - consider pinning versions for reproducibility.
- AgentCore Evaluation: The mention of future AgentCore evaluation integration is exciting - consider opening a follow-up issue to track this.
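For the notebook-output suggestion above, here is a minimal sketch of one way to clear outputs before committing, assuming `nbformat` is installed (`jupyter nbconvert --clear-output --inplace` or `nbstripout` would do the same job); the recursive glob over the current directory is an assumption about where this sample's notebooks live, not part of the PR.

```python
# Minimal sketch: strip stored execution outputs from the sample notebooks
# before committing so the PR diff stays reviewable.
# Assumes nbformat is installed; the glob below is a placeholder for
# wherever this sample's notebooks actually live.
from pathlib import Path

import nbformat


def clear_outputs(notebook_path: Path) -> None:
    nb = nbformat.read(str(notebook_path), as_version=4)
    for cell in nb.cells:
        if cell.cell_type == "code":
            cell.outputs = []            # drop saved execution outputs
            cell.execution_count = None  # reset the execution counter
    nbformat.write(nb, str(notebook_path))


if __name__ == "__main__":
    for path in Path(".").rglob("*.ipynb"):
        clear_outputs(path)
        print(f"Cleared outputs in {path}")
```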
🎯 Recommendation
This is valuable content for the samples repository. Once CI passes, this should be merged. The multi-agent + evaluation combination fills an important gap in the examples.
Review by strands-coder autonomous agent 🤖
cagataycali left a comment
🔍 Review by strands-coder
Thank you @surengunturuamazon for this comprehensive multi-agent evaluation sample!
📋 Summary
This PR adds a complete Taubench airline example demonstrating:
- Single-agent and multi-agent implementations
- Swarm-based multi-agent orchestration
- Single-turn and multi-turn conversation patterns
- RAGAS-based evaluation framework
📊 Scope
- ~9,300 lines added across 30 files
- 7 Jupyter notebooks covering different agent configurations
- Evaluation framework with RAGAS integration
- Comprehensive documentation
✅ Strengths
- Comprehensive coverage - Single/Multi-agent, Single/Multi-turn, Swarm patterns
- Evaluation focus - Integrates RAGAS metrics and tool calling percentage
- Blog-ready - Designed to support AWS blog post
- Future-proofed - Mentions AgentCore evaluation integration coming soon
💡 Suggestions (Non-blocking)
- Consider adding a `setup.py` or making it pip-installable for easier dependency management
- The RAGAS evaluation subfolder has its own LICENSE (Apache 2.0) - ensure this aligns with samples repo conventions
- Large notebook files - consider adding `.gitattributes` with `*.ipynb linguist-detectable=false` to improve PR readability
📝 Questions
- Are there specific Strands SDK version requirements for this sample?
- Will the AgentCore evaluation additions be in a follow-up PR?
Status: Looking good! Waiting for CI to run before final approval.
Review by strands-coder autonomous agent 🤖
Description of changes:
This repository contains an AWS Strands SDK implementation of a multi-agent solution on the Taubench dataset. Taubench is a benchmarking dataset that allows for standardized performance testing and analysis. This demo uses single-agent and multi-agent setups to run Taubench examples, and it also includes quick methods for evaluation using Ragas metrics as well as agent-specific evaluation methods such as tool calling percentage. We will make further edits in the new year to include the AgentCore evaluation capability that was recently announced at AWS re:Invent. This repository will be referenced in an AWS blog post we plan to submit in the new year.
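To make the agent-specific evaluation idea concrete, here is a hedged, illustrative sketch of how a tool-calling-percentage metric can be computed; the function name, inputs, and example tool names are hypothetical and are not taken from the notebooks in this PR.

```python
# Hypothetical sketch of an agent-specific metric: the percentage of
# expected tool calls that the agent actually made during a Taubench task.
# The inputs below are illustrative, not the sample's real schema.
from typing import Iterable


def tool_calling_percentage(expected_tools: Iterable[str],
                            actual_tool_calls: Iterable[str]) -> float:
    """Return the share of expected tools the agent invoked at least once."""
    expected = set(expected_tools)
    if not expected:
        return 100.0  # nothing was required, so the agent cannot fall short
    called = set(actual_tool_calls)
    return 100.0 * len(expected & called) / len(expected)


# Example: the task expects three airline tools, the agent called two of them.
score = tool_calling_percentage(
    expected_tools=["search_flights", "update_reservation", "send_confirmation"],
    actual_tool_calls=["search_flights", "update_reservation"],
)
print(f"Tool calling percentage: {score:.1f}%")  # -> 66.7%
```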
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.