Skip to content

Conversation

@amaslenn
Copy link
Contributor

Summary

Stream API can be sensitive to network issues, delays, broken ping, etc. Using kubectl is more robust for this case.

Test Plan

  1. CI
  2. Manual run on an internal cluster.

Additional Notes

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

The change replaces in-cluster Kubernetes API streaming with a subprocess.run call executing kubectl exec directly. Command output is written to genai_perf.log with enhanced error handling, and exception handling is broadened from specific API exceptions to general exceptions with debug-level logging.

Changes

Cohort / File(s) Summary
Kubernetes Execution Method
src/cloudai/systems/kubernetes/kubernetes_system.py
Replaced lazy.k8s.stream-based execution with subprocess.run calling kubectl exec; modified output capture to write stdout/stderr to genai_perf.log; broadened exception handling and shifted error logging to debug level; added exit code logging.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 The stream once flowed through API gates,
Now subprocess hops and navigates,
kubectl exec takes the lead with grace,
Output captured in its rightful place,
Exceptions caught with softer paws! 🌿

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: replacing Stream API with kubectl for more robust Kubernetes bench execution for Dynamo.
Description check ✅ Passed The description is directly related to the changeset, explaining the motivation (Stream API sensitivity to network issues) and the solution (using kubectl instead).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch am/kube-exec

Tip

🧪 Unit Test Generation v2 is now available!

We have significantly improved our unit test generation capabilities.

To enable: Add this to your .coderabbit.yaml configuration:

reviews:
  finishing_touches:
    unit_tests:
      enabled: true

Try it out by using the @coderabbitai generate unit tests command on your code files or under ✨ Finishing Touches on the walkthrough!

Have feedback? Share your thoughts on our Discord thread!


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

Replaced Kubernetes Python Stream API with kubectl exec subprocess for executing genai-perf benchmarks in Dynamo frontend pods. This improves robustness against network issues and API timeouts.

Key Changes:

  • Switched from lazy.k8s.stream.stream() to subprocess.run() with kubectl
  • Changed exception handling from specific ApiException to generic Exception
  • Added separate stderr output capture in logs
  • Downgraded error logging to debug level

Issues Found:

  • Return code not validated - failures proceed silently
  • Error logging downgraded from error to debug level reduces visibility
  • Generic exception handling masks specific failure types

Confidence Score: 3/5

  • Safe to merge with minor improvements recommended for error handling
  • The approach is sound and addresses the stated network robustness issue, but missing return code validation and downgraded error logging could hide failures in production
  • Pay close attention to kubernetes_system.py:316-324 for error handling improvements

Important Files Changed

Filename Overview
src/cloudai/systems/kubernetes/kubernetes_system.py Replaced Kubernetes Stream API with kubectl subprocess for better network resilience, but missing return code validation

Sequence Diagram

sequenceDiagram
    participant System as KubernetesSystem
    participant Subprocess as subprocess.run
    participant Kubectl as kubectl exec
    participant Pod as Frontend Pod
    participant GenAI as genai-perf

    System->>System: _run_genai_perf(job)
    System->>System: _get_dynamo_pod_by_role("frontend")
    System->>System: Build kubectl_exec_cmd
    System->>Subprocess: run(kubectl_exec_cmd, timeout=600)
    Subprocess->>Kubectl: Execute kubectl exec
    Kubectl->>Pod: Connect to pod
    Pod->>GenAI: Run genai-perf command
    GenAI-->>Pod: stdout/stderr output
    Pod-->>Kubectl: Return output
    Kubectl-->>Subprocess: Return result
    Subprocess-->>System: Return result object
    System->>System: Log return code
    System->>System: Write stdout to genai_perf.log
    alt stderr exists
        System->>System: Append stderr to log
    end
    System->>Subprocess: kubectl cp results
    System->>System: Copy genai-perf results locally
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 316-324: The current subprocess.run call in kubernetes_system.py
(the try block around kubectl_exec_cmd) only logs result.returncode and swallows
failures; change it to either pass check=True to subprocess.run or explicitly
check result.returncode and raise a subprocess.CalledProcessError when non-zero,
and re-raise TimeoutExpired and other exceptions instead of only logging them so
failures propagate; update the except block to log and then raise the caught
exception (so _genai_perf_completed is only set when the command actually
succeeds).
- Around line 313-316: The kubectl subprocess command built in kubectl_exec_cmd
does not pass the resolved kubeconfig, so kubectl may target the wrong cluster
when self.kube_config_path differs from the default; update the construction of
kubectl_exec_cmd (the list used in subprocess.run for running genai_perf_cmd) to
include the flag "--kubeconfig" followed by self.kube_config_path (or its
resolved path used by load_kube_config()) before the "exec" invocation so the
subprocess uses the same kubeconfig as the API client.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant