@Fredi-raspall
Contributor

Closes: #248
Unsure if this closes #249

@Fredi-raspall requested a review from a team as a code owner on January 28, 2026 22:55
@github-actions
github-actions bot commented Jan 28, 2026

🚀 Deployed on https://preview-259--hedgehog-docs.netlify.app

@Fredi-raspall force-pushed the pr/fredi/gw_failover branch 2 times, most recently from 6947ea5 to 2e2879c on January 29, 2026 09:24
@qmonnet requested a review from Copilot on January 29, 2026 09:43

Copilot AI left a comment

Pull request overview

This PR extends the user guide to document gateway redundancy/fail-over behavior and integrates the new material into the navigation and existing gateway docs. It also slightly refines existing gateway-related titles to better reflect their scope.

Changes:

  • Add a dedicated “Gateway fail-over and redundancy” user-guide page explaining gateway groups, traffic mapping, and fail-over behavior.
  • Link the new page from the overview and the .pages navigation under a new “Gateway” section.
  • Retitle the main gateway and gateway-add docs to “Gateway overview” and “Adding Gateways to the fabric” for clearer context.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/user-guide/overview.md | Adds a TOC entry pointing to the new gateway-failover documentation so users can discover redundancy guidance. |
| docs/user-guide/gateway.md | Renames the main heading to “Gateway overview” to clarify that this page introduces gateway concepts now complemented by a separate fail-over page. |
| docs/user-guide/gateway-failover.md | Introduces detailed documentation for gateway redundancy, gateway groups, traffic mapping, and fail-over behavior, including configuration snippets and design rationale. |
| docs/user-guide/gateway-add.md | Updates the title to “Adding Gateways to the fabric” to align with a more general multi-gateway deployment story. |
| docs/user-guide/.pages | Groups gateway-related docs under a “Gateway” nav section and includes the new fail-over page, improving navigation around gateway topics. |

Since there always exists a *default* gateway group (containing all of the gateways), this Active-Backup behavior is (for any number of gateways) the default when no additional configuration is provided.

### Customizing fail-over setups: traffic mapping to gateway groups
With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.

Copilot AI Jan 29, 2026

The phrase "VPC peerings can be specified a gatewayGroup" is ungrammatical; this should read something like "can be specified with a gatewayGroup" or "can specify a gatewayGroup" for clarity.

Suggested change
With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.
With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified with a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.

Copilot uses AI. Check for mistakes.
Comment on lines 17 to 24
apiVersion: v1
items:
- apiVersion: gateway.githedgehog.com/v1alpha1
  kind: GatewayGroup
  metadata:
    name: group-1
    namespace: default
  spec: {}

Copilot AI Jan 29, 2026

The GatewayGroup YAML example here uses a top-level apiVersion: v1 with an items: list but no kind: List, which is not a directly usable Kubernetes manifest; it would be clearer and more accurate either to show a single GatewayGroup object on its own, or to include the missing kind: List and associated fields so that the snippet matches real kubectl output.

Suggested change
apiVersion: v1
items:
- apiVersion: gateway.githedgehog.com/v1alpha1
  kind: GatewayGroup
  metadata:
    name: group-1
    namespace: default
  spec: {}
apiVersion: gateway.githedgehog.com/v1alpha1
kind: GatewayGroup
metadata:
  name: group-1
  namespace: default
spec: {}

@@ -0,0 +1,106 @@
# Gateway fail-over and redundancy
## Overview
When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.

Copilot AI Jan 29, 2026

"HedgeHog Fabric" here is inconsistent with the rest of the docs (and even with "Hedgehog Fabric" a few lines below); to keep branding consistent, this should use the same capitalization as elsewhere ("Hedgehog Fabric").

Suggested change
When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.
When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a Hedgehog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.

With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.

!!! info inline end
A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the the *default* group. This is why, without any configuration, the default behavior is Active-Backup.

Copilot AI Jan 29, 2026

There is a duplicated word in this sentence ("assigns it the the default group"); one of the "the" tokens should be removed.

Suggested change
A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the the *default* group. This is why, without any configuration, the default behavior is Active-Backup.
A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the *default* group. This is why, without any configuration, the default behavior is Active-Backup.

Member

@qmonnet left a comment

Great document!

I'm usually picky with the style in the docs (Logan knows something about it), so I've got tons of nitpicks, but nothing major.

I'd also wrap the text on 80-character lines as I find it easier to diff and work with smaller lines, although I'm not sure we have a consensus about that.

One comment would be to remain careful with the number of admonitions (!!! note) in the document. It's good to have a few to insert visual pauses in long sections, but having too many may break the flow. You have quite a number of notes, and I think some of them could be regular paragraphs, which would help with overall readability.

When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.

!!! note
When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well.
Member

Suggested change
When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well.
Gateway "failures" do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against all of these failures.

When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well.

### Gateway groups
Gateway fail-over strategies build on the concept of `gateway groups`. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members).
Member

Suggested change
Gateway fail-over strategies build on the concept of `gateway groups`. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members).
Gateway fail-over strategies build on the concept of *gateway groups*. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members).

This is not a literal coming from the code or from API definitions, so I'd just use emphasis rather than formatting it as a literal.

Comment on lines 12 to 14
All gateways in a fabric are members of, at least, one *default* group that always exists.

Declaring gateway groups is done by means of the *GatewayGroup* object. The following sample snippet shows the declaration of a group called **group-1**.
Member

Suggested change
All gateways in a fabric are members of, at least, one *default* group that always exists.
Declaring gateway groups is done by means of the *GatewayGroup* object. The following sample snippet shows the declaration of a group called **group-1**.
All gateways in a fabric are members of, at least, one `default` group that always exists.
Declaring gateway groups is done by means of the `GatewayGroup` object. The following sample snippet shows the declaration of a group called `group-1`.

Conversely, these are keywords or names used in the YAML, so I'd format as literals.

Same comment applies to the rest of the document.

spec:
  asn: 65534
  groups:
  - name: default # all gateways belong to this group
Member

Does this mean this field is mandatory? If so, add it in the comment? (“mandatory - all gateways belong ...”)

!!! note
The value of the priority specified by a gateway within a group has no significance in absolute terms. For instance, configuring two gateways as members of the same group with priorities 200 and 100 has the same effect as configuring them with priorities 29 and 3.

The priorities within a group not only allow indicating a preference, but also making sure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future.
Member

I don't think the “not only...” brings any useful info here, I'd discard that part.

Suggested change
The priorities within a group not only allow indicating a preference, but also making sure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future.
Priorities within a group ensure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future.
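
To illustrate the note above about priorities being relative rather than absolute, here is a minimal Python sketch. It is a hypothetical model, not Hedgehog code: the function name `preferred_gateway` and the gateway names `gw-1` and `gw-2` are assumptions made for the example, and ties between equal priorities are not modeled.

```python
from typing import Dict, Optional, Set


def preferred_gateway(priorities: Dict[str, int], operational: Set[str]) -> Optional[str]:
    """Return the operational group member with the highest priority.

    `priorities` maps a (hypothetical) gateway name to its priority in one
    group. Only the ordering of the values matters, not their magnitude.
    """
    candidates = {gw: prio for gw, prio in priorities.items() if gw in operational}
    if not candidates:
        return None  # no member of the group can serve traffic
    return max(candidates, key=candidates.get)


all_up = {"gw-1", "gw-2"}

# Priorities 200/100 and 29/3 rank the gateways identically.
print(preferred_gateway({"gw-1": 200, "gw-2": 100}, all_up))
print(preferred_gateway({"gw-1": 29, "gw-2": 3}, all_up))

# If the preferred member is not operational, the next one takes over.
print(preferred_gateway({"gw-1": 200, "gw-2": 100}, {"gw-2"}))
```

Because exactly one member is returned at any time, the sketch also reflects the constraint that a single gateway within a group handles a given flow.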

!!! tip
Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows gracefully pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups.

## Fail-over behind the scenes and recommendations
Member

Suggested change
## Fail-over behind the scenes and recommendations
## Fail-over under the hood and recommendations

I think “behind the scenes” would rather mean “in secret” or “without the user noticing”.

Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows gracefully pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups.

## Fail-over behind the scenes and recommendations
The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (e.g. leaf switches) select the gateway that will handle each packet based on those. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway.
Member

Suggested change
The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (e.g. leaf switches) select the gateway that will handle each packet based on those. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway.
The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (such as leaf switches) select the gateway that handles each packet based on those prefixes and priorities. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway.
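
The edge-side selection described in this paragraph can be sketched as follows. This is a simplified Python model under stated assumptions: the `Route` type, the prefix `10.0.2.0/24`, and the gateway names are hypothetical, and the sketch compares a plain integer priority instead of modeling how priorities are actually carried in BGP, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Route:
    prefix: str    # peering destination (hypothetical value below)
    gateway: str   # gateway that advertised the route, also the next hop
    priority: int  # priority of that gateway within the peering's group


def active_route(routes: List[Route], prefix: str, reachable: Set[str]) -> Optional[Route]:
    """Pick the single active route for a prefix among the K candidates."""
    usable = [r for r in routes if r.prefix == prefix and r.gateway in reachable]
    return max(usable, key=lambda r: r.priority, default=None)


# A group with K = 3 members yields three routes for each peering destination.
routes = [
    Route("10.0.2.0/24", "gw-1", 300),
    Route("10.0.2.0/24", "gw-2", 200),
    Route("10.0.2.0/24", "gw-3", 100),
]

print(active_route(routes, "10.0.2.0/24", {"gw-1", "gw-2", "gw-3"}))
# When gw-1 stops being reachable, the edge device falls back to gw-2.
print(active_route(routes, "10.0.2.0/24", {"gw-2", "gw-3"}))
```

The fail-over decision is local to each edge device, which is why detection speed on the edge determines the downtime.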

Because edge devices perform the fail-over when a preferred gateway ceases to be reachable, downtime on failure depends on how fast edge devices reckon the anomaly. To expedite the failure detection and minimize the volume of traffic blackholed, it is recommended to enable BFD on all gateway links.

!!! note "Takeaways and configuration summary"
Redundancy works out of the box in an active-backup fashion. Customizing it generally requires:
Member

Suggested change
Redundancy works out of the box in an active-backup fashion. Customizing it generally requires:
Redundancy works out of the box in an Active-Backup fashion. To customize redundancy's behavior:

Case: update for consistency with the rest of the doc.

Comment on lines +103 to +104
1. declaring gateway groups (**GatewayGroup** objects) depending on the number of gateways available.
2. Assigning gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups.
3. Mapping the VPC peerings to the groups defined.
4. Enabling BFD on gateway links.
Member

Suggested change
1. declaring gateway groups (**GatewayGroup** objects) depending on the number of gateways available.
2. Assigning gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups.
3. Mapping the VPC peerings to the groups defined.
4. Enabling BFD on gateway links.
1. Declare gateway groups (**GatewayGroup** objects) depending on the number of gateways available.
2. Assign gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups.
3. Map the VPC peerings to the groups defined.
4. Enable BFD on gateway links.
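
Steps 1 to 3 of this summary can be modeled with a short Python sketch. It assumes two gateways and two groups in which each gateway has the high priority in exactly one group. Apart from `group-1`, all names (`group-2`, the gateway names, the peering names) are hypothetical, and the dictionaries stand in for the real objects rather than reproducing their schema.

```python
from typing import Dict, Optional, Set

# Steps 1 and 2: groups with member priorities; each gateway is preferred in one group.
groups: Dict[str, Dict[str, int]] = {
    "group-1": {"gw-1": 200, "gw-2": 100},
    "group-2": {"gw-1": 100, "gw-2": 200},
}

# Step 3: each VPC peering is mapped to one group.
peerings: Dict[str, str] = {
    "vpc-1--vpc-2": "group-1",
    "vpc-3--vpc-4": "group-2",
}


def serving_gateways(operational: Set[str]) -> Dict[str, Optional[str]]:
    """Return, per peering, the operational gateway with the highest priority."""
    result: Dict[str, Optional[str]] = {}
    for peering, group in peerings.items():
        members = {gw: p for gw, p in groups[group].items() if gw in operational}
        result[peering] = max(members, key=members.get) if members else None
    return result


# Both gateways up: the two peerings are served by different gateways.
print(serving_gateways({"gw-1", "gw-2"}))
# gw-1 down: both peerings move to gw-2.
print(serving_gateways({"gw-2"}))
```

With this layout neither gateway sits idle in normal operation, while each still backs up the other.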


Because edge devices perform the fail-over when a preferred gateway ceases to be reachable, downtime on failure depends on how fast edge devices reckon the anomaly. To expedite the failure detection and minimize the volume of traffic blackholed, it is recommended to enable BFD on all gateway links.

!!! note "Takeaways and configuration summary"
Member

I like these ❤️

Signed-off-by: Fredi Raspall <fredi@githedgehog.com>
Development

Successfully merging this pull request may close these issues.

  • Update all docs to suggest redundant gateway setup
  • Document how Gateway redundancy works

3 participants