user-guide: add gateway-failover documentation #259

Fredi-raspall · 2026-01-28T22:55:53Z

Closes: #248
Unsure if this closes #249

github-actions · 2026-01-28T22:56:31Z

🚀 Deployed on https://preview-259--hedgehog-docs.netlify.app

Copilot

Pull request overview

This PR extends the user guide to document gateway redundancy/fail-over behavior and integrates the new material into the navigation and existing gateway docs. It also slightly refines existing gateway-related titles to better reflect their scope.

Changes:

- Add a dedicated “Gateway fail-over and redundancy” user-guide page explaining gateway groups, traffic mapping, and fail-over behavior.
- Link the new page from the overview and the .pages navigation under a new “Gateway” section.
- Retitle the main gateway and gateway-add docs to “Gateway overview” and “Adding Gateways to the fabric” for clearer context.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/user-guide/overview.md | Adds a TOC entry pointing to the new gateway-failover documentation so users can discover redundancy guidance. |
| docs/user-guide/gateway.md | Renames the main heading to “Gateway overview” to clarify that this page introduces gateway concepts now complemented by a separate fail-over page. |
| docs/user-guide/gateway-failover.md | Introduces detailed documentation for gateway redundancy, gateway groups, traffic mapping, and fail-over behavior, including configuration snippets and design rationale. |
| docs/user-guide/gateway-add.md | Updates the title to “Adding Gateways to the fabric” to align with a more general multi-gateway deployment story. |
| docs/user-guide/.pages | Groups gateway-related docs under a “Gateway” nav section and includes the new fail-over page, improving navigation around gateway topics. |

Copilot · 2026-01-29T09:46:36Z

docs/user-guide/gateway-failover.md

+    Since there always exists a *default* gateway group (containing all of the gateways), this Active-Backup behavior is (for any number of gateways) the default when no additional configuration is provided.
+
+### Customizing fail-over setups: traffic mapping to gateway groups
+With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.


The phrase "VPC peerings can be specified a gatewayGroup" is ungrammatical; this should read something like "can be specified with a gatewayGroup" or "can specify a gatewayGroup" for clarity.

Suggested change

With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.

With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified with a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.

Copilot · 2026-01-29T09:46:37Z

docs/user-guide/gateway-failover.md

+apiVersion: v1
+items:
+- apiVersion: gateway.githedgehog.com/v1alpha1
+  kind: GatewayGroup
+  metadata:
+    name: group-1
+    namespace: default
+  spec: {}


The GatewayGroup YAML example here uses a top-level apiVersion: v1 with an items: list but no kind: List, which is not a directly usable Kubernetes manifest; it would be clearer and more accurate either to show a single GatewayGroup object on its own, or to include the missing kind: List and associated fields so that the snippet matches real kubectl output.

Suggested change

apiVersion: v1
items:
- apiVersion: gateway.githedgehog.com/v1alpha1
  kind: GatewayGroup
  metadata:
    name: group-1
    namespace: default
  spec: {}

apiVersion: gateway.githedgehog.com/v1alpha1
kind: GatewayGroup
metadata:
  name: group-1
  namespace: default
spec: {}

Copilot · 2026-01-29T09:46:37Z

docs/user-guide/gateway-failover.md

@@ -0,0 +1,106 @@
+# Gateway fail-over and redundancy
+## Overview
+When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.


"HedgeHog Fabric" here is inconsistent with the rest of the docs (and even with "Hedgehog Fabric" a few lines below); to keep branding consistent, this should use the same capitalization as elsewhere ("Hedgehog Fabric").

Suggested change

When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.

When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a Hedgehog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.

Copilot · 2026-01-29T09:46:37Z

docs/user-guide/gateway-failover.md

+With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below.
+
+!!! info inline end
+    A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the the *default* group. This is why, without any configuration, the default behavior is Active-Backup.


There is a duplicated word in this sentence ("assigns it the the default group"); one of the "the" tokens should be removed.

Suggested change

A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the the *default* group. This is why, without any configuration, the default behavior is Active-Backup.

A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the *default* group. This is why, without any configuration, the default behavior is Active-Backup.

qmonnet

Great document!

I'm usually picky with the style in the docs (Logan knows something about it), so I've got tons of nitpicks, but nothing major.

I'd also wrap the text on 80-character lines as I find it easier to diff and work with smaller lines, although I'm not sure we have a consensus about that.

One comment would be to remain careful with the number of admonitions (!!! note) in the document. It's good to have a few to insert visual pauses in long sections, but having too many may break the flow. You have quite a number of notes, and I think some of them could be regular paragraphs, which would help with overall readability.

qmonnet · 2026-01-29T10:49:33Z

docs/user-guide/gateway-failover.md

+When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next.
+
+!!! note
+    When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable  by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well.


Suggested change

When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well.

Gateway "failures" do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against all of these failures.

qmonnet · 2026-01-29T10:50:30Z

docs/user-guide/gateway-failover.md

+    When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable  by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well.
+
+### Gateway groups
+Gateway fail-over strategies build on the concept of `gateway groups`. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members).


Suggested change

Gateway fail-over strategies build on the concept of `gateway groups`. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members).

Gateway fail-over strategies build on the concept of *gateway groups*. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members).

This is not a literal coming from the code or from API definitions, so I'd just use emphasis rather than formatting it as a literal.

qmonnet · 2026-01-29T10:51:14Z

docs/user-guide/gateway-failover.md

+	All gateways in a fabric are members of, at least, one *default* group that always exists.
+
+Declaring gateway groups is done by means of the *GatewayGroup* object. The following sample snippet shows the declaration of a group called **group-1**.


Suggested change

All gateways in a fabric are members of, at least, one *default* group that always exists.

Declaring gateway groups is done by means of the *GatewayGroup* object. The following sample snippet shows the declaration of a group called **group-1**.

All gateways in a fabric are members of, at least, one `default` group that always exists.

Declaring gateway groups is done by means of the `GatewayGroup` object. The following sample snippet shows the declaration of a group called `group-1`.

Conversely, these are keywords or names used in the YAML, so I'd format as literals.

Same comment applies to the rest of the document.

qmonnet · 2026-01-29T10:53:05Z

docs/user-guide/gateway-failover.md

+spec:
+  asn: 65534
+  groups:
+    - name: default # all gateways belong to this group


Does this mean this field is mandatory? If so, add it in the comment? (“mandatory - all gateways belong ...”)

qmonnet · 2026-01-29T11:00:53Z

docs/user-guide/gateway-failover.md

+!!! note
+	The value of the priority specified by a gateway within a group has no significance in absolute terms. For instance, configuring two gateways as members of the same group with priorities 200 and 100 has the same effect as configuring them with priorities 29 and 3.
+
+The priorities within a group not only allow indicating a preference, but also making sure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future.


I don't think the “not only...” brings any useful info here, I'd discard that part.

Suggested change

The priorities within a group not only allow indicating a preference, but also making sure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future.

Priorities within a group ensure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future.
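
The note quoted above, that only the relative order of priorities matters, can be illustrated with a minimal Python sketch. It is a model only: the `preferred_gateway` helper and the names `gw-1` and `gw-2` are hypothetical and not part of the Hedgehog API, and ties between equal priorities are not modeled because the text does not describe them.

```python
from typing import Optional


def preferred_gateway(priorities: dict[str, int], operational: set[str]) -> Optional[str]:
    """Return the operational group member with the highest priority, or None."""
    candidates = [gw for gw in priorities if gw in operational]
    if not candidates:
        return None
    # Only the ordering of the values matters, not their magnitude.
    return max(candidates, key=lambda gw: priorities[gw])


up = {"gw-1", "gw-2"}  # hypothetical gateway names
# Priorities 200/100 and 29/3 rank the gateways identically.
assert preferred_gateway({"gw-1": 200, "gw-2": 100}, up) == "gw-1"
assert preferred_gateway({"gw-1": 29, "gw-2": 3}, up) == "gw-1"
# Exactly one member is selected at a time, matching the single-active-gateway rule.
assert preferred_gateway({"gw-1": 29, "gw-2": 3}, {"gw-2"}) == "gw-2"
```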

qmonnet · 2026-01-29T11:23:31Z

docs/user-guide/gateway-failover.md

+!!! tip
+	Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows gracefully pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups.
+
+## Fail-over behind the scenes and recommendations


Suggested change

## Fail-over behind the scenes and recommendations

## Fail-over under the hood and recommendations

I think “behind the scenes” would rather mean “in secret” or “without the user noticing”.

qmonnet · 2026-01-29T11:25:48Z

docs/user-guide/gateway-failover.md

+	Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows gracefully pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups.
+
+## Fail-over behind the scenes and recommendations
+The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (e.g. leaf switches)  select the gateway that will handle each packet based on those. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway.


Suggested change

The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (e.g. leaf switches) select the gateway that will handle each packet based on those. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway.

The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (such as leaf switches) select the gateway that handles each packet based on those prefixes and priorities. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway.
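
As a rough illustration of the paragraph above, the following Python sketch models an edge device holding K routes (one per gateway) for a peering destination and keeping only one of them active. It is an assumption-laden model, not the actual implementation: the `Route` type, the `active_route` helper, the prefix and the gateway names are hypothetical, and how priorities are carried in BGP is not stated in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Route:
    prefix: str    # peering destination
    gateway: str   # gateway that advertised the route
    priority: int  # priority of that gateway in the peering's group


def active_route(routes: Iterable[Route], prefix: str, reachable: set[str]) -> Optional[Route]:
    """Among the routes for a prefix, pick the one from the preferred reachable gateway."""
    usable = [r for r in routes if r.prefix == prefix and r.gateway in reachable]
    return max(usable, key=lambda r: r.priority, default=None)


# K = 3 routes for the same destination, one per group member.
routes = [
    Route("10.0.2.0/24", "gw-1", 20),
    Route("10.0.2.0/24", "gw-2", 10),
    Route("10.0.2.0/24", "gw-3", 5),
]

all_up = {"gw-1", "gw-2", "gw-3"}
assert active_route(routes, "10.0.2.0/24", all_up).gateway == "gw-1"
# When gw-1 stops being reachable, the edge falls back to the next preferred gateway.
assert active_route(routes, "10.0.2.0/24", all_up - {"gw-1"}).gateway == "gw-2"
```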

qmonnet · 2026-01-29T11:27:38Z

docs/user-guide/gateway-failover.md

+Because edge devices perform the fail-over when a preferred gateway ceases to be reachable, downtime on failure depends on how fast edge devices reckon the anomaly. To expedite the failure detection and minimize the volume of traffic blackholed, it is recommended to enable BFD on all gateway links.
+
+!!! note "Takeaways and configuration summary"
+    Redundancy works out of the box in an active-backup fashion. Customizing it generally requires:


Suggested change

Redundancy works out of the box in an active-backup fashion. Customizing it generally requires:

Redundancy works out of the box in an Active-Backup fashion. To customize redundancy's behavior:

Case: update for consistency with the rest of the doc.

qmonnet · 2026-01-29T11:29:56Z

docs/user-guide/gateway-failover.md

+    1. declaring gateway groups (**GatewayGroup** objects) depending on the number of gateways available.
+    2. Assigning gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups.
+    3. Mapping the VPC peerings to the groups defined.
+    4. Enabling BFD on gateway links.


Suggested change

1. declaring gateway groups (**GatewayGroup** objects) depending on the number of gateways available.

2. Assigning gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups.

3. Mapping the VPC peerings to the groups defined.

4. Enabling BFD on gateway links.

1. Declare gateway groups (**GatewayGroup** objects) depending on the number of gateways available.

2. Assign gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups.

3. Map the VPC peerings to the groups defined.

4. Enable BFD on gateway links.
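
The first three steps can be sketched in Python to show why giving each gateway a high priority in at least one group shares the load. This is a model under stated assumptions: only `group-1` appears in the PR, while `group-2`, the gateway names, the peering names and the `serving_gateways` helper are hypothetical.

```python
from typing import Optional

# Steps 1 and 2: two groups, each gateway has the highest priority in one of them.
groups = {
    "group-1": {"gw-1": 100, "gw-2": 50},
    "group-2": {"gw-1": 50, "gw-2": 100},  # hypothetical second group
}
# Step 3: each peering is mapped to one group.
peerings = {"peering-a": "group-1", "peering-b": "group-2"}


def serving_gateways(groups, peerings, operational: set[str]) -> dict[str, Optional[str]]:
    """Map each peering to the highest-priority operational member of its group."""
    result = {}
    for peering, group in peerings.items():
        members = {gw: prio for gw, prio in groups[group].items() if gw in operational}
        result[peering] = max(members, key=members.get) if members else None
    return result


# Both gateways up: traffic is shared, no gateway sits idle.
assert serving_gateways(groups, peerings, {"gw-1", "gw-2"}) == {
    "peering-a": "gw-1",
    "peering-b": "gw-2",
}
# gw-1 fails: its peering moves to gw-2, which still serves its own.
assert serving_gateways(groups, peerings, {"gw-2"}) == {
    "peering-a": "gw-2",
    "peering-b": "gw-2",
}
```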

qmonnet · 2026-01-29T11:30:10Z

docs/user-guide/gateway-failover.md

+
+Because edge devices perform the fail-over when a preferred gateway ceases to be reachable, downtime on failure depends on how fast edge devices reckon the anomaly. To expedite the failure detection and minimize the volume of traffic blackholed, it is recommended to enable BFD on all gateway links.
+
+!!! note "Takeaways and configuration summary"


I like these ❤️

Signed-off-by: Fredi Raspall <fredi@githedgehog.com>

Fredi-raspall requested a review from a team as a code owner January 28, 2026 22:55

Fredi-raspall requested a review from mrbojangles3 January 28, 2026 22:55

Fredi-raspall force-pushed the pr/fredi/gw_failover branch 2 times, most recently from 6947ea5 to 2e2879c Compare January 29, 2026 09:24

qmonnet requested a review from Copilot January 29, 2026 09:43

Copilot started reviewing on behalf of qmonnet January 29, 2026 09:44

Copilot AI reviewed Jan 29, 2026

qmonnet requested changes Jan 29, 2026

Fredi-raspall added 2 commits January 29, 2026 14:41

user-guide: add gateway-failover documentation

dfb449e

Signed-off-by: Fredi Raspall <fredi@githedgehog.com>

user-guide: reorg gateway-related doc pages

df0fc6c

Signed-off-by: Fredi Raspall <fredi@githedgehog.com>

Fredi-raspall force-pushed the pr/fredi/gw_failover branch from 2e2879c to df0fc6c Compare January 29, 2026 13:42

3 participants

github-actions bot commented Jan 28, 2026 •

edited

Loading