user-guide: add gateway-failover documentation #259
base: master
🚀 Deployed on https://preview-259--hedgehog-docs.netlify.app
6947ea5 to 2e2879c
Pull request overview
This PR extends the user guide to document gateway redundancy/fail-over behavior and integrates the new material into the navigation and existing gateway docs. It also slightly refines existing gateway-related titles to better reflect their scope.
Changes:
- Add a dedicated “Gateway fail-over and redundancy” user-guide page explaining gateway groups, traffic mapping, and fail-over behavior.
- Link the new page from the overview and the `.pages` navigation under a new “Gateway” section.
- Retitle the main gateway and gateway-add docs to “Gateway overview” and “Adding Gateways to the fabric” for clearer context.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/user-guide/overview.md | Adds a TOC entry pointing to the new gateway-failover documentation so users can discover redundancy guidance. |
| docs/user-guide/gateway.md | Renames the main heading to “Gateway overview” to clarify that this page introduces gateway concepts now complemented by a separate fail-over page. |
| docs/user-guide/gateway-failover.md | Introduces detailed documentation for gateway redundancy, gateway groups, traffic mapping, and fail-over behavior, including configuration snippets and design rationale. |
| docs/user-guide/gateway-add.md | Updates the title to “Adding Gateways to the fabric” to align with a more general multi-gateway deployment story. |
| docs/user-guide/.pages | Groups gateway-related docs under a “Gateway” nav section and includes the new fail-over page, improving navigation around gateway topics. |
docs/user-guide/gateway-failover.md
Outdated
| Since there always exists a *default* gateway group (containing all of the gateways), this Active-Backup behavior is (for any number of gateways) the default when no additional configuration is provided. | ||
|
|
||
| ### Customizing fail-over setups: traffic mapping to gateway groups | ||
| With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below. |
Copilot
AI
Jan 29, 2026
The phrase "VPC peerings can be specified a gatewayGroup" is ungrammatical; this should read something like "can be specified with a gatewayGroup" or "can specify a gatewayGroup" for clarity.
| With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below. | |
| With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified with a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below. |
docs/user-guide/gateway-failover.md
Outdated
| apiVersion: v1 | ||
| items: | ||
| - apiVersion: gateway.githedgehog.com/v1alpha1 | ||
| kind: GatewayGroup | ||
| metadata: | ||
| name: group-1 | ||
| namespace: default | ||
| spec: {} |
Copilot
AI
Jan 29, 2026
The GatewayGroup YAML example here uses a top-level apiVersion: v1 with an items: list but no kind: List, which is not a directly usable Kubernetes manifest; it would be clearer and more accurate either to show a single GatewayGroup object on its own, or to include the missing kind: List and associated fields so that the snippet matches real kubectl output.

    - apiVersion: v1
    - items:
    - - apiVersion: gateway.githedgehog.com/v1alpha1
    -   kind: GatewayGroup
    -   metadata:
    -     name: group-1
    -     namespace: default
    -   spec: {}
    + apiVersion: gateway.githedgehog.com/v1alpha1
    + kind: GatewayGroup
    + metadata:
    +   name: group-1
    +   namespace: default
    + spec: {}

docs/user-guide/gateway-failover.md
Outdated
| @@ -0,0 +1,106 @@ | |||
| # Gateway fail-over and redundancy | |||
| ## Overview | |||
| When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next. | |||
Copilot
AI
Jan 29, 2026
"HedgeHog Fabric" here is inconsistent with the rest of the docs (and even with "Hedgehog Fabric" a few lines below); to keep branding consistent, this should use the same capitalization as elsewhere ("Hedgehog Fabric").
| When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next. | |
| When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a Hedgehog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next. |
docs/user-guide/gateway-failover.md
Outdated
| With the previous setup, one of the two gateways remains idle, which can be sub-optimal and under-utilize resources. In order to overcome this, VPC peerings can be specified a **gatewayGroup** to indicate the name of the gateway group that should serve the traffic for that peering, as shown by the snippet below. | ||
|
|
||
| !!! info inline end | ||
| A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the the *default* group. This is why, without any configuration, the default behavior is Active-Backup. |
Copilot
AI
Jan 29, 2026
There is a duplicated word in this sentence ("assigns it the the default group"); one of the "the" tokens should be removed.
| A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the the *default* group. This is why, without any configuration, the default behavior is Active-Backup. | |
| A VPC peering always has a gatewayGroup: if not explicitly set, the system automatically assigns it the *default* group. This is why, without any configuration, the default behavior is Active-Backup. |
qmonnet
left a comment
Great document!
I'm usually picky with the style in the docs (Logan knows something about it), so I've got tons of nitpicks, but nothing major.
I'd also wrap the text on 80-character lines as I find it easier to diff and work with smaller lines, although I'm not sure we have a consensus about that.
One comment would be to remain careful with the number of admonitions (!!! note) in the document. It's good to have a few to insert visual pauses in long sections, but having too many may break the flow. You have quite a number of notes, and I think some of them could be regular paragraphs, which would help with overall readability.
docs/user-guide/gateway-failover.md
Outdated
| When VPC *peerings* are configured to use a gateway, the latter is responsible for the delivery of the traffic exchanged between the VPCs on each side of the peering, fabric-wide. Failures in a gateway, its interconnects, or neighboring nodes can cause connectivity interruptions. Much as link protection is accomplished with interconnect redundancy, gateway failures are mitigated by deploying additional gateways. When more than one gateway is present in a HedgeHog Fabric, flexible fail-over strategies are possible to minimize service interruptions, as explained next. | ||
|
|
||
| !!! note | ||
| When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well. |
| When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well. | |
| Gateway "failures" do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against all of these failures. |
docs/user-guide/gateway-failover.md
Outdated
| When we talk about gateway "failures", we do not necessarily refer to physical issues with the gateway device, its cabling or its software. Any condition that prevents a gateway from being reachable by fabric edges (such as multiple neighbor failures or their cabling) falls in this category. The fail-over strategy of the Hedgehog Fabric is designed to protect against those as well. | ||
|
|
||
| ### Gateway groups | ||
| Gateway fail-over strategies build on the concept of `gateway groups`. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members). |
| Gateway fail-over strategies build on the concept of `gateway groups`. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members). | |
| Gateway fail-over strategies build on the concept of *gateway groups*. A gateway group is a configurable, named set of gateways ranked by priority, such that the member gateway with highest priority is preferred over the rest, provided, of course, that it is operational. A gateway can be a member of one or more gateway groups, and there is no limit on the number of groups that can be defined or their sizes (number of members). |
This is not a literal coming from the code or from API definitions, so I'd just use emphasis rather than formatting it as a literal.
docs/user-guide/gateway-failover.md
Outdated
| All gateways in a fabric are members of, at least, one *default* group that always exists. | ||
|
|
||
| Declaring gateway groups is done by means of the *GatewayGroup* object. The following sample snippet shows the declaration of a group called **group-1**. |
| All gateways in a fabric are members of, at least, one *default* group that always exists. | |
| Declaring gateway groups is done by means of the *GatewayGroup* object. The following sample snippet shows the declaration of a group called **group-1**. | |
| All gateways in a fabric are members of, at least, one `default` group that always exists. | |
| Declaring gateway groups is done by means of the `GatewayGroup` object. The following sample snippet shows the declaration of a group called `group-1`. |
Conversely, these are keywords or names used in the YAML, so I'd format as literals.
Same comment applies to the rest of the document.

    spec:
      asn: 65534
      groups:
      - name: default # all gateways belong to this group

Does this mean this field is mandatory? If so, add it in the comment? (“mandatory - all gateways belong ...”)
docs/user-guide/gateway-failover.md
Outdated
| !!! note | ||
| The value of the priority specified by a gateway within a group has no significance in absolute terms. For instance, configuring two gateways as members of the same group with priorities 200 and 100 has the same effect as configuring them with priorities 29 and 3. | ||
|
|
||
| The priorities within a group not only allow indicating a preference, but also making sure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future. |
I don't think the “not only...” brings any useful info here, I'd discard that part.
| The priorities within a group not only allow indicating a preference, but also making sure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future. | |
| Priorities within a group ensure that VPC peering traffic is correctly processed: since gateways implement services that are stateful, only one gateway within a group should handle a flow at a given point. This restriction may be lifted in the future. |
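The priority semantics discussed in this thread (only the relative order matters, and a single gateway per group handles traffic at any time) can be illustrated with a minimal Python sketch. It is not the fabric's implementation: the function name, the gateway names `gw-1` and `gw-2`, and the assumption that a larger number means a higher priority are all hypothetical and chosen only for illustration.

```python
def preferred_gateway(priorities, operational):
    """Return the operational group member with the highest priority.

    priorities:  dict mapping gateway name -> priority within one group
                 (assumption: a larger number means a higher priority).
    operational: set of gateway names currently reachable.
    Returns None when no member of the group is operational.
    """
    candidates = [gw for gw in priorities if gw in operational]
    if not candidates:
        return None
    return max(candidates, key=lambda gw: priorities[gw])


both_up = {"gw-1", "gw-2"}

# Different absolute values, same relative order: same preferred gateway.
print(preferred_gateway({"gw-1": 200, "gw-2": 100}, both_up))  # gw-1
print(preferred_gateway({"gw-1": 29, "gw-2": 3}, both_up))     # gw-1

# When the preferred member is not operational, the next one takes over.
print(preferred_gateway({"gw-1": 29, "gw-2": 3}, {"gw-2"}))    # gw-2
```

The sketch returns exactly one gateway per group, which mirrors the constraint that stateful services require a single gateway to handle a flow at a given point.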
docs/user-guide/gateway-failover.md
Outdated
| !!! tip | ||
| Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows gracefully pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups. | ||
|
|
||
| ## Fail-over behind the scenes and recommendations |
| ## Fail-over behind the scenes and recommendations | |
| ## Fail-over under the hood and recommendations |
I think “behind the scene” would rather mean “in secret”, “without the user noticing”
docs/user-guide/gateway-failover.md
Outdated
| Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows gracefully pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups. | ||
|
|
||
| ## Fail-over behind the scenes and recommendations | ||
| The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (e.g. leaf switches) select the gateway that will handle each packet based on those. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway. |
| The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (e.g. leaf switches) select the gateway that will handle each packet based on those. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway. | |
| The gateway fail-over strategy in the Hedgehog Fabric is implemented in a distributed manner. Gateways announce VPC peering prefixes with the specified priorities, while edge nodes (such as leaf switches) select the gateway that handles each packet based on those prefixes and priorities. If a VPC peering refers to a group that has K members, the edge devices participating in the VPC will have K BGP routes (one per gateway) for each of the peering destinations. However, only one of those routes will be active at any point in time; the one advertised by (and pointing to) the preferred active gateway. |
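As a rough model of the paragraph above, the following Python sketch shows an edge device that holds K routes for a peering destination (one per gateway in the group) and keeps only one of them active. It is a simplification under stated assumptions, not the actual BGP best-path logic: the `Route` type, the prefix `10.0.2.0/24`, the gateway names, and the rule that a larger priority wins are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str    # peering destination (hypothetical value below)
    gateway: str   # gateway that advertised the route
    priority: int  # assumption: a larger number is preferred


def active_routes(routes, reachable):
    """Pick, per prefix, the route via the best reachable gateway.

    All K routes stay in `routes`; only the selection changes when a
    gateway stops being reachable.
    """
    best = {}
    for route in routes:
        if route.gateway not in reachable:
            continue  # routes via unreachable gateways cannot be active
        current = best.get(route.prefix)
        if current is None or route.priority > current.priority:
            best[route.prefix] = route
    return best


# A group with K = 2 members yields two routes for the same destination.
routes = [
    Route("10.0.2.0/24", "gw-1", 200),
    Route("10.0.2.0/24", "gw-2", 100),
]

print(active_routes(routes, {"gw-1", "gw-2"})["10.0.2.0/24"].gateway)  # gw-1
print(active_routes(routes, {"gw-2"})["10.0.2.0/24"].gateway)          # gw-2
```

In this model the fail-over is purely local to the edge device: nothing needs to be reconfigured for the backup route to become active once the preferred gateway is detected as unreachable.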
docs/user-guide/gateway-failover.md
Outdated
| Because edge devices perform the fail-over when a preferred gateway ceases to be reachable, downtime on failure depends on how fast edge devices reckon the anomaly. To expedite the failure detection and minimize the volume of traffic blackholed, it is recommended to enable BFD on all gateway links. | ||
|
|
||
| !!! note "Takeaways and configuration summary" | ||
| Redundancy works out of the box in an active-backup fashion. Customizing it generally requires: |
| Redundancy works out of the box in an active-backup fashion. Customizing it generally requires: | |
| Redundancy works out of the box in an Active-Backup fashion. To customize redundancy's behavior: |
Case: update for consistency with the rest of the doc.
| 1. declaring gateway groups (**GatewayGroup** objects) depending on the number of gateways available. | ||
| 2. Assigning gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups. | ||
| 3. Mapping the VPC peerings to the groups defined. | ||
| 4. Enabling BFD on gateway links. |
| 1. declaring gateway groups (**GatewayGroup** objects) depending on the number of gateways available. | |
| 2. Assigning gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups. | |
| 3. Mapping the VPC peerings to the groups defined. | |
| 4. Enabling BFD on gateway links. | |
| 1. Declare gateway groups (**GatewayGroup** objects) depending on the number of gateways available. | |
| 2. Assign gateways to the groups, with suitable priorities. For load sharing purposes, the recommendation is to assign a high priority to each gateway in at least one of the groups. | |
| 3. Map the VPC peerings to the groups defined. | |
| 4. Enable BFD on gateway links. |
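The load-sharing recommendation in step 2 (give each gateway a high priority in at least one group) can be sketched in Python as follows. This is an illustrative model only: `group-1` appears in the document, while `group-2`, the gateway names, the peering names, the priority values, and the larger-number-wins rule are assumptions.

```python
# Two groups with mirrored priorities, so each gateway is preferred in one.
groups = {
    "group-1": {"gw-1": 200, "gw-2": 100},
    "group-2": {"gw-1": 100, "gw-2": 200},
}

# Hypothetical peerings, each mapped to one gateway group.
peerings = {
    "vpc-1--vpc-2": "group-1",
    "vpc-3--vpc-4": "group-2",
}


def serving_gateways(groups, peerings, operational):
    """Map each peering to the gateway currently serving its traffic."""
    result = {}
    for peering, group in peerings.items():
        members = {
            gw: prio for gw, prio in groups[group].items() if gw in operational
        }
        # Highest-priority operational member, or None if the group is down.
        result[peering] = max(members, key=members.get) if members else None
    return result


# Both gateways up: traffic is shared, one peering per gateway.
print(serving_gateways(groups, peerings, {"gw-1", "gw-2"}))
# {'vpc-1--vpc-2': 'gw-1', 'vpc-3--vpc-4': 'gw-2'}

# gw-1 fails: both peerings fall back to gw-2.
print(serving_gateways(groups, peerings, {"gw-2"}))
# {'vpc-1--vpc-2': 'gw-2', 'vpc-3--vpc-4': 'gw-2'}
```

With a single default group, the same function would place every peering on one gateway, which corresponds to the out-of-the-box Active-Backup behavior.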
|
|
||
| Because edge devices perform the fail-over when a preferred gateway ceases to be reachable, downtime on failure depends on how fast edge devices reckon the anomaly. To expedite the failure detection and minimize the volume of traffic blackholed, it is recommended to enable BFD on all gateway links. | ||
|
|
||
| !!! note "Takeaways and configuration summary" |
I like these ❤️
Signed-off-by: Fredi Raspall <fredi@githedgehog.com>
2e2879c to df0fc6c
Closes: #248
Unsure if this closes #249