TQ: Support adding sleds via trust quorum #9650
This PR introduces two new external APIs to allow adding multiple sleds to a rack
at once and to query status about the ongoing operation. Both are
currently experimental and live under `/v1/trust-quorum`. They need to
be moved under `/system/hardware` like the original `sled-add` command.
They also need to be reworked to not report trust quorum specific
details if it can be avoided. Most of that should be in omdb for
debugging. I may add some of that support to this PR.
This PR also introduces a background task for driving the trust quorum
reconfiguration to completion. Reconfiguration proceeds in two steps:
the new external endpoint handler synchronously updates the DB, and the
background task then asynchronously tries to commit the operation.
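As a rough illustration of that two-step flow, here is a minimal in-memory sketch. Everything in it is hypothetical: `Config`, `start_reconfiguration`, and `background_task_tick` are made-up names, the "DB" is a plain struct, and the commit rule (`threshold + commit_crash_tolerance` prepared members) is an assumption that merely agrees with the numbers in the trace below, not something this PR states.

```rust
use std::collections::BTreeMap;

/// Per-member progress, mirroring the `state` strings in the trace.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemberState {
    Unacked,
    Prepared,
    Committed,
}

/// Overall progress of one reconfiguration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConfigState {
    Preparing,
    Committed,
}

/// Hypothetical in-memory stand-in for the trust quorum rows in the DB.
#[derive(Debug)]
struct Config {
    epoch: u64,
    state: ConfigState,
    threshold: usize,
    commit_crash_tolerance: usize,
    members: BTreeMap<String, MemberState>,
}

/// Step 1 (synchronous, in the endpoint handler): record the new
/// configuration at the next epoch with every member still unacked.
fn start_reconfiguration(
    last_committed_epoch: u64,
    members: &[&str],
    threshold: usize,
    commit_crash_tolerance: usize,
) -> Config {
    Config {
        epoch: last_committed_epoch + 1,
        state: ConfigState::Preparing,
        threshold,
        commit_crash_tolerance,
        members: members
            .iter()
            .map(|m| (m.to_string(), MemberState::Unacked))
            .collect(),
    }
}

/// Step 2 (asynchronous, one background task activation): commit once
/// enough members have prepared. Returns true if this call committed.
fn background_task_tick(cfg: &mut Config) -> bool {
    if cfg.state != ConfigState::Preparing {
        return false;
    }
    let prepared = cfg
        .members
        .values()
        .filter(|s| **s == MemberState::Prepared)
        .count();
    // ASSUMPTION: the commit rule is threshold + commit_crash_tolerance.
    if prepared < cfg.threshold + cfg.commit_crash_tolerance {
        return false;
    }
    cfg.state = ConfigState::Committed;
    for s in cfg.members.values_mut() {
        if *s == MemberState::Prepared {
            *s = MemberState::Committed;
        }
    }
    true
}

fn main() {
    let sleds = [
        "PPP-PPPPPPP:00000000000",
        "PPP-PPPPPPP:00000000001",
        "PPP-PPPPPPP:00000000002",
        "PPP-PPPPPPP:00000000003",
    ];
    let mut cfg = start_reconfiguration(1, &sleds, 3, 1);
    // Nobody has prepared yet, so the first activation is a no-op.
    assert!(!background_task_tick(&mut cfg));
    // Pretend every member acknowledged the prepare.
    for s in cfg.members.values_mut() {
        *s = MemberState::Prepared;
    }
    assert!(background_task_tick(&mut cfg));
    println!("epoch {} is {:?}", cfg.epoch, cfg.state);
}
```

The point of the split is that the handler only has to perform a quick DB write, while retries and waiting on slow members are left to repeated activations of the task.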
I tested this on a4x2 and it works as expected. See the trace from the original external API test below:
```
➜ oxide.rs git:(main) ✗ echo '{"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28", "sled_ids": [{"part": "PPP-PPPPPPP","serial": "00000000002"}]}' | target/debug/oxide --profile recovery api /v1/trust-quorum/new-members --method POST --input -
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api /v1/trust-quorum/config/latest/0dbef452-a6dd-4831-bbdc-769ea3353f28
{
"abort_reason": null,
"commit_crash_tolerance": 1,
"coordinator": {
"part_number": "PPP-PPPPPPP",
"serial_number": "00000000003"
},
"encrypted_rack_secrets": null,
"epoch": 2,
"last_committed_epoch": 1,
"members": {
"PPP-PPPPPPP:00000000000": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
},
"PPP-PPPPPPP:00000000001": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
},
"PPP-PPPPPPP:00000000002": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
},
"PPP-PPPPPPP:00000000003": {
"share_digest": null,
"state": "unacked",
"time_committed": null,
"time_prepared": null
}
},
"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28",
"state": "preparing",
"threshold": 3,
"time_aborted": null,
"time_committed": null,
"time_committing": null,
"time_created": "2026-01-14T21:32:18.780136Z"
}
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api /v1/trust-quorum/config/latest/0dbef452-a6dd-4831-bbdc-769ea3353f28
{
"abort_reason": null,
"commit_crash_tolerance": 1,
"coordinator": {
"part_number": "PPP-PPPPPPP",
"serial_number": "00000000003"
},
"encrypted_rack_secrets": null,
"epoch": 2,
"last_committed_epoch": 1,
"members": {
"PPP-PPPPPPP:00000000000": {
"share_digest": "fcfb09128c84d82cc81b200c6c682510f63160a4417856f4041b1886445e8b14",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.826622Z"
},
"PPP-PPPPPPP:00000000001": {
"share_digest": "d8cad02bd3bccd08109a79e3bf6d8dab0d460a0ba879bf42887dc0fc8d855786",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.848235Z"
},
"PPP-PPPPPPP:00000000002": {
"share_digest": "dd57ad8e271734fabfe97d6180d6da3e5c3805e17dacf58e0f2a6d5ed7f1242b",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.806644Z"
},
"PPP-PPPPPPP:00000000003": {
"share_digest": "6b27327ca49976ccca83972e6578ef195c99489e62811e8d0a0cb061fca9c0c4",
"state": "prepared",
"time_committed": null,
"time_prepared": "2026-01-14T21:32:55.837154Z"
}
},
"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28",
"state": "preparing",
"threshold": 3,
"time_aborted": null,
"time_committed": null,
"time_committing": null,
"time_created": "2026-01-14T21:32:18.780136Z"
}
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api /v1/trust-quorum/config/latest/0dbef452-a6dd-4831-bbdc-769ea3353f28
{
"abort_reason": null,
"commit_crash_tolerance": 1,
"coordinator": {
"part_number": "PPP-PPPPPPP",
"serial_number": "00000000003"
},
"encrypted_rack_secrets": {
"data": "53de7731deec3f298a7f5067e256a63bb2869a91c9710d9b23dbf3d261d1b730039d9cb11b543c14906ff77cd409d32953959e9ff8933858",
"salt": "ec609ed5ff7aee94e2e88ad94af56e0cbb8a66a683294005c7888f60a627956a"
},
"epoch": 2,
"last_committed_epoch": 1,
"members": {
"PPP-PPPPPPP:00000000000": {
"share_digest": "fcfb09128c84d82cc81b200c6c682510f63160a4417856f4041b1886445e8b14",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.826622Z"
},
"PPP-PPPPPPP:00000000001": {
"share_digest": "d8cad02bd3bccd08109a79e3bf6d8dab0d460a0ba879bf42887dc0fc8d855786",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.848235Z"
},
"PPP-PPPPPPP:00000000002": {
"share_digest": "dd57ad8e271734fabfe97d6180d6da3e5c3805e17dacf58e0f2a6d5ed7f1242b",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.806644Z"
},
"PPP-PPPPPPP:00000000003": {
"share_digest": "6b27327ca49976ccca83972e6578ef195c99489e62811e8d0a0cb061fca9c0c4",
"state": "committed",
"time_committed": "2026-01-14T21:33:03.864617Z",
"time_prepared": "2026-01-14T21:32:55.837154Z"
}
},
"rack_id": "0dbef452-a6dd-4831-bbdc-769ea3353f28",
"state": "committed",
"threshold": 3,
"time_aborted": null,
"time_committed": "2026-01-14T21:33:04.652543Z",
"time_committing": "2026-01-14T21:32:55.861158Z",
"time_created": "2026-01-14T21:32:18.780136Z"
}
➜ oxide.rs git:(main) ✗
```
04173b6 to
2d8fc67
Compare
| /// testing. | ||
| pub const TRUST_QUORUM_INTEGRATION_ENABLED: bool = false; | ||
| //pub const TRUST_QUORUM_INTEGRATION_ENABLED: bool = false; | ||
| pub const TRUST_QUORUM_INTEGRATION_ENABLED: bool = true; |
Reset this before merge.
jgallagher
left a comment
Looks good! Just a bunch of nits and small suggestions
| async fn trust_quorum_get_latest_config( | ||
| rqctx: RequestContext<Self::Context>, | ||
| path_params: Path<params::RackPath>, | ||
| ) -> Result<HttpResponseOk<Option<RackMembershipChange>>, HttpError>; |
Very happy to defer to others with more opinions on the external API, but a few questions based mostly on the update sync this week:
- Will the `trust_quorum_...` bits of these names leak out into the OpenAPI spec? (I think so, because we default the operation ID to match the method name?)
- I'm a little surprised this is a "get latest" and not "get the result of an operation I started", but maybe I misunderstood? I thought we wanted something like "add sled is async, and returns an identifier that can be used to check the progress of the operation".
- Are all the fields of `RackMembershipChange` meaningful to an operator? (I'm mostly squinting at `epoch`, but maybe that's closely related to the previous bullet.)
Ah, so I left the method names the same, but changed the endpoints. I did not realize that the method names leaked.
I really like the return a token / identifier mechanism in general, and it could possibly work here. However there are two wrinkles:
- The initial request could time out, and the trust quorum reconfiguration could be in progress or even committed by the time the user polled. What would they poll with if they didn't get back the token? I suppose they could ask for a token for the latest configuration, but that brings up the next point.
- There's an inherent TOCTTOU here, where the user can see that their configuration committed but another user could have started a later one. Maybe that's not a problem and the user is only concerned about their own and can always ask for a token back.
While writing this, I think you sold me on the token idea. The user will submit a request and get back the epoch as the token for the configuration. Then they will poll that epoch. We will also provide another api to get the latest epoch.
I think this all takes me to your last question. In this case the epoch is the identifier to know which configuration a user is dealing with. I could change this to version or generation in the API and map that to an Epoch, but I think that will generally make things more confusing for support. I personally just want to call it epoch, and not try to map the same concept to a different word :)
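To make the epoch-as-token idea concrete, here is a small sketch of the three operations described above: submit and get an epoch back, poll that epoch, and ask for the latest epoch. `MembershipStore` and its method names are hypothetical, the store is in memory, and starting at epoch 1 is an assumption made only for the example.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChangeState {
    Preparing,
    Committed,
}

/// Hypothetical in-memory store keyed by epoch; the real state lives in the DB.
#[derive(Default)]
struct MembershipStore {
    changes: BTreeMap<u64, ChangeState>,
}

impl MembershipStore {
    /// Start a membership change and return its epoch as the polling token.
    fn add_sleds(&mut self) -> u64 {
        let epoch = self.latest_epoch().map_or(1, |e| e + 1);
        self.changes.insert(epoch, ChangeState::Preparing);
        epoch
    }

    /// Poll one specific change; `None` means no such epoch exists.
    fn status(&self, epoch: u64) -> Option<ChangeState> {
        self.changes.get(&epoch).copied()
    }

    /// Recover the most recent token, e.g. after the initial request timed out.
    fn latest_epoch(&self) -> Option<u64> {
        self.changes.keys().next_back().copied()
    }

    /// Stand-in for the background task finishing a change.
    fn commit(&mut self, epoch: u64) {
        if let Some(state) = self.changes.get_mut(&epoch) {
            *state = ChangeState::Committed;
        }
    }
}

fn main() {
    let mut store = MembershipStore::default();
    let mine = store.add_sleds();
    // Lost response: the caller can still recover the token.
    assert_eq!(store.latest_epoch(), Some(mine));
    store.commit(mine);
    // Another user starts a later change; polling by epoch is unaffected.
    let theirs = store.add_sleds();
    println!(
        "mine={mine}: {:?}, theirs={theirs}: {:?}",
        store.status(mine),
        store.status(theirs)
    );
}
```

Polling by epoch sidesteps the TOCTTOU concern for a caller who only cares about their own change, since a later configuration never alters what an earlier epoch reports.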
Fixed up in 9832d33
Thanks for the comprehensive review @jgallagher. I think I fixed up everything. Let me know if you see anything else!
jgallagher
left a comment
Changes LGTM, just a couple minor style suggestions I didn't notice the first time around.
Happy to approve with a couple caveats:
- Your comment about resetting `TRUST_QUORUM_INTEGRATION_ENABLED` still needs to be applied
- Would strongly prefer an external-API-focused set of eyes to review those bits (maybe @ahl?)
| rqctx: RequestContext<Self::Context>, | ||
| path_params: Path<params::RackPath>, | ||
| req: TypedBody<params::AddSledsRequest>, | ||
| ) -> Result<HttpResponseOk<Epoch>, HttpError>; |
Do the type names leak out into the API doc? If so, maybe RackMembershipEpoch would be more clear? If not, ignore this.
No leaking as far as I can tell.
"properties": {
"epoch": {
"description": "The generation / version of the configuration",
"type": "integer",
"format": "uint64",
"minimum": 0
},
Yeah--in this case the type is #[serde(transparent)] so it just looks like a number. However, @andrewjstone, I think you're looking at the wrong place in the OpenAPI output:
"responses": {
"200": {
"description": "successful operation",
"content": {
"application/json": {
"schema": {
"title": "uint64",
"type": "integer",
"format": "uint64",
"minimum": 0
}
}
}
It is at least a little weird (for us) to be returning a literal integer rather than, say, an object with a field that is an integer.
Tested in a4x2 with all the latest changes. I was even able to catch the bg task doing something in OMDB. All that remains is for someone to take a look over the external API.
ahl
left a comment
I think I've added a bunch of quibbling comments that reflect a lack of my own understanding about how this is intended to be exposed to and used by customers. If there's something I should read or if you have some time to talk, I'd love to understand this better.
| /// Add new sleds to rack cluster | ||
| /// |
| /// Add new sleds to rack cluster | |
| /// | |
| /// Add new sleds to rack cluster |
Also: I'm not clear on how this would/could/might work in a regional cluster (i.e. single pool of allocatable resources) multiple rack environment. Not that this needs to be solved here, but have you considered how it might be extended? Would we imagine a "cluster" is a "rack cluster" or a "region cluster" or ... something else?
Most tangibly: is there any doc or name change that would be slightly more future-proofed or are we close enough?
| async fn rack_membership_config( | ||
| rqctx: RequestContext<Self::Context>, | ||
| path_params: Path<params::RackMembershipConfigPathParams>, | ||
| ) -> Result<HttpResponseOk<Option<RackMembershipChange>>, HttpError>; |
This returns an Option which is... a little weird, but maybe ok. What does a null return indicate to the user? Would an error response be more comprehensible?
| #[derive(Deserialize, JsonSchema)] | ||
| pub struct RackMembershipConfigPathParams { | ||
| pub rack_id: Uuid, | ||
| pub epoch: Epoch, |
Where would this input come from? I'm not clear on how rack_membership_config vs. rack_membership_config_latest would be used.
| impl From<TrustQuorumConfig> for RackMembershipChange { | ||
| fn from(value: TrustQuorumConfig) -> Self { | ||
| // `Unacked` means that a member has not received and acked a `Prepare` | ||
| // yet. `Prepared` means that a member has acknolwedged the prepare but |
| // yet. `Prepared` means that a member has acknolwedged the prepare but | |
| // yet. `Prepared` means that a member has acknowledged the prepare but |
| pub state: RackMembershipChangeState, | ||
| /// All members of the rack cluster for this epoch | ||
| pub members: BTreeSet<BaseboardId>, | ||
| /// All members which have not committed to the membership change |
| /// All members which have not committed to the membership change | |
| /// All members that have not committed to the membership change |
I'm not clear on what this means--is there something approximating customer-facing docs I could read that might help me understand the nouns and verbs here?
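One way to read the field under discussion is as a derived set: every member whose per-member state is not yet `committed`. The sketch below shows such a conversion under that reading. The struct shapes are trimmed-down guesses, `BaseboardId` is reduced to a `String`, and the field name `unacknowledged_members` is hypothetical; only the `From<TrustQuorumConfig> for RackMembershipChange` direction and the `members: BTreeSet<BaseboardId>` field come from the diff.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified stand-in for the real `BaseboardId` ("part:serial" in the trace).
type BaseboardId = String;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemberState {
    Unacked,
    Prepared,
    Committed,
}

/// Hypothetical, trimmed-down internal view.
struct TrustQuorumConfig {
    epoch: u64,
    members: BTreeMap<BaseboardId, MemberState>,
}

/// Hypothetical, trimmed-down operator-facing view.
#[derive(Debug)]
struct RackMembershipChange {
    epoch: u64,
    members: BTreeSet<BaseboardId>,
    unacknowledged_members: BTreeSet<BaseboardId>,
}

impl From<TrustQuorumConfig> for RackMembershipChange {
    fn from(value: TrustQuorumConfig) -> Self {
        // Collapse `Unacked` and `Prepared` into one "not yet committed" set.
        let unacknowledged_members: BTreeSet<BaseboardId> = value
            .members
            .iter()
            .filter(|(_, state)| **state != MemberState::Committed)
            .map(|(id, _)| id.clone())
            .collect();
        RackMembershipChange {
            epoch: value.epoch,
            members: value.members.into_keys().collect(),
            unacknowledged_members,
        }
    }
}

fn main() {
    // Made-up mix of states, chosen only to exercise each variant.
    let members = BTreeMap::from([
        ("PPP-PPPPPPP:00000000000".to_string(), MemberState::Committed),
        ("PPP-PPPPPPP:00000000001".to_string(), MemberState::Prepared),
        ("PPP-PPPPPPP:00000000002".to_string(), MemberState::Unacked),
    ]);
    let change = RackMembershipChange::from(TrustQuorumConfig { epoch: 2, members });
    println!("{change:#?}");
}
```

Collapsing the per-member states this way keeps share digests and prepare timestamps out of the operator-facing type, which matches the stated goal of leaving trust-quorum-specific detail to omdb.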
| Aborted, | ||
| } | ||
|
|
||
| /// Status of last membership change from adding or removinig sleds |
| /// Status of last membership change from adding or removinig sleds | |
| /// Status of last membership change from adding or removing sleds |
| tags = ["experimental"], | ||
| versions = VERSION_TRUST_QUORUM_ADD_SLEDS_AND_GET_LATEST_CONFIG.. | ||
| }] | ||
| async fn rack_membership_config( |
This operation name seems incongruous with the doc comment above. "config" to me says "show me the state of the world". "retrieve the change" says to me "show some delta"
| let config = | ||
| nexus.datastore().tq_get_config(&opctx, rack_id, epoch).await?; |
It might make more sense to return a 4xx-class error for a `None` value of `config`.
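A minimal sketch of that suggestion, assuming a 404 is the appropriate 4xx status: map the `Option` to a `Result` before building the response. `HttpError` here is a hypothetical stand-in struct, not the real framework type, and `config_or_not_found` is a made-up helper name.

```rust
/// Hypothetical stand-in for the HTTP framework's error type.
#[derive(Debug, PartialEq)]
struct HttpError {
    status: u16,
    message: String,
}

/// Turn a missing config into a 404 instead of a `null` body.
fn config_or_not_found<T>(
    config: Option<T>,
    rack_id: &str,
    epoch: u64,
) -> Result<T, HttpError> {
    config.ok_or_else(|| HttpError {
        status: 404,
        message: format!("no membership change for rack {rack_id} at epoch {epoch}"),
    })
}

fn main() {
    let rack = "0dbef452-a6dd-4831-bbdc-769ea3353f28";
    let found = config_or_not_found(Some("committed"), rack, 2);
    let missing: Result<&str, HttpError> = config_or_not_found(None, rack, 7);
    println!("{found:?}");
    println!("{missing:?}");
}
```

With this shape the success type no longer needs to be an `Option`, so the generated schema would describe a plain object rather than a nullable one.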
This PR introduces two new external APIs to allow adding multiple sleds to a rack at once and to query status about the ongoing operation. It also adds an omdb command for more detailed status. Much more omdb to come in the near future.