Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: properly support the PXE and ISO machines in the secure tokens flow #974

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Unix4ever
Copy link
Member

The unique token flow was reworked to support machines running from PXE and ISO.

As they do not support META persistence, Omni doesn't enforce secure tokens for them.
But to distinguish machines and make the UUID conflict resolution to work, Omni now calculates the node fingerprints out of the mac addresses of all physical interfaces on the node.

So now each unique token consists of two parts:

  • fingerprint.
  • a random string.

Omni detects Talos installation on the machine in the following way:

  • check if the pending machine status exists and it detected the system disk.
  • overwrite the previous check if the existing link was labeled with the Talos being installed.
  • lastly if the MachineStatus exists, overwrite all checks with the installed label from it (ensures bare-metal provider workflow which goes to installed to not installed and PXE booted).

Then when a machine joins Omni with some token, Omni checks if the random part is equal. If it is equal, the machine is immediately accepted.

If the random part is different and the fingerprint matches:

  • if Talos is installed - reject the machine and log the warning in the logs.
  • if Talos is not installed - replace the existing link with the new one (only if the request has a valid join token).

Then if nothing matches, the UUID conflict resolution kicks in. Provisioner creates a PendingMachine which is marked with UUID conflict label and PendingMachineStatus controller generates a random UUID for the node.

Fixes: #944

@Unix4ever Unix4ever added the integration/e2e Triggers all e2e tests for Omni label Feb 27, 2025
The unique token flow was reworked to support machines running from PXE
and ISO.

As they do not support META persistence, Omni doesn't enforce secure
tokens for them.
But to distinguish machines and make the UUID conflict resolution to work,
Omni now calculates the node fingerprints out of the mac addresses of
all physical interfaces on the node.

So now each unique token consists of two parts:

- fingerprint.
- a random string.

Omni detects Talos installation on the machine in the following way:

- check if the pending machine status exists and it detected the system
  disk.
- overwrite the previous check if the existing link was labeled with the
  Talos being installed.
- lastly if the `MachineStatus` exists, overwrite all checks with the
  installed label from it (ensures bare-metal provider workflow which
  goes to installed to not installed and PXE booted).

Then when a machine joins Omni with some token, Omni checks if the
random part is equal. If it is equal, the machine is immediately
accepted.

If the random part is different and the fingerprint matches:
- if Talos is installed - reject the machine and log the warning in the
  logs.
- if Talos is not installed - replace the existing link with the new one
  (only if the request has a valid join token).

Then if nothing matches, the UUID conflict resolution kicks in.
Provisioner creates a `PendingMachine` which is marked with UUID
conflict label and `PendingMachineStatus` controller generates a random
UUID for the node.

Signed-off-by: Artem Chernyshev <[email protected]>
@Unix4ever Unix4ever force-pushed the secure-tokens-pxe-iso branch from ca8362a to f77cd97 Compare February 28, 2025 09:21
hasValidNodeUniqueToken bool
nodeUniqueTokensEnabled bool
forceValidNodeUniqueToken bool
supportsSecureJoinTokens bool
}

// isAuthorized returns true if the request has valid creds for at least one of the auth flows.
func (pc *provisionContext) isAuthorized() bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe can split to smaller logical parts related to each action we do

}
}

machineStatus, err := safe.ReaderGetByID[*system.ResourceLabels[*omni.MachineStatus]](ctx, h.state, req.NodeUuid)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use InfraMachineStatus last wipe time?

Change the infra machine status resource to have the last wiped token value.
Populate it this way Link -> InfraMachine -> InfraMachineStatus.

// - has matching fingerprint and Talos is not installed.
//
//nolint:gocognit
func ensureResource[T res](ctx context.Context, logger *zap.Logger,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to split it to parts

Comment on lines +294 to +298
if linkToken.IsSameFingerprint(provisionContext.requestNodeUniqueToken) {
logger.Warn("machine connection rejected: the machine has the correct fingerprint, but the random token part doesn't match")

return nil, status.Error(codes.PermissionDenied, "unauthorized")
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can move this part inside the ensureResource method probably.

continue
}

macAddresses = append(macAddresses, string(link.TypedSpec().PermanentAddr))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be empty, if not set use HardwareAddr.

Comment on lines +112 to +128
macAddresses := make([]string, 0, links.Len())

for link := range links.All() {
if !link.TypedSpec().Physical() {
continue
}

macAddresses = append(macAddresses, string(link.TypedSpec().PermanentAddr))
}

slices.Sort(macAddresses)

hash := sha256.New()

for _, addr := range macAddresses {
hash.Write([]byte(addr))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
macAddresses := make([]string, 0, links.Len())
for link := range links.All() {
if !link.TypedSpec().Physical() {
continue
}
macAddresses = append(macAddresses, string(link.TypedSpec().PermanentAddr))
}
slices.Sort(macAddresses)
hash := sha256.New()
for _, addr := range macAddresses {
hash.Write([]byte(addr))
}
macAddresses := make([][]byte, 0, links.Len())
for link := range links.All() {
if !link.TypedSpec().Physical() {
continue
}
macAddresses = append(macAddresses, link.TypedSpec().PermanentAddr)
}
slices.SortFunc(macAddresses. bytes.Compare)
hash := sha256.New()
for _, addr := range macAddresses {
hash.Write(addr)
}

pointer.SafeDeref(req.TalosVersion),
)
}

return h.updateLink(ctx, provisionContext)
return h.provision(ctx, provisionContext)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanitize errors returned here.

@@ -198,7 +199,7 @@ func HandleInput[T generic.ResourceWithRD, S generic.ResourceWithRD](ctx context

// GetTalosClient for the machine id.
// Automatically pick secure or insecure client.
func GetTalosClient[T generic.ResourceWithRD](ctx context.Context, r controller.Reader, address string, machine T) (*client.Client, error) {
func GetTalosClient[T generic.ResourceWithRD](ctx context.Context, r controller.Reader, address string, machineResource T) (*client.Client, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func GetTalosClient[T generic.ResourceWithRD](ctx context.Context, r controller.Reader, address string, machineResource T) (*client.Client, error) {
func GetTalosClient[T interface {
*V
generic.ResourceWithRD
}, V any](ctx context.Context, r controller.Reader, address string, machineResource T) (*client.Client, error) {

@@ -212,11 +213,20 @@ func GetTalosClient[T generic.ResourceWithRD](ctx context.Context, r controller.
)...)
}

if reflect.ValueOf(machine).IsNil() {
if reflect.ValueOf(machineResource).IsNil() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if reflect.ValueOf(machineResource).IsNil() {
if machineResource == nil {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration/e2e Triggers all e2e tests for Omni status/ok-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support PXE booted machine and machines running from the ISO
3 participants