Continuity of Service #43

Open
alexlovelltroy opened this issue Aug 23, 2024 · 0 comments
Labels
Partner Objective A broadly scoped objective that is important to a partner


alexlovelltroy commented Aug 23, 2024

High Availability or Continuity of Service

HPC system management services need to be available and well supported to meet the expectations of HPC users. Every minute that a system is unavailable for science, whether because of scheduled or unscheduled maintenance, reduces the availability and efficiency of the system as a whole.

The purpose of this item is to describe how OpenCHAMI-based systems can remain available for science at all times, even during unexpected maintenance activities. We also need to describe the tradeoffs involved in degraded operation.

Example: A service like cloud-init is only needed during node boots. If boot activities are rare and not urgent, a site may choose not to invest the resources necessary for a highly available cloud-init service. Instead, they may choose to invest in guaranteeing that a known-good cloud-init service can be started in less than 20 seconds.
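A recovery-time guarantee like the one in this example can be checked mechanically. The following is a minimal sketch, not part of OpenCHAMI: `wait_until_ready` and its `probe` callback are hypothetical names, and the 20-second default is taken from the example above. The clock and sleep functions are injectable so the check can be exercised without waiting in real time.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    budget_seconds: float = 20.0,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True and return the elapsed seconds.

    Raises TimeoutError if the service is not ready within `budget_seconds`.
    `probe` is a hypothetical readiness check supplied by the site.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= budget_seconds:
            raise TimeoutError(f"service not ready within {budget_seconds} s")
        sleep(poll_interval)
```

In practice, a site would start the known-good service and pass a `probe` that queries whatever readiness signal the service exposes. That signal is an assumption here and is not specified by this issue.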

High Availability: Traditional Approaches

There are several traditional routes to achieving continuity of service for the overall system.

  1. Heartbeat-style failover, which ensures that a second system can quickly take over all functions if the first system fails.

  2. Distributed quorum services, which are the norm for cloud architectures. As long as enough servers (typically a majority) agree on the overall state of the cluster, any of them may provide services to the compute nodes. Kubernetes uses a distributed quorum control plane.

NB: CSM uses Kubernetes extensively to keep the control plane operating continuously. This has led to the concern that HPC administrators must first be Kubernetes administrators in order to manage CSM-based control planes.
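To make the two traditional approaches listed above concrete, here is a minimal sketch of the decision each one makes. It is illustrative only: `has_quorum` and `HeartbeatMonitor` are hypothetical names, not OpenCHAMI or Kubernetes APIs, and the quorum rule assumes a strict majority.

```python
from dataclasses import dataclass


def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """Return True when a strict majority of the cluster is reachable."""
    return reachable_members > cluster_size // 2


@dataclass
class HeartbeatMonitor:
    """Standby-side view of the heartbeats sent by the primary."""

    timeout_seconds: float
    last_heartbeat: float  # timestamp of the most recent heartbeat

    def record_heartbeat(self, now: float) -> None:
        """Note that the primary was alive at time `now`."""
        self.last_heartbeat = now

    def should_take_over(self, now: float) -> bool:
        """Return True once the primary has been silent longer than the timeout."""
        return now - self.last_heartbeat > self.timeout_seconds
```

With a strict-majority rule, a three-member cluster tolerates the loss of one member, and a four-member cluster also tolerates only one, which is why odd cluster sizes are common. A real failover setup would also need some form of fencing so that a slow but still running primary cannot serve alongside the promoted standby. That part is omitted from this sketch.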

The microservice architecture of OpenCHAMI explicitly avoids assuming a particular deployment methodology. Sites have successfully run OpenCHAMI services as containers under Podman and Docker and as Kubernetes pods. In addition, the services are independently tested as standalone binaries, so it is easy to imagine running these binaries as systemd units.

Regardless of the deployment characteristics, the project needs to describe highly available operations. This objective seeks to identify the requirements for high availability and to produce a reference implementation that meets them. Services may need to change in order to achieve high availability. However, flexibility has been a cornerstone of the project: high availability must remain optional, and existing deployment options must remain available.

References

@alexlovelltroy alexlovelltroy converted this from a draft issue Aug 23, 2024
@alexlovelltroy alexlovelltroy added the Partner Objective A broadly scoped objective that is important to a partner label Aug 23, 2024
@alexlovelltroy alexlovelltroy self-assigned this Aug 26, 2024
@alexlovelltroy alexlovelltroy moved this to In Progress in Roadmap Project Aug 26, 2024