Continuity of Service #43

alexlovelltroy · 2024-08-23T15:54:38Z

High Availability or Continuity of Service

HPC System Management services need to be available and well supported in order to meet the expectations of HPC users. Every minute that a system is unavailable for science, whether because of scheduled or unscheduled maintenance, reduces the availability and efficiency of the system as a whole.

The purpose of this item is to describe how OpenCHAMI-based systems can remain available for science at all times, even when dealing with unexpected maintenance activities. We also need to describe the tradeoffs necessary for degraded operation.

Example: A service like cloud-init is only needed during node boots. If boot activities are rare and not urgent, a site may choose not to invest the resources necessary for a highly available cloud-init service. Instead, they may choose to invest in guaranteeing that a known-good cloud-init service can be started in less than 20 seconds.
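The "known-good service in less than 20 seconds" tradeoff can be checked with a small timing harness. The following is a minimal sketch, not part of OpenCHAMI: `timed_start`, its parameters, and the `is_ready` readiness probe are hypothetical names chosen for illustration, and a real probe would more likely be an HTTP health check against the service.

```python
import subprocess
import sys
import time


def timed_start(cmd, is_ready, budget_s=20.0, poll_s=0.1):
    """Start `cmd` and wait until `is_ready(proc)` returns True.

    Returns (proc, elapsed_seconds, within_budget). The process is left
    running once it reports ready. If the budget runs out before the
    process is ready, it is terminated and within_budget is False.
    """
    started = time.monotonic()
    proc = subprocess.Popen(cmd)
    while True:
        if proc.poll() is not None:
            # The process exited before it ever became ready.
            return proc, time.monotonic() - started, False
        ready = is_ready(proc)
        elapsed = time.monotonic() - started
        if ready:
            return proc, elapsed, elapsed <= budget_s
        if elapsed > budget_s:
            proc.terminate()
            proc.wait()
            return proc, elapsed, False
        time.sleep(poll_s)


if __name__ == "__main__":
    # Stand-in for a real service: a child Python process that just sleeps.
    cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
    proc, elapsed, ok = timed_start(cmd, is_ready=lambda p: True)
    print(f"ready={ok} after {elapsed:.2f}s")
    proc.terminate()
    proc.wait()
```

Running a check like this regularly, rather than only during an outage, is one way a site could keep the restart guarantee honest without paying for a fully redundant service.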

High Availability: Traditional Approaches

There are several traditional routes to achieving continuity of service for the overall system.

Heartbeat-style failover ensures that a second system can quickly take over all functions if the first system fails.
Distributed quorum services are the norm for cloud architectures. As long as enough servers agree on the overall state of the cluster, any of them may provide services to the compute nodes. Kubernetes uses a distributed quorum control plane.
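Both approaches reduce to small decision rules, sketched below under stated assumptions. `HeartbeatMonitor`, `quorum_size`, and `tolerated_failures` are hypothetical names for illustration, and "enough servers" is assumed to mean a strict majority, as in Raft-style consensus. Production systems also need fencing and split-brain protection, which this sketch leaves out.

```python
import time


def quorum_size(cluster_size):
    """Smallest number of members that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many members can fail while a majority still remains."""
    return cluster_size - quorum_size(cluster_size)


class HeartbeatMonitor:
    """Tracks the primary's last heartbeat on behalf of a standby."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = clock()

    def beat(self):
        """Record that a heartbeat from the primary just arrived."""
        self._last_seen = self._clock()

    def primary_failed(self):
        """True once no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_seen > self.timeout_s


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under the majority assumption, a four-member cluster tolerates no more failures than a three-member one, which is why odd cluster sizes are the usual choice.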

NB: CSM uses Kubernetes extensively to maintain continuous operation of the control plane. This has led to the concern that HPC administrators must first become Kubernetes administrators in order to manage CSM-based control planes.

The microservice architecture of OpenCHAMI explicitly avoids assuming a particular deployment methodology. Sites have successfully run OpenCHAMI services as containers under Podman and Docker, and as Kubernetes pods. In addition, the services are independently tested as standalone binaries. It is easy to imagine these binaries running as systemd units.

Regardless of the deployment characteristics, the project needs to describe highly available operations. This objective seeks to identify the requirements for high availability and a reference implementation that meets them. Services may need changes in order to achieve high availability. However, flexibility has been a cornerstone of the project: high availability must remain optional, and existing deployment characteristics must remain available.


alexlovelltroy added this to Roadmap Project Aug 23, 2024

alexlovelltroy converted this from a draft issue Aug 23, 2024

alexlovelltroy added the Partner Objective label (a broadly scoped objective that is important to a partner) Aug 23, 2024

alexlovelltroy self-assigned this Aug 26, 2024

alexlovelltroy moved this to In Progress in Roadmap Project Aug 26, 2024
