High Availability or Continuity of Service
HPC system management services must be available and well supported to meet the expectations of HPC users. Every minute a system is unavailable for science due to scheduled or unscheduled maintenance reduces the overall availability and efficiency of the system as a whole.
The purpose of this item is to describe how OpenCHAMI-based systems can remain available for science at all times, even when dealing with unexpected maintenance activities. We also need to describe the tradeoffs necessary for degraded operation.
Example: a service like cloud-init is only needed during node boots. If boots are rare and not urgent, a site may choose not to invest the resources required for a highly available cloud-init service, and instead guarantee that a known-good cloud-init service can be restarted in under 20 seconds.
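The tradeoff in this example can be sketched as a readiness probe wrapped around a restart: rather than keeping a hot standby, measure how long the known-good service takes to come back and verify it fits the budget. This is an illustrative sketch, not OpenCHAMI code; `restart_latency` and the probe are hypothetical.

```python
import subprocess
import sys
import time

def restart_latency(cmd, probe, budget_s=20.0):
    """Start the service and poll probe() until it reports ready.

    Returns the seconds elapsed; raises TimeoutError if the budget is
    exceeded. probe is site-specific, e.g. an HTTP GET against the
    service's health endpoint (hypothetical here).
    """
    t0 = time.monotonic()
    proc = subprocess.Popen(cmd)
    while time.monotonic() - t0 < budget_s:
        if probe():
            return time.monotonic() - t0
        time.sleep(0.1)
    proc.kill()
    raise TimeoutError(f"service not ready within {budget_s}s")

# Example: a stand-in "service" that is ready as soon as it starts.
latency = restart_latency(
    [sys.executable, "-c", "import time; time.sleep(1)"],
    probe=lambda: True,
)
```

A site adopting this posture would swap the stand-in command for its real service launcher and the probe for a genuine health check.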
High Availability: Traditional Approaches
There are several traditional routes to achieving continuity of service for the overall system.
Heartbeat-style failover ensures that a second system can take over all functions quickly in the event that the first system fails.
Distributed Quorum services are the norm for cloud architectures. As long as enough servers agree on the overall state of the cluster, any of them may provide services to the compute nodes. Kubernetes uses a distributed quorum control plane.
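The two approaches above can be illustrated with a minimal sketch (hypothetical, not OpenCHAMI code): a standby that promotes itself after a number of missed heartbeats, and the strict-majority check that a quorum-based control plane performs before serving requests.

```python
import time

def has_quorum(votes, cluster_size):
    """A quorum-based control plane stays authoritative only while a
    strict majority of members agree on the cluster state."""
    return votes >= cluster_size // 2 + 1

class HeartbeatMonitor:
    """Standby that takes over after missed_limit consecutive missed
    heartbeats from the primary."""

    def __init__(self, interval_s=1.0, missed_limit=3):
        self.interval_s = interval_s
        self.missed_limit = missed_limit
        self.last_beat = time.monotonic()

    def beat(self):
        """Record a heartbeat received from the primary."""
        self.last_beat = time.monotonic()

    def should_take_over(self, now=None):
        """True once the primary has been silent too long."""
        now = time.monotonic() if now is None else now
        return now - self.last_beat > self.interval_s * self.missed_limit
```

With 3 members, `has_quorum` requires 2 agreeing votes, which is why quorum clusters are typically sized with an odd member count.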
Note: CSM uses Kubernetes extensively to maintain continuous operation of the control plane. This has raised the concern that HPC administrators must first be Kubernetes administrators in order to manage CSM-based control planes.
The microservice architecture of OpenCHAMI explicitly avoids assuming a particular deployment methodology. Sites have successfully run OpenCHAMI services as containers with Podman, with Docker, and as Kubernetes pods. In addition, the services are independently tested as standalone binaries, and it is easy to imagine running these binaries under systemd units.
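As a sketch of the systemd option, a unit file for one service might look like the following. The service name, binary path, and options here are assumptions for illustration, not a supported configuration:

```ini
# /etc/systemd/system/ochami-smd.service  (hypothetical name and paths)
[Unit]
Description=OpenCHAMI State Management Database (SMD)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/smd
# Restart automatically on failure; this gives a simple form of
# continuity without a full HA deployment.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

A `Restart=` policy like this complements the fast-restart posture described earlier: the unit does not make the service highly available, but it bounds recovery time after a crash.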
Regardless of deployment characteristics, the project needs to describe highly available operation. This objective seeks to identify the requirements for high availability and a reference implementation that meets them. Changes to services may be needed to achieve high availability; however, flexibility has been a cornerstone of the project, so high availability must remain optional and existing deployment options must remain available.