[RFD] Support for Power Management #64

Open
cjh1 opened this issue Nov 8, 2024 · 2 comments
Labels
rfd Request for Discussion

Comments

cjh1 (Member) commented Nov 8, 2024

Support for Power Management

This RFD lays out a path to adding power management to OpenCHAMI.

What needs to change:

Currently, OpenCHAMI doesn't have a way to perform basic power management operations such as powering on a set of nodes. A well-integrated power management service is an important component of a system management platform.

What do you propose?

The proposal is to bring the existing Cray-HPE Power Control Service into the OpenCHAMI project.

Starting from an existing code base that integrates with SMD seems like a more pragmatic approach than starting from scratch. It also has the advantage that sites with existing Cray-HPE hardware can reuse integrations with the existing PCS API. In general, the PCS API seems pretty functional, and many of the issues discussed below are the result of the implementation of the command line tools that use the PCS API, rather than deficiencies in the service itself.

In line with the transition of SMD to the OpenCHAMI project, the following set of changes would be performed initially:

  • The vendor directory will be removed and the Go version will be updated.
  • The release handling will be updated to use goreleaser and publish containers to ghcr.io.
  • The mux router will be switched out for chi to be consistent with other OpenCHAMI codebases (a minimal sketch follows this list).
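
A minimal sketch of what the chi-based routing could look like is below. The route paths and handler names are placeholders for illustration, not the actual PCS endpoints:

```go
package main

import (
	"net/http"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/chi/v5/middleware"
)

func main() {
	r := chi.NewRouter()
	r.Use(middleware.Logger)
	r.Use(middleware.Recoverer)

	// Hypothetical PCS-style routes; the real paths are defined by the PCS API spec.
	r.Route("/v1", func(r chi.Router) {
		r.Get("/transitions", listTransitions)
		r.Post("/transitions", createTransition)
		r.Get("/transitions/{transitionID}", getTransition)
	})

	// Port chosen arbitrarily for the sketch.
	http.ListenAndServe(":8080", r)
}

func listTransitions(w http.ResponseWriter, r *http.Request)  { w.WriteHeader(http.StatusOK) }
func createTransition(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusCreated) }
func getTransition(w http.ResponseWriter, r *http.Request)    { w.WriteHeader(http.StatusOK) }
```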

PCS and its tooling do have some pain points that will serve as a bug/feature list for future development.

Here are a few of the top issues raised by NERSC staff:

  • Quite frequently, the API reports success (HTTP 200) but there is an error talking to Redfish, and the underlying failure is not propagated back to the operator. In the case of SLURM, daemon retry logic has been added to try to overcome this flakiness. Sometimes operators have to call the Redfish interface directly, but this is rare.

    PCS is 'imperative' (go do this action) by design rather than 'declarative' (maintain this state), so it's unlikely that we would add this sort of retry logic to PCS. However, ensuring that errors are correctly propagated to the API would allow other tools to be built with a more declarative view of the system (a sketch of a per-component error payload follows this list).

  • When interacting with BMCs in any form (via Redfish or IPMI or whatever), they don't always listen to you the first time. Some implementations of power control will re-send the same request several times and ask the BMC what it thinks happened.

    This is somewhat similar to the previous point and is probably out of scope for PCS. However, PCS needs to provide accurate information about how the BMCs respond to requests.

  • The PCS's view of what can be capped is sometimes incorrect. Do we have more details on this?
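
To make the error-propagation point concrete, a transition status payload that surfaces per-component Redfish failures could look roughly like the following Go types. The field names here are illustrative and are not the actual PCS schema:

```go
package pcs

// TransitionStatus is a sketch of a per-component status report for one transition.
// Field names are hypothetical; the real PCS schema may differ.
type TransitionStatus struct {
	TransitionID string            `json:"transitionID"`
	Operation    string            `json:"operation"` // e.g. "on", "off", "soft-restart"
	Tasks        []ComponentResult `json:"tasks"`
}

// ComponentResult records what actually happened for a single component, so a
// Redfish failure is not silently collapsed into a blanket HTTP 200.
type ComponentResult struct {
	Xname      string `json:"xname"`                // component ID, e.g. a node xname
	Status     string `json:"status"`               // "succeeded", "failed", "in-progress"
	StatusCode int    `json:"statusCode,omitempty"` // HTTP status returned by the BMC, if any
	Error      string `json:"error,omitempty"`      // underlying Redfish/BMC error text
}
```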

The following fit more into the category of feature requests:

  • Progress tracking

    • The API currently provides an ID for each transition, which then has to be polled with another command invocation to check the status. This is somewhat cumbersome for operators; it would be good to have an "execute and monitor" mode. This would include progress-bar features to get an idea of what percentage of nodes have successfully booted, failed, or are still in progress, without the details of which specific nodes. The point-in-time output of cray power transition describe, which you have to scroll through, is not very useful for such high-level monitoring.

    The presentation of progress is probably out of scope for PCS. However, providing an event stream associated with a transition would allow us to write more useful tools that could provide this sort of progress information without the need to resort to polling. One possible approach would be to add SSE or websockets to the API to allow a client to subscribe to specific events (see the sketch after this list).

  • Retry logic on server side

    • Implement a queue and retry logic on the PCS service side.

    This is probably outside the scope of PCS as it was designed. However, it could be implemented by a service built on top of PCS.

  • Queuing of transitions

    • Currently, if a transition is issued while another transition is already in progress, the request is rejected (looking at the code this shouldn't happen, as it should lock the components with reservations?). This can happen, for example, if a SLURM command has been issued and one or more operators also issue a command. These transitions/requests could be queued, allowing them to be serialized. Operators would need the ability to view the queue of transitions/requests on a per-node basis.

    More investigation is needed to understand how queuing/serialization of transitions could be implemented.
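
As a sketch of the event-stream idea from the progress tracking item above, an SSE endpoint in chi could let a client subscribe to a transition's events instead of polling. The route, event format, and eventsForTransition helper are all assumptions for illustration and do not exist in PCS today:

```go
package pcs

import (
	"fmt"
	"net/http"

	"github.com/go-chi/chi/v5"
)

// TransitionEvent is a hypothetical event emitted as a transition progresses.
type TransitionEvent struct {
	Xname  string // component the event refers to
	Status string // e.g. "succeeded", "failed", "in-progress"
}

// eventsForTransition is a placeholder for something PCS would need to provide;
// here it just returns a closed channel so the sketch compiles.
func eventsForTransition(id string) <-chan TransitionEvent {
	ch := make(chan TransitionEvent)
	close(ch)
	return ch
}

// streamTransitionEvents would be registered as, for example,
// r.Get("/transitions/{transitionID}/events", streamTransitionEvents).
func streamTransitionEvents(w http.ResponseWriter, r *http.Request) {
	transitionID := chi.URLParam(r, "transitionID")

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	events := eventsForTransition(transitionID)
	for {
		select {
		case <-r.Context().Done():
			return // client went away
		case ev, open := <-events:
			if !open {
				return // transition finished, channel closed
			}
			fmt.Fprintf(w, "event: transition\ndata: %s %s\n\n", ev.Xname, ev.Status)
			flusher.Flush()
		}
	}
}
```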

Longer term goals

Transition to a cell-based architecture

In line with wider discussions (#41) across the collaboration, we should look at how we could transition away from a single PCS instance to multiple independent instances, for example one instance per cabinet, thus reducing the size of the failure domain. Given the imperative nature of PCS, it should be amenable to a cellular deployment.

Transition away from TRS

PCS currently uses the HMS Task Runner Service (TRS) to parallelize operations, for example sending requests to BMCs. It uses Kafka to queue tasks that can then be processed by workers. TRS doesn't seem to be under active development, so it would be a good idea to move to a community-supported alternative, of which there are many. Here are just a few:

  • asynq
  • machinery (not sure how active this is)
  • taskq
  • river (a relatively new one)

An analysis will need to be performed to select an alternative that matches the needs of PCS. Moving away from TRS would allow us to leverage the features of a modern task queue and reduce the maintenance burden of maintaining TRS alongside PCS. TRS also has a "local" mode that uses goroutines; this may be enough to support the requests generated by PCS and would reduce the amount of TRS code that would need to be maintained.
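
If goroutines alone turn out to be sufficient, the fan-out to BMCs could be bounded with nothing more than the standard library. This is a sketch under that assumption; sendPowerRequest is a stand-in for the Redfish call, not real PCS code:

```go
package pcs

import (
	"context"
	"sync"
)

// Result pairs a BMC endpoint with the outcome of its power request.
type Result struct {
	BMC string
	Err error
}

// fanOut sends a power request to each BMC with at most limit concurrent
// requests, using only goroutines and a buffered channel as a semaphore.
func fanOut(ctx context.Context, bmcs []string, limit int,
	sendPowerRequest func(context.Context, string) error) []Result {

	results := make([]Result, len(bmcs))
	sem := make(chan struct{}, limit) // bounds concurrency
	var wg sync.WaitGroup

	for i, bmc := range bmcs {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(i int, bmc string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = Result{BMC: bmc, Err: sendPowerRequest(ctx, bmc)}
		}(i, bmc)
	}
	wg.Wait()
	return results
}
```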

Look at moving to PostgreSQL for state storage

OpenCHAMI's SMD implementation transitioned away from etcd as its persistent backend store because etcd was a big contributor to unplanned outages at LANL. This has not been our experience with PCS at NERSC. However, looking at the implementation of the storage provider for PCS, it does look like it would be amenable to a relational implementation if this were necessary. Another approach that might be worth considering is to use node-local storage with snapshotting, like the experiments in quack (https://github.com/OpenCHAMI/quack). This might fit nicely given that the power control state can be regenerated relatively easily.
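
As a rough sketch of what a relational storage provider could look like, assuming a minimal slice of the interface and an invented table layout (the actual PCS storage interface is larger and the schema would need real design work):

```go
package storage

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver; pgx would work equally well
)

// TransitionStore is a deliberately tiny slice of what a PCS storage
// provider would need; it exists only to illustrate the shape.
type TransitionStore interface {
	SaveTransition(ctx context.Context, id, operation, status string) error
	GetTransitionStatus(ctx context.Context, id string) (string, error)
}

// PostgresStore is a sketch of a relational implementation.
type PostgresStore struct {
	db *sql.DB
}

func NewPostgresStore(db *sql.DB) *PostgresStore { return &PostgresStore{db: db} }

func (s *PostgresStore) SaveTransition(ctx context.Context, id, operation, status string) error {
	// Table and column names are invented for this sketch.
	_, err := s.db.ExecContext(ctx,
		`INSERT INTO transitions (id, operation, status)
		 VALUES ($1, $2, $3)
		 ON CONFLICT (id) DO UPDATE SET status = EXCLUDED.status`,
		id, operation, status)
	return err
}

func (s *PostgresStore) GetTransitionStatus(ctx context.Context, id string) (string, error) {
	var status string
	err := s.db.QueryRowContext(ctx,
		`SELECT status FROM transitions WHERE id = $1`, id).Scan(&status)
	return status, err
}
```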

Operator-facing tools

PCS provides an API that can be used to build operator facing tools needed to perform power transitions. cray power is one of the current clients of PCS. Many of the issues/feature requests raised would be implemented in client tools. The intent would be to implement a new command line interface to PCS that addresses these needs. Another RFD would be submitted to provide a detailed discussion of such a tool.

What alternatives exist?

  • Decide that power management is out of scope for OpenCHAMI and recommend integration with other tools, such as:

    • powerman
    • xCAT (rpower)
    • IPMI

    The downside of using these external tools is that they lack integration with SMD, for example creating a reservation for nodes that are being shut down.

  • Start from scratch and implement a new microservice from the ground up. This would avoid carrying any technical debt from PCS; however, it would involve significant development effort.

cjh1 added the rfd Request for Discussion label Nov 8, 2024
cjh1 mentioned this issue Nov 8, 2024
alexlovelltroy (Member) commented:

This is great! I have a few comments.

  1. Scoping of PCS as a purely imperative server is appropriate for an initial attempt to add the service to OpenCHAMI and preserve some set of backwards compatibility, but we should open the discussion of how/if we would like it to evolve to encompass more scope. One of the issues we've seen with CSM is that the proliferation of microservices makes it hard to make a change because of the many microservices involved. If an individual module/microservice/etc isn't valuable alone, it may be too narrowly scoped.

  2. Be careful of replacing TRS with another task queuing system if we can handle the expected scale purely with goroutines. The best way to address distributed systems problems is to avoid them.

  3. On the LANL side, we're exploring some additional client paradigms that we haven't published yet. They kind of look like kubectl's ability to extend the CLI with additional modules as needed. We should collaborate on an RFD to describe it when the time comes.

cjh1 (Member, Author) commented Nov 8, 2024

  1. Scoping of PCS as a purely imperative server is appropriate for an initial attempt to add the service to OpenCHAMI and preserve some set of backwards compatibility, but we should open the discussion of how/if we would like it to evolve to encompass more scope. One of the issues we've seen with CSM is that the proliferation of microservices makes it hard to make a change because of the many microservices involved. If an individual module/microservice/etc isn't valuable alone, it may be too narrowly scoped.

Absolutely, I was just trying to keep the scope of this RFD manageable, but I think you make a good point.

  2. Be careful of replacing TRS with another task queuing system if we can handle the expected scale purely with goroutines. The best way to address distributed systems problems is to avoid them.

Agreed, the best option would be to use goroutines. The first step will be to check whether they can support the load; if they can, we are done.

  3. On the LANL side, we're exploring some additional client paradigms that we haven't published yet. They kind of look like kubectl's ability to extend the CLI with additional modules as needed. We should collaborate on an RFD to describe it when the time comes.

Yes, I saw that you had started to add some CLIs. Having a single command with subcommands/modules is a good approach. I am a fan of Typer and have used it to build CLIs like that before. I would be happy to collaborate on an RFD for that.
