[RFD] Remote Console Service #67

Open
evanmcc opened this issue Dec 9, 2024 · 2 comments
Labels
rfd Request for Discussion

Comments

@evanmcc
evanmcc commented Dec 9, 2024

OpenCHAMI does not at present have any opinion about low-level system access like console services and power control. While smaller systems may be able to manage their remote console infrastructure manually, larger systems will benefit from easy-to-integrate solutions that are reasonably easy to run, respect the sources of truth in the larger system, and update themselves automatically as the system changes over time.

What do you propose?

The choice of SMD for hardware state storage and management makes it easier to pull over and modify higher-value CSM components like the console system and PCS, rather than implementing something from a blank slate.

As such, we propose pulling in a simplified and adapted version of the conmand-based CSM console service. The specific simplifications proposed are:

  • All parts of the service run in the same container.
  • The database dependency and system for dividing various nodes into many different containers will be dropped.
  • Middleware, metrics, releases, and logging will be brought into compliance with OpenCHAMI standards.

The primary reason for the simplification is that, by adding an additional source of truth, the existing console service over-complicates itself and introduces an additional source of state-sync bugs. While the state of the system changes slowly and these bugs might be rare, the additional implementation complexity doesn't seem worth the effort for systems below 10k nodes. We instead propose manual sharding based on failure-domain information in SMD, allowing fine control without the unpredictability of shuffling nodes around. Once the scaling limits of the system are understood, failure domains can be assigned to individual instances of the service, allowing an instance to claim the nodes in that failure domain.
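
To make the sharding idea concrete, below is a minimal Go sketch of an instance claiming the nodes in one failure domain, assuming failure domains are modeled as SMD groups. The endpoint shape follows SMD's group API, but the base URL and group label here are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// groupMembers mirrors the members object returned by SMD's group API.
type groupMembers struct {
	IDs []string `json:"ids"` // node xnames in this failure domain
}

// claimDomain asks SMD for the members of one group ("failure domain").
func claimDomain(smdBase, domain string) ([]string, error) {
	resp, err := http.Get(fmt.Sprintf("%s/hsm/v2/groups/%s/members", smdBase, domain))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var m groupMembers
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		return nil, err
	}
	return m.IDs, nil
}

func main() {
	// "fd-rack-17" is a hypothetical failure-domain label.
	nodes, err := claimDomain("http://smd:27779", "fd-rack-17")
	if err != nil {
		panic(err)
	}
	fmt.Printf("this console instance owns %d nodes\n", len(nodes))
}
```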

In addition to porting and simplifying, we would like to add a few features, most notably endpoints for quickly tailing logs and getting read/write access with a single command, rather than the existing setup, which requires finding the node, logging onto it, and then running conman commands on it.
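
As a rough illustration of the quick-tail idea, here is a hedged Go sketch of such an endpoint. The /console/v1/tail/{xname} path, the log file layout, and shelling out to tail are all assumptions for illustration, not the proposed design.

```go
package main

import (
	"fmt"
	"net/http"
	"os/exec"
)

// flusher flushes after every write so clients see console lines live.
type flusher struct{ w http.ResponseWriter }

func (f flusher) Write(p []byte) (int, error) {
	n, err := f.w.Write(p)
	if fl, ok := f.w.(http.Flusher); ok {
		fl.Flush()
	}
	return n, err
}

func tailHandler(w http.ResponseWriter, r *http.Request) {
	xname := r.PathValue("xname") // validate against SMD in a real service
	w.Header().Set("Content-Type", "text/plain")
	// Shelling out to tail -F keeps the sketch short; a real service would
	// follow the file in-process and handle rotation itself.
	cmd := exec.CommandContext(r.Context(), "tail", "-F",
		fmt.Sprintf("/var/log/conman/console.%s", xname))
	cmd.Stdout = flusher{w}
	_ = cmd.Run() // ends when the client disconnects and the context is canceled
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("GET /console/v1/tail/{xname}", tailHandler)
	_ = http.ListenAndServe(":8080", mux)
}
```

With something like this in place, `curl -N console-host:8080/console/v1/tail/<xname>` would follow a node's console in one command, without logging onto the console host first.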

What alternatives exist?

  1. XCat2's goconserver has quite a few features, but the code looks complex and I do not know of anyone who is using it; it would be good to find and talk to some users of this tool to better evaluate their experience with it.

  2. Greenfield service: Since most modern BMCs allow remote console over SSH in addition to IPMI, it might be interesting to remove conman from the system entirely, simplifying it still further. A simple prototype has been constructed, but concerns about backwards compatibility and the level of effort involved in making the system talk to BMCs and be a good native k8s citizen point towards adapting and simplifying the existing system. If there is time, conman could be removed and direct connections added to the simplified system (a rough sketch of such a direct connection follows this list).

  3. Sidecar service: this idea has two parts: first, fix conmand so that it can re-read its configuration file on SIGHUP; then, cut the service down to something that runs in the same container as conman and periodically updates its config file. This is appealing in that it is very simple, but it requires an unknown level of effort to fix the reload issues in conman (see the config-regeneration sketch under "Reconnection time" below).
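
To illustrate the direct-connection idea in alternative 2, the sketch below attaches to a BMC's serial console over SSH using golang.org/x/crypto/ssh. The BMC hostname and credentials are placeholders, and it assumes the BMC drops straight into the console on login; many BMCs instead require a vendor-specific command to enter SoL.

```go
package main

import (
	"os"

	"golang.org/x/crypto/ssh"
)

func main() {
	cfg := &ssh.ClientConfig{
		User:            "root",
		Auth:            []ssh.AuthMethod{ssh.Password("changeme")}, // placeholder credentials
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),                // verify host keys in production
	}
	client, err := ssh.Dial("tcp", "x3000c0s1b0:22", cfg) // hypothetical BMC hostname
	if err != nil {
		panic(err)
	}
	defer client.Close()

	sess, err := client.NewSession()
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	// Wire the remote console to our stdio and request a pty.
	sess.Stdin, sess.Stdout, sess.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := sess.RequestPty("xterm", 40, 120, ssh.TerminalModes{}); err != nil {
		panic(err)
	}
	if err := sess.Shell(); err != nil {
		panic(err)
	}
	_ = sess.Wait() // blocks until the console session ends
}
```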

Other considerations

Failure domains: Thus far there is no support for this. I think that for larger systems, adding support for failure domains is probably the right way to shard the system, rather than trying to scale dynamically forever. Each console instance can "own" a failure domain, limiting its SMD queries to that domain; smaller and simpler systems can run a single domain and console server, while larger ones run several. This removes the need for dynamic lookup: failure domains should be static and known, so operators will know which console server covers a given failure domain.

Reconnection time: Due to its age, conmand seems to require a restart to re-read its configuration, so adding new targets means restarting the service, which could lead to downtime for logs. This is probably something we can fix in conman, should it become a major issue for a user, or something we could work around by having the service own its own connections and config, as mentioned in the greenfield service section.
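
A minimal sketch of the config-regeneration half of the sidecar idea, assuming the node-to-BMC mapping comes from SMD: the console lines use ConMan's IPMI SoL device syntax, while the pidfile handling and paths are illustrative. Until conmand's reload is fixed, the SIGHUP here would have to be a full restart instead.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// renderConf builds a conman.conf with one IPMI SoL console per node.
func renderConf(bmcs map[string]string) string {
	var b strings.Builder
	b.WriteString("server logdir=\"/var/log/conman\"\n")
	for node, bmc := range bmcs {
		// ConMan's IPMI Serial-Over-LAN device syntax.
		fmt.Fprintf(&b, "console name=\"%s\" dev=\"ipmi:%s\"\n", node, bmc)
	}
	return b.String()
}

func main() {
	// Hypothetical mapping; in practice fetched from SMD on a timer.
	bmcs := map[string]string{"x3000c0s1b0n0": "x3000c0s1b0"}
	if err := os.WriteFile("/etc/conman.conf", []byte(renderConf(bmcs)), 0o600); err != nil {
		panic(err)
	}
	conmandPid := 1234 // illustrative; read from a pidfile in practice
	_ = syscall.Kill(conmandPid, syscall.SIGHUP)
}
```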

Console file export: In addition to keeping site-local console files, most sites will want to export those files to some external system (e.g. Loki, VictoriaLogs) for alerting and analysis. There are a number of ways to do this, and I am not sure whether having an opinion here is in scope for OpenCHAMI.

Scaling: We need to better characterize the scaling limits of conman before we start relying on it at much higher scales. The data rates required of this service are relatively low, but there may be architectural issues keeping it from reaching its scaling potential. We could fix these, or this might push us towards the greenfield approach, which would make it easier to take advantage of additional cores.

@evanmcc
Author

evanmcc commented Dec 9, 2024

Note that NERSC needs this in the near term and is likely to move forward quickly on the basic proposal, focusing first on the pieces all approaches have in common (exterior packaging, testing, config reading, etc.). Once a viable replacement service is in place and tested, we'll begin work on the improvements the project feels are most important.

@alexlovelltroy added the rfd Request for Discussion label Dec 10, 2024
@Masber
Masber commented Dec 11, 2024

Hi,

I work for CSCS, and I’m very interested in the initiative to port the current node console solution from CSM to OCHAMI.

For some context: we’ve been developing a CLI tool called Manta for CSM, and we’re currently integrating OCHAMI as an additional backend. One of the key features we offer is the ability to connect to a node's console through the conman container included in CSM, and we’d like to replicate this experience for clusters supported by OCHAMI.

Regarding the goal to:

In addition to porting and simplifying, we would like to add a few features, most notably endpoints ... and getting read/write access with a single command.

This may involve disabling the terminal driver on the client’s machine, similar to how kubectl exec -it or docker exec -it works, to ensure a seamless user experience.
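
For reference, here is a minimal Go sketch of that raw-terminal handling using golang.org/x/term; the escape sequence in the message is conman's default and purely illustrative here.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/term"
)

func main() {
	fd := int(os.Stdin.Fd())
	old, err := term.MakeRaw(fd) // disable local echo and line buffering
	if err != nil {
		panic(err)
	}
	defer term.Restore(fd, old) // always restore the terminal on exit

	// ... attach stdin/stdout to the remote console stream here ...
	fmt.Print("connected; type &. to disconnect\r\n") // \r\n because we are in raw mode
}
```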

I’d be happy to discuss this further and/or help test this functionality.
