Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16038 pool: Enable pool list for clients #14575

Closed
wants to merge 11 commits into from
Closed

Conversation

mjmac
Copy link
Contributor

@mjmac mjmac commented Jun 13, 2024

Revive parts of the old client mgmt API to allow
libdaos clients to list pools in the system, but
only allow a given client to see the pools that
it could connect to.

Copy link

github-actions bot commented Jun 13, 2024

Bug-tracker data:
Errors are component not formatted correctly,Ticket number prefix incorrect,PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments,Unable to load ticket data
https://daosio.atlassian.net/browse/

@mjmac mjmac changed the title mjmac/DAOS 15982 DAOS 15982 pool: Allow libdaos clients to list pools with access Jun 13, 2024
src/engine/drpc_client.c Show resolved Hide resolved
src/mgmt/srv.c Outdated Show resolved Hide resolved
src/mgmt/srv.c Show resolved Hide resolved
src/pool/rpc.c Outdated Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

Revive parts of the old client mgmt API to allow
libdaos clients to list pools in the system, but
only allow a given client to see the pools that
it could connect to.

Required-githooks: true
Change-Id: I2b3c391ddf042b23811be8f3390ec290e92e4290
Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 43c73cc to e7db73a Compare June 15, 2024 15:49
@mjmac mjmac changed the title DAOS 15982 pool: Allow libdaos clients to list pools with access DAOS-15982 pool: Enable pool list for clients Jun 15, 2024
@daosbuild1
Copy link
Collaborator

@mjmac
Copy link
Contributor Author

mjmac commented Jun 15, 2024

Tagging early reviewers to validate the approach before I proceed:

  • @mchaarawi: I un-deprecated the list pools API for clients, is this OK, from a compatibility standpoint?
  • @kccain, @liw: Please take a look at how I implemented dc_mgmt_pool_list ... I know that the new MS isn't a proper rsvc, but it seems reasonable to me that we could make it behave like one so that we can leverage the rsvc_client functionality.
  • @kccain, @liw (, others? @johannlombardi?): Please take a look at how I added the new pool access check RPC. If this looks reasonable, I can refactor the connect handler to use the shared helper.
  • @kjacque (and @dpquigl if available for reviews): Please take a look at the credential management. I followed established patterns, so hopefully what I've done in this patch makes sense.

There is still some work to be done to enable an optional pool query for each listed pool, for parity with the dmg command. I also plan to add ftest coverage for the new daos pool list command. I'll work on that once I get a greenlight on the overall approach.

For the moment, I've got this work based on the google/2.4 branch, but I will plan to get it rebased on master for 2.8, and possibly backported for 2.6.1 if there are no compatibility issues.

@mjmac
Copy link
Contributor Author

mjmac commented Jun 15, 2024

Some example output, showing that dmg can list all pools, since it is an admin command, but the daos tool can only list pools that would allow the user to connect:

(rocky-8)mjmac@mjmac-daos-dev00:~/projects/daos$ dmg pool list
Pool      Size   State Used Imbalance Disabled
----      ----   ----- ---- --------- --------
offlimits 3.0 GB Ready 41%  0%        0/24
tank      40 GB  Ready 7%   0%        0/24

(rocky-8)mjmac@mjmac-daos-dev00:~/projects/daos$ dmg pool get-acl offlimits
# Owner: mjmac@
# Owner Group: primarygroup@
# Entries:
#   None
(rocky-8)mjmac@mjmac-daos-dev00:~/projects/daos$ daos pool list
Pool Size State Used Imbalance Disabled
---- ---- ----- ---- --------- --------
tank 0 B  Ready 0%   0%        0/0

The pool access filtering is done on the engine side, as the client environment is untrusted, and clients without connection access to a pool shouldn't know anything about its details.

@mjmac mjmac changed the title DAOS-15982 pool: Enable pool list for clients DAOS-16038 pool: Enable pool list for clients Jun 15, 2024
@kjacque kjacque self-requested a review June 18, 2024 00:46
Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked closely at the credential usage and skimmed the rest. This approach looks good to me.

@liw
Copy link
Contributor

liw commented Jun 19, 2024

Some example output, showing that dmg can list all pools, since it is an admin command, but the daos tool can only list pools that would allow the user to connect:

Since both daos and libdaos depends on daos_agent, have we considered letting the latter forward pool list requests to the MS?

@mjmac
Copy link
Contributor Author

mjmac commented Jun 19, 2024

Some example output, showing that dmg can list all pools, since it is an admin command, but the daos tool can only list pools that would allow the user to connect:

Since both daos and libdaos depends on daos_agent, have we considered letting the latter forward pool list requests to the MS?

@liw: Yep, I did consider that approach, but please note that for each pool in the system, we need to perform an ACL check using the client's credential. If the client could connect to that pool, it's included in the list returned to the client.

The logic for performing this access check is implemented in the engine, so if we used the agent as a relay, we would have to go (repeating the MS/PS calls for each pool via dRPC from daos_server):
client <-dRPC-> agent <-gRPC-> MS [<-dRPC-> engine <-CaRT-> PS]

The approach proposed in this patch looks like this (repeating the PS call for each pool from the MS rank):
client <-CaRT-> engine (MS rank, 1 dRPC upcall to get pool list) [<-CaRT-> PS]

Given that the majority of the work that needs to be done for this functionality happens in the engine, it seemed much more efficient to keep most of the logic at this level. Maybe I'm missing something, though.

Copy link
Contributor

@johannlombardi johannlombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks go to me.

@liw
Copy link
Contributor

liw commented Jun 23, 2024

@liw: Yep, I did consider that approach, but please note that for each pool in the system, we need to perform an ACL check using the client's credential. If the client could connect to that pool, it's included in the list returned to the client.

The logic for performing this access check is implemented in the engine, so if we used the agent as a relay, we would have to go (repeating the MS/PS calls for each pool via dRPC from daos_server): client <-dRPC-> agent <-gRPC-> MS [<-dRPC-> engine <-CaRT-> PS]

The approach proposed in this patch looks like this (repeating the PS call for each pool from the MS rank): client <-CaRT-> engine (MS rank, 1 dRPC upcall to get pool list) [<-CaRT-> PS]

Given that the majority of the work that needs to be done for this functionality happens in the engine, it seemed much more efficient to keep most of the logic at this level. Maybe I'm missing something, though.

Right, the "constraint of the design"... Is that an external requirement for the solution or one that is required by the current approach? (One reason I can think of for that to be an external requirement is: We might not want a user to see the labels of the pools that he cannot access.) Just want to make sure this requirement is a must have.

Suppose we must filter out pools a user cannot access, would it be nice if we could consider this type of listing operation to be an extension of dmg pool list -v? "The MS lists pools and contact each one of the PSs" seems to be a common pattern potentially shared for future MS features. The pool access properties could be returned to the MS (similar to those for the -v in dmg pool list -v) or the agent to check the client credential, while the client credential is originated from the agent anyway. What do you think?

@kjacque
Copy link
Contributor

kjacque commented Jun 24, 2024

Suppose we must filter out pools a user cannot access, would it be nice if we could consider this type of listing operation to be an extension of dmg pool list -v? "The MS lists pools and contact each one of the PSs" seems to be a common pattern potentially shared for future MS features. The pool access properties could be returned to the MS (similar to those for the -v in dmg pool list -v) or the agent to check the client credential, while the client credential is originated from the agent anyway. What do you think?

One thing to keep in mind is that all ACL processing code is currently on the server side. We could change this and make it common code (which I would recommend over trying to re-implement this logic in Go). In that case we may need to expose it via libdaos to make it accessible to the agent.

@mjmac
Copy link
Contributor Author

mjmac commented Jun 24, 2024

We might not want a user to see the labels of the pools that he cannot access.) Just want to make sure this requirement is a must have.

This is absolutely a must-have requirement. Users must not have any visibility into or knowledge about pools to which they do not have access. The pool label may reveal sensitive information about the contents or use of the pool, for example. As we cannot trust anything on the client side of the client API, we can't do the filtering in libdaos.

In order to accomplish this filtering in the control plane only, I think the following would need to happen:

  • As Kris points out, we would need to make the ACL handling code available outside of the ds_sec library and add Go bindings for it
  • We'd need to add a pool list handler to the agent, which would query the MS for the list of pools, and then query each pool for its ACLs (via get-prop?)
  • After obtaining the ACL for a pool, the agent would use the ACL handling code to determine whether or not the client could connect to that pool, and filter out those to which the client does not have access

Pros:
* Minimizes changes to daos_engine code (and less C for me to mangle)
* Keeps most of the work being done closer to the client
* Some opportunities for caching in the agent? Could cache the list of pools and their ACLs for short periods of time to handle "bursty" client behavior

Cons:
* Requires significant refactoring to make security-sensitive ACL handling code available to the agent
* Continues to expand the agent's scope of responsibility
* Uses management network (agent <-> MS <-> PS) instead of storage fabric, for client-initiated traffic
* Potential races due to agents having stale pool data?

@mjmac
Copy link
Contributor Author

mjmac commented Jun 24, 2024

Regardless of exactly how we implement the backend side of the pool list capability, I think the front-end work can proceed. Before I go further down that road, however, I would appreciate some feedback from @mchaarawi on the API changes ... I'd prefer to avoid finding out that it's a no-go or that there was a better way to do it after I've sunk more time into it.

Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no concerns; but just needs some required small updates for the task stuff

{dc_deprecated, 0},
{dc_mgmt_pool_list, sizeof(daos_mgmt_pool_list_t)},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you really need to add a user task API for this function though? I'm not really sure this is required and you can probably not worry about changes in the task stuff.
otherwise changes here are not enough and you would need to add a few more things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I was following the pattern set by @kccain when he added the original implementation of pool list. I haven't really dug into it beyond that.

Given that I don't feel strongly either way, I'm happy to follow your guidance on the best way to implement this feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mchaarawi: Ping... Please advise on the recommended approach here. I'm getting ready to rebase all of this work on master, and would prefer to push that PR in as close to a final state as possible rather than iterating on it.

Also, I don't see much in the way of client API tests. Not particularly keen to invent all of that myself, but happy to work with someone who knows more about this side of things. I am planning to add ftests for this feature -- would that be acceptable in lieu of new client API unit tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i didn't know you were still looking for my review on this. as i mentioned, you should just avoid adding this change with the task api, otherwise you still need to update other structures to support that.
for client api tests, im not sure what you mean. i believe you should add some tests there:
https://github.com/daos-stack/daos/blob/master/src/tests/suite/daos_mgmt.c
but if you want to use ftest, that is fine too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i didn't know you were still looking for my review on this. as i mentioned, you should just avoid adding this change with the task api, otherwise you still need to update other structures to support that.

OK, is the task API deprecated? I have a superficial understanding of the client API's workings, so as I said I was just following the existing patterns. If I didn't use the task API, would I basically just rework daos_mgmt_list_pools() to just call dc_mgmt_pool_list() directly? Is there any downside to doing it this way?

for client api tests, im not sure what you mean. i believe you should add some tests there: https://github.com/daos-stack/daos/blob/master/src/tests/suite/daos_mgmt.c but if you want to use ftest, that is fine too.

Ah, OK. I was thinking of unit tests under src/client/api/tests. I'll just rework the suite test to use the re-added client API instead of the dmg helper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, is the task API deprecated? I have a superficial understanding of the client API's workings, so as I said I was just following the existing patterns.

no the task api is not deprecated. but to add the task API entry for this function, you still need to update:
https://github.com/daos-stack/daos/blob/mjmac/DAOS-15982/src/include/daos/task.h
to be safe.
otherwise you can just remove the task api entry for it, and do as you suggested:

If I didn't use the task API, would I basically just rework daos_mgmt_list_pools() to just call dc_mgmt_pool_list() directly? Is there any downside to doing it this way?

either way is fine for me.

Comment on lines +56 to +57
/** Pool label */
d_string_t mgpi_label;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to say that this is an API breaking change. but it sounds like we only used this in dmg functions, so probably no user was using this before.

Comment on lines +82 to +85
int
daos_mgmt_list_pools(const char *group, daos_size_t *npools, daos_mgmt_pool_info_t *pools,
daos_event_t *ev);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no API description. so should add that.
I assume one can query the size with NULL pools buf, or provide an estimated buffer of certain size and the call return the actual size in npools?
this also needs some C tests added.

Copy link
Contributor Author

@mjmac mjmac Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no API description. so should add that.

Done.

I assume one can query the size with NULL pools buf, or provide an estimated buffer of certain size and the call return the actual size in npools?

Correct. This is following the pattern established by other pool client APIs.

this also needs some C tests added.

Ack. Will add them.

@daosbuild1
Copy link
Collaborator

@mjmac
Copy link
Contributor Author

mjmac commented Jun 26, 2024

After some discussion with @liw and @kccain (thanks!), I'm going to explore the idea of doing the pool access check in the list handler instead of making a call with the new "can this client access the pool" RPC. It would still require N RPCs to get the pool ACLs, but wouldn't require a new server<->server RPC that is probably only useful for this specific scenario.

One potential downside to this approach is that it would lose the ability to delegate the connectability decision to the pool service. In scenarios where a pool is queryable for its properties but would not allow clients to connect (or, indeed, wouldn't allow a specific client to connect even if the ACL allows it), the client would receive that pool in the list of connectable pools. This may be fine as an edge case.

@daosbuild1
Copy link
Collaborator

@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from e092306 to 417ddec Compare June 27, 2024 20:25
@daosbuild1
Copy link
Collaborator

@mjmac
Copy link
Contributor Author

mjmac commented Jun 27, 2024

One potential downside to this approach is that it would lose the ability to delegate the connectability decision to the pool service. In scenarios where a pool is queryable for its properties but would not allow clients to connect (or, indeed, wouldn't allow a specific client to connect even if the ACL allows it), the client would receive that pool in the list of connectable pools. This may be fine as an edge case.

Can you provide an example of where we could hit a case like this?

Not offhand... I included that thought in case it seemed relevant to someone else. I think it's an acceptable edge case. The user-visible result of this would be a pool that they can't actually access, which would be annoying but not the end of the world, IMO.

I thought we would need to initiate a pool connect in order to perform the pool query at all.

Not when the pool query is being done from a server. There are server<->server RPCs that don't require a pool handle.

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14575/10/testReport/

mjmac added 2 commits June 28, 2024 01:09
Quick-functional: true
Test-tag: ListClientPoolsTest
Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 417ddec to 8b4158a Compare June 28, 2024 01:15
@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14575/11/execution/node/1270/log

@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 8b4158a to 3a8e76d Compare June 28, 2024 13:34
@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14575/12/execution/node/1270/log

@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 3a8e76d to d5a67e8 Compare June 28, 2024 17:30
@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14575/13/testReport/

@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from d5a67e8 to 528280c Compare June 28, 2024 21:12
@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14575/14/execution/node/1270/log

@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 528280c to 63aa2fe Compare June 28, 2024 23:24
@daosbuild1
Copy link
Collaborator

@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 63aa2fe to 9900c7d Compare June 29, 2024 01:42
@daosbuild1
Copy link
Collaborator

@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 9900c7d to 9c0412f Compare June 29, 2024 01:45
@daosbuild1
Copy link
Collaborator

Skip-NLT: true
Skip-unit-test: true
Quick-functional: true
Test-tag: test_list_pools
Required-githooks: true
Change-Id: I829dd6e5d26ea8ff3f5c9ff0da758dcc92e90774
Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac force-pushed the mjmac/DAOS-15982 branch from 9c0412f to ad2d948 Compare June 29, 2024 04:37
@daosbuild1
Copy link
Collaborator

@mjmac
Copy link
Contributor Author

mjmac commented Jun 29, 2024

Alright, I think I've addressed all of the preliminary feedback and I've finished iterating on tests. Moving over to #14672 for the master PR.

@mjmac mjmac closed this Jun 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants