-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16645 cart: Bump file descriptor limit #15224
Conversation
With tcp provider, using many sockets can cause significant file descriptor usage. Bump the soft limit, if possible and warn if it appears insufficient. Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Ticket title is 'Automatically bump soft file descriptor limit for tcp provider' |
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
src/cart/crt_init.c
Outdated
|
||
if (rlim.rlim_cur < MIN_TCP_FD) { | ||
if (rlim.rlim_max < MIN_TCP_FD) { | ||
D_ERROR("File descriptor soft limit should be at least %d\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you are checking rlim_max, shoudlnt error read 'hard limit' instead of soft here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/cart/crt_init.c
Outdated
DS_ERROR(errno, | ||
"setrlimit() failed. Unable to bump file descriptor" | ||
" limit to suitable value (>= %d)", | ||
MIN_TCP_FD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not print rlim_max instead to know what value exactly it failed on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
src/cart/crt_init.c
Outdated
DS_ERROR(errno, | ||
"setrlimit() failed. Unable to bump file descriptor" | ||
" limit to value >= %d, limit is %lu", | ||
MIN_TCP_FD, rlim.rlim_max); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
goto next; missing here? ele on error you will print d_info below:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, good catch. I added the print later.
I thought we had some documentation that recommends setting the nopen files to 1048576 somewhere. |
you might be right and I can change it to a minimum of 1048576 if we want that. I just think we should bump it automatically if we can regardless. |
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Looks like we don't or at least I can't find it anywhere. |
yea i couldn't find it either but i know it was a requirement for using tcp on boro for example |
We've been using 131072 in our documentation for parallelstore. Should be sufficient for most cases as with our largest instance and dfuse with 8 eq would only use around 3500 fds for sockets. Anyway, I think it's a reasonable minimum and this patch will set it higher if allowed. |
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
fine with me. ill defer my +1 to @soumagne |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minus the comments that I had, the PR looks good to me overall.
src/cart/crt_init.c
Outdated
@@ -492,10 +492,14 @@ check_grpid(crt_group_id_t grpid) | |||
return rc; | |||
} | |||
|
|||
#define MIN_TCP_FD 131072 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please prefix internal macros with CRT_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/cart/crt_init.c
Outdated
/* Bump file descriptor limit if low and if possible */ | ||
rc = getrlimit(RLIMIT_NOFILE, &rlim); | ||
if (rc != 0) { | ||
DS_WARN(errno, "getrlimit() failed. Unable to check file descriptor limit"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a warn ? to me it sounds like it should be an error with appropriate errno checking ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. Looking at the man page, it doesn't look like it should ever fail in our case.
src/cart/crt_init.c
Outdated
if (rlim.rlim_cur < MIN_TCP_FD) { | ||
if (rlim.rlim_max < MIN_TCP_FD) { | ||
D_ERROR("File descriptor hard limit should be at least %d\n", | ||
MIN_TCP_FD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider that as a warning, not as an error, if let's say someone wanted for any reason to have a lower limit...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this is a warning - in effect it's asking the admin to reconfigure the node but there's no error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/cart/crt_init.c
Outdated
if (rlim.rlim_cur < MIN_TCP_FD) { | ||
if (rlim.rlim_max < MIN_TCP_FD) { | ||
D_ERROR("File descriptor hard limit should be at least %d\n", | ||
MIN_TCP_FD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this is a warning - in effect it's asking the admin to reconfigure the node but there's no error here.
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
utils/node_local_test.py
Outdated
# valgrind reduces the hard limit unless we bump the soft limit first | ||
(soft, hard) = resource.getrlimit(resource.RLIMIT_NOFILE) | ||
if soft < hard: | ||
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be better set after passing the command line at https://github.com/daos-stack/daos/blob/master/utils/node_local_test.py#L6530
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Allow-unstable-test: true Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
Required-githooks: true Signed-off-by: Jeff Olivier <[email protected]>
With tcp provider, using many sockets can cause significant file descriptor usage. Bump the soft limit, if possible and warn if it appears insufficient. Valgrind sets hard limit to soft limit, so work around that in NLT. Signed-off-by: Jeff Olivier <[email protected]>
With tcp provider, using many sockets can cause significant file descriptor usage. Bump the soft limit, if possible and warn if it appears insufficient. Valgrind sets hard limit to soft limit, so work around that in NLT. Signed-off-by: Jeff Olivier <[email protected]>
With tcp provider, using many sockets can cause significant file descriptor usage. Bump the soft limit, if possible and warn if it appears insufficient. Valgrind sets hard limit to soft limit, so work around that in NLT. Signed-off-by: Jeff Olivier <[email protected]>
With tcp provider, using many sockets can cause significant file descriptor usage. Bump the soft limit, if possible and warn if it appears insufficient.
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: