DAOS-16645 cart: Bump file descriptor limit #15224

Merged 29 commits on Oct 21, 2024
Commits
6f723ee
DAOS-16645 cart: Bump file descriptor limit
jolivier23 Sep 30, 2024
cce4e6d
clang-format
jolivier23 Sep 30, 2024
c7f7403
One more checkpatch issue
jolivier23 Sep 30, 2024
6c49a8e
Address review comments
jolivier23 Sep 30, 2024
28b70ad
Address review comment
jolivier23 Sep 30, 2024
5df6142
Allow-unstable-test: true
jolivier23 Sep 30, 2024
772d239
Address review comments
jolivier23 Oct 1, 2024
861b4cf
Since this adds a new warning, empty commit for pragma
jolivier23 Oct 1, 2024
e7c32be
Just a test to see if we can get around the extra NLT warning (#15231)
jolivier23 Oct 5, 2024
d9cee76
Merge branch 'master' into jvolivie/setrlimit
jolivier23 Oct 5, 2024
c63e298
Still fails in once case and not sure why
jolivier23 Oct 5, 2024
39bc7d9
change behavior for super user
jolivier23 Oct 5, 2024
9b5439d
fix typo
jolivier23 Oct 5, 2024
2312afc
clang format fix
jolivier23 Oct 5, 2024
5449989
Ok, not sure what is going on here. I have yet to
jolivier23 Oct 5, 2024
72bb00a
Merge branch 'master' into jvolivie/setrlimit
jolivier23 Oct 14, 2024
92c3afc
try another
jolivier23 Oct 15, 2024
e85614a
Bump rlimit to max if using valgrind
jolivier23 Oct 16, 2024
10e786c
Merge branch 'master' into jvolivie/setrlimit
jolivier23 Oct 16, 2024
c80b533
add extra commit
jolivier23 Oct 16, 2024
63b1496
remove superfluous parens
jolivier23 Oct 17, 2024
ed1d48a
Merge branch 'master' into jvolivie/setrlimit
jolivier23 Oct 17, 2024
bf5aee8
Move rlimit setting to main
jolivier23 Oct 17, 2024
2465445
retrigger
jolivier23 Oct 17, 2024
89f6a11
try again
jolivier23 Oct 17, 2024
99dbd33
Test again
jolivier23 Oct 18, 2024
808030f
Merge branch 'master' into jvolivie/setrlimit
jolivier23 Oct 18, 2024
560f030
test again
jolivier23 Oct 18, 2024
782c700
try again
jolivier23 Oct 18, 2024
4 changes: 4 additions & 0 deletions ci/unit/test_nlt_node.sh
@@ -37,5 +37,9 @@ pip install --requirement requirements-utest.txt

pip install /opt/daos/lib/daos/python/

# set high open file limit in the shell to avoid extra warning
sudo prlimit --nofile=1024:262144 --pid $$
prlimit -n

./utils/node_local_test.py --max-log-size 1700MiB --dfuse-dir /localhome/jenkins/ \
--log-usage-save nltir.xml --log-usage-export nltr.json all
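
The hard limit requested above (262144) is larger than the 131072 threshold that `CRT_MIN_TCP_FD` defines in the `crt_init.c` hunk of this PR, which is presumably why it avoids the extra warning. A minimal Python sketch of that check follows. The helper names `parse_nofile_spec` and `hard_limit_sufficient` are hypothetical and not part of DAOS, and the parser only handles the plain numeric `soft:hard` form used here.

```python
import resource

# Threshold copied from CRT_MIN_TCP_FD in the crt_init.c hunk of this PR.
CRT_MIN_TCP_FD = 131072


def parse_nofile_spec(spec):
    """Parse a numeric 'soft:hard' string like the one given to prlimit --nofile.

    Hypothetical helper: other forms accepted by prlimit are not handled.
    """
    soft, hard = spec.split(":")
    return int(soft), int(hard)


def hard_limit_sufficient(hard, minimum=CRT_MIN_TCP_FD):
    """Return True when the hard limit is unlimited or at least the minimum."""
    return hard == resource.RLIM_INFINITY or hard >= minimum
```

For example, `hard_limit_sufficient(parse_nofile_spec("1024:262144")[1])` evaluates the limits set by the CI script, and `resource.getrlimit(resource.RLIMIT_NOFILE)` supplies the values for the running process.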
58 changes: 58 additions & 0 deletions src/cart/crt_init.c
@@ -492,6 +492,61 @@ check_grpid(crt_group_id_t grpid)
return rc;
}

#define CRT_MIN_TCP_FD 131072

/** For some providers, we require a file descriptor for every connection,
 * and some platforms set the soft limit too low, so we run out. By default,
 * we raise the soft limit up to the configured hard limit to avoid this and
 * warn when that isn't possible.
 */
static void
file_limit_bump(void)
{
	int           rc;
	struct rlimit rlim;

	/* Bump file descriptor limit if low and if possible */
	rc = getrlimit(RLIMIT_NOFILE, &rlim);
	if (rc != 0) {
		DS_ERROR(errno, "getrlimit() failed. Unable to check file descriptor limit");
		/** Per the man page, this can only fail if rlim is invalid */
		D_ASSERT(0);
		return;
	}

	if (rlim.rlim_cur >= CRT_MIN_TCP_FD)
		return;

	if (rlim.rlim_max < CRT_MIN_TCP_FD) {
		if (getuid() != 0) {
			D_WARN("File descriptor hard limit should be at least %d, limit is %lu\n",
			       CRT_MIN_TCP_FD, rlim.rlim_max);
		} else {
			/** root should be able to change it */
			D_INFO("Super user attempting to update hard file descriptor limit to %d,"
			       " limit was %lu\n",
			       CRT_MIN_TCP_FD, rlim.rlim_max);
			rlim.rlim_max = CRT_MIN_TCP_FD;
		}

		if (rlim.rlim_cur >= rlim.rlim_max)
			return;

		/* May as well bump it as much as we can */
	}

	rlim.rlim_cur = rlim.rlim_max;
	rc            = setrlimit(RLIMIT_NOFILE, &rlim);
	if (rc != 0) {
		DS_ERROR(errno,
			 "setrlimit() failed. Unable to bump file descriptor"
			 " limit to value >= %d, limit is %lu",
			 CRT_MIN_TCP_FD, rlim.rlim_max);
		return;
	}
	D_INFO("Updated soft file descriptor limit to %lu\n", rlim.rlim_max);
}

static void
prov_settings_apply(bool primary, crt_provider_t prov, crt_init_options_t *opt)
{
@@ -510,6 +565,9 @@ prov_settings_apply(bool primary, crt_provider_t prov, crt_init_options_t *opt)
		d_setenv("FI_OFI_RXM_DEF_TCP_WAIT_OBJ", "pollfd", 0);
	}

	if (prov == CRT_PROV_OFI_TCP || prov == CRT_PROV_OFI_TCP_RXM)
		file_limit_bump();

	if (prov == CRT_PROV_OFI_CXI)
		mrc_enable = 1;
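
The decision logic in `file_limit_bump()`, which this hunk calls only for the two OFI TCP providers, can be restated as a pure function for reasoning and testing. The Python sketch below is illustrative and not part of the PR. The name `plan_fd_limit` is hypothetical, limits are assumed to be plain non-negative integers (no `RLIM_INFINITY` handling), and the logging and the `setrlimit()` call itself are left out.

```python
# Threshold copied from CRT_MIN_TCP_FD in the hunk above.
CRT_MIN_TCP_FD = 131072


def plan_fd_limit(soft, hard, is_root, minimum=CRT_MIN_TCP_FD):
    """Return the (soft, hard) pair the C code would pass to setrlimit(),
    or None when it would return without calling setrlimit().
    """
    # Soft limit is already high enough: nothing to do.
    if soft >= minimum:
        return None

    if hard < minimum:
        if is_root:
            # Root attempts to raise the hard limit to the minimum.
            hard = minimum
        # A non-root caller only gets a warning in the C code and keeps
        # the existing hard limit.
        if soft >= hard:
            # Soft limit is already at the hard limit: cannot go higher.
            return None

    # Raise the soft limit as far as the (possibly updated) hard limit.
    return (hard, hard)
```

With the CI settings from this PR, `plan_fd_limit(1024, 262144, False)` returns `(262144, 262144)`, so an unprivileged process raises only its soft limit.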

1 change: 1 addition & 0 deletions utils/ansible/ftest/templates/daos-launch_nlt.sh.j2
@@ -214,6 +214,7 @@ fi
info "Fixing daos_server_helper"
run sudo -E "$DAOS_SOURCE_DIR/utils/setup_daos_server_helper.sh"
run chmod -x "$DAOS_INSTALL_DIR/bin/daos_server_helper"
run sudo prlimit --nofile=1024:262144 --pid $$

info "Starting NLT tests"
pushd "$DAOS_SOURCE_DIR" > /dev/null
7 changes: 7 additions & 0 deletions utils/node_local_test.py
@@ -25,6 +25,7 @@
import pwd
import random
import re
import resource
import shutil
import signal
import stat
@@ -6524,6 +6525,12 @@ def main():
    parser.add_argument('mode', nargs='*')
    args = parser.parse_args()

    # valgrind reduces the hard limit unless we bump the soft limit first
    if args.memcheck != "no":
        (soft, hard) = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft < hard:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

    if args.server_fi:
        server_fi(args)
        return
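
The soft-limit bump added to `main()` can be tried on its own with a small standalone sketch. The function name `raise_soft_nofile_to_hard` is hypothetical. Unlike the PR code, this version skips an unlimited hard limit and tolerates a failed `setrlimit()`. Both are assumptions made for portability, not behavior taken from `node_local_test.py`.

```python
import resource


def raise_soft_nofile_to_hard():
    """Raise the soft RLIMIT_NOFILE to the hard limit where possible.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Assumption: leave an unlimited hard limit alone, because the integer
    # value of RLIM_INFINITY differs between platforms.
    if hard != resource.RLIM_INFINITY and soft < hard:
        try:
            # Only the soft limit changes; the hard limit is passed back as-is.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Assumption: keep the current limits if the platform refuses.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling it before launching a child process means the child inherits the raised soft limit, which is the effect the PR relies on when running under valgrind.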