[DO NOT MERGE] Log test results to file #6627

Status: Draft. tohtana wants to merge 66 commits into base branch master from tohtana/log_run_tests.

Commits (66)
16a8cb6  log test run (tohtana, Oct 15, 2024)
3ff9cea  enable logging in workflow (tohtana, Oct 15, 2024)
101bab7  run grep regardless of pytest return code (tohtana, Oct 15, 2024)
c04d6c1  fix return code from grep (tohtana, Oct 15, 2024)
d58b427  exclude skipped tests from failure logging (tohtana, Oct 15, 2024)
5434f53  fix handling return code (tohtana, Oct 15, 2024)
3d27593  Merge branch 'master' into tohtana/log_run_tests (tohtana, Oct 16, 2024)
75fe4ad  add logging in tests (tohtana, Oct 16, 2024)
18d2da1  Merge branch 'tohtana/log_run_tests' of github.com:microsoft/DeepSpee… (tohtana, Oct 16, 2024)
1e6b3e5  Merge branch 'master' into tohtana/log_run_tests (tohtana, Oct 16, 2024)
a1c766b  disable NCCL_SOCKET_IFNAME (tohtana, Oct 16, 2024)
56febde  fix args for test func (tohtana, Oct 16, 2024)
7fab557  pin torch version (tohtana, Oct 16, 2024)
b0091a9  Merge branch 'master' into tohtana/log_run_tests (tohtana, Oct 23, 2024)
409ed6d  unpin torch version (tohtana, Oct 23, 2024)
969b7f7  Merge branch 'master' into tohtana/log_run_tests (tohtana, Oct 24, 2024)
8c4cd1d  set file path for filestore (tohtana, Oct 24, 2024)
6a7b640  use /dev/shm for filestore (tohtana, Oct 24, 2024)
9d0216a  Merge branch 'tohtana/log_run_tests' of github.com:microsoft/DeepSpee… (tohtana, Oct 24, 2024)
7508150  add info to tag (tohtana, Oct 25, 2024)
e52ca96  shorten process group timeout (tohtana, Oct 25, 2024)
58cb5a9  set device (tohtana, Oct 25, 2024)
9e64183  Run on specialized runner (loadams, Oct 25, 2024)
3fad973  set blank to NCCL_SOCKET_IFNAME (tohtana, Oct 25, 2024)
2096a1a  Merge branch 'tohtana/log_run_tests' of github.com:microsoft/DeepSpee… (tohtana, Oct 25, 2024)
6669f93  Merge branch 'master' into tohtana/log_run_tests (loadams, Oct 28, 2024)
6bef245  pass error in test to parent process (tohtana, Oct 28, 2024)
95a6426  Merge branch 'tohtana/log_run_tests' of github.com:microsoft/DeepSpee… (tohtana, Oct 28, 2024)
b143903  set timeout of closing pool (tohtana, Oct 28, 2024)
4357a6e  recreate pool when test fails (tohtana, Oct 28, 2024)
07c18c8  add log outputs (tohtana, Oct 28, 2024)
b221b5f  fix flag (tohtana, Oct 28, 2024)
fafb2d9  handle nccl error (tohtana, Oct 29, 2024)
dcb3bbd  init pg exclusively (tohtana, Oct 29, 2024)
48561fa  fix lock (tohtana, Oct 29, 2024)
616eb4d  fix removal of lock file (tohtana, Oct 29, 2024)
fa4bcec  use O_EXCL for lock (tohtana, Oct 29, 2024)
acc77d9  simplify lock (tohtana, Oct 29, 2024)
c8612d8  add random wait (tohtana, Oct 29, 2024)
65111c1  increase retry count (tohtana, Oct 29, 2024)
44fb6fe  stop using init_process_group_exclusively (tohtana, Oct 29, 2024)
a1e4eee  catch nccl init error (tohtana, Oct 29, 2024)
a1c0123  change timeout (tohtana, Oct 29, 2024)
0afe7d1  enable reuse_dist_env (tohtana, Oct 29, 2024)
3649914  set reuse_dist_env=True as default (tohtana, Oct 29, 2024)
ecc93f9  do not reuse dist env for non-daemonic process (tohtana, Oct 29, 2024)
96d520f  fix device selection for reuse dist env (tohtana, Oct 29, 2024)
f7573d1  record pool cache at every test (tohtana, Oct 29, 2024)
91fc68a  fix teadown (tohtana, Oct 29, 2024)
54bb4e6  fix condition to clean process pool (tohtana, Oct 29, 2024)
46a4ac8  fix teardown (tohtana, Oct 29, 2024)
85fa337  add condition of cleaning (tohtana, Oct 29, 2024)
4dbfb51  add test (tohtana, Oct 30, 2024)
3d6b7ea  move call to set device (tohtana, Oct 30, 2024)
65ffac9  fix world size (tohtana, Oct 30, 2024)
61409dd  Merge branch 'master' into tohtana/log_run_tests (loadams, Oct 30, 2024)
c420d42  add cleaning of global state (tohtana, Oct 30, 2024)
35ccf6c  Merge branch 'tohtana/log_run_tests' of github.com:microsoft/DeepSpee… (tohtana, Oct 30, 2024)
bebb59c  Switch version back to run on non-debug runners (loadams, Oct 31, 2024)
3a880da  Merge branch 'master' into tohtana/log_run_tests (tohtana, Oct 31, 2024)
b6da93d  Merge branch 'master' into tohtana/log_run_tests (loadams, Nov 1, 2024)
89f03af  Fix after merge (loadams, Nov 1, 2024)
5f3b63f  Fix function signature from merge conflicts (loadams, Nov 1, 2024)
e6a6705  Add mpi4py (loadams, Nov 1, 2024)
5e59b82  Merge branch 'master' into tohtana/log_run_tests (loadams, Nov 4, 2024)
bb4c5b6  Merge branch 'master' into tohtana/log_run_tests (loadams, Nov 11, 2024)
.github/workflows/cpu-torch-latest.yml (3 changes: 2 additions & 1 deletion)

@@ -48,7 +48,8 @@ jobs:
 
       - name: Unit tests
        run: |
+          TEST_LOG_FILE="/tmp/test_log_cpu_${GITHUB_RUN_ID}.log"
           unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
           cd tests
-          HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.5"
+          RUNNING_TEST_LOG_FILE=${TEST_LOG_FILE} DS_UNITTEST_FILE_STORE_DIR=/dev/shm HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.5"
           HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -m 'sequential' unit/ --torch_ver="2.5"
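`DS_UNITTEST_FILE_STORE_DIR=/dev/shm` points the tests' rendezvous file store at tmpfs (see the "use /dev/shm for filestore" commit above). The diff doesn't show how DeepSpeed's harness consumes this variable, so the following is only a minimal sketch of the `torch.distributed.FileStore` mechanism the name suggests; the function name, backend choice, and file naming are assumptions:

```python
# Sketch only, not DeepSpeed's actual harness code: rendezvous through a
# torch.distributed.FileStore rooted at DS_UNITTEST_FILE_STORE_DIR.
import os
import torch.distributed as dist

def init_pg_via_file_store(rank: int, world_size: int, run_id: str):
    store_dir = os.environ.get("DS_UNITTEST_FILE_STORE_DIR", "/tmp")
    # A unique file per test run keeps concurrent runs from colliding.
    store = dist.FileStore(os.path.join(store_dir, f"ds_store_{run_id}"), world_size)
    dist.init_process_group(backend="gloo", store=store, rank=rank, world_size=world_size)
```

Backing the store with /dev/shm keeps the rendezvous file off disk and avoids the TCP-based init path, which is plausibly why the PR switched to it after the NCCL_SOCKET_IFNAME experiments in the commit history.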
.github/workflows/nv-torch-latest-v100.yml (29 changes: 25 additions & 4 deletions)

@@ -19,7 +19,7 @@ concurrency:
 
 jobs:
   unit-tests:
-    runs-on: [self-hosted, nvidia, cu121, v100]
+    runs-on: [self-hosted, nvidia, cu121, v100] # Modified to run on the test runner
 
     steps:
       - uses: actions/checkout@v4
@@ -44,7 +44,7 @@ jobs:
 
      - name: Install deepspeed
        run: |
-          pip install .[dev,1bit,autotuning]
+          pip install .[dev,1bit,1bit-mpi,autotuning]
          ds_report
 
      - name: Python environment
@@ -55,5 +55,26 @@ jobs:
        run: |
          unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
          cd tests
-          pytest $PYTEST_OPTS --forked -n 4 unit/ --torch_ver="2.5" --cuda_ver="12.1"
-          pytest $PYTEST_OPTS --forked -m 'sequential' unit/ --torch_ver="2.5" --cuda_ver="12.1"
+          TEST_LOG_FILE="/tmp/test_log_${GITHUB_RUN_ID}.log"
+          echo "Running tests and logging to ${TEST_LOG_FILE}"
+          # Don't abort on a failing pytest here, so we can inspect $? and grep the log for "Failed"
+          set +e
+          pytest -s unit/comm/test_dist.py::TestDistInferenceAllReduce
+          NCCL_SOCKET_IFNAME="" DS_UNITTEST_FILE_STORE_DIR=/dev/shm RUNNING_TEST_LOG_FILE=${TEST_LOG_FILE} pytest $PYTEST_OPTS --forked -n 4 unit/ --torch_ver="2.5" --cuda_ver="12.1"
+          PYTEST_EXIT_CODE=$?
+          if [ $PYTEST_EXIT_CODE -ne 0 ]; then
+            # We don't clean up the log file here, to aid debugging
+            echo "pytest failed with exit code $PYTEST_EXIT_CODE"
+            exit $PYTEST_EXIT_CODE
+          fi
+          grep "Failed" ${TEST_LOG_FILE}
+          rm -f ${TEST_LOG_FILE}
+          # Do the same as above for the sequential tests
+          DS_UNITTEST_FILE_STORE_DIR=/dev/shm RUNNING_TEST_LOG_FILE=${TEST_LOG_FILE} pytest $PYTEST_OPTS --forked -m 'sequential' unit/ --torch_ver="2.5" --cuda_ver="12.1"
+          PYTEST_EXIT_CODE=$?
+          grep "Failed" ${TEST_LOG_FILE}
+          if [ $PYTEST_EXIT_CODE -ne 0 ]; then
+            echo "pytest failed with exit code $PYTEST_EXIT_CODE"
+            exit $PYTEST_EXIT_CODE
+          fi
+          rm -f ${TEST_LOG_FILE}
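The `grep "Failed" ${TEST_LOG_FILE}` call surfaces any failure records the test hooks appended to the shared log. A standalone Python equivalent of that scan is sketched below; it assumes failure records are single lines containing the literal token "Failed", since the exact record format isn't shown in this diff:

```python
# Sketch of the "grep Failed" step. Note the semantics: grep exits nonzero when
# nothing matches, which is the pitfall the "fix return code from grep" commit
# above worked around; here a clean log is treated as success instead.
import sys

def report_failures(log_path: str) -> int:
    failures = [line.rstrip() for line in open(log_path) if "Failed" in line]
    for record in failures:
        print(record)
    return 1 if failures else 0  # nonzero only when failure records exist

if __name__ == "__main__":
    sys.exit(report_failures(sys.argv[1]))
```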
tests/conftest.py (36 changes: 35 additions & 1 deletion)

@@ -70,13 +70,47 @@ def pytest_runtest_call(item):
     item.runtest = lambda: True  # Dummy function so test is not run twice
 
 
+def write_to_log_with_lock(log_file_path: str, header: str, msg: str):
+    import fcntl
+    with open(log_file_path, 'a+') as f:
+        try:
+            fcntl.flock(f, fcntl.LOCK_EX)
+            f.write(f"{header} {msg}\n")
+            f.flush()
+        finally:
+            fcntl.flock(f, fcntl.LOCK_UN)
+
+
+dist_test_class = None
+
+
 # We allow DistributedTest to reuse distributed environments. When the last
 # test for a class is run, we want to make sure those distributed environments
 # are destroyed.
 def pytest_runtest_teardown(item, nextitem):
-    if getattr(item.cls, "reuse_dist_env", False) and not nextitem:
+    RUNNING_TEST_LOG_FILE = os.environ.get("RUNNING_TEST_LOG_FILE", "/tmp/running_test.log")
+
+    global dist_test_class
+    # The last test might not have .cls, so we record the pool_cache here
+    if item.cls is not None:
+        dist_test_class = item.cls()
+
+    def get_xdist_worker_id():
+        xdist_worker = os.environ.get('PYTEST_XDIST_WORKER', None)
+        if xdist_worker is not None:
+            xdist_worker_id = xdist_worker.replace('gw', '')
+            return int(xdist_worker_id)
+        return None
+
+    if RUNNING_TEST_LOG_FILE:
+        reuse_dist_env = getattr(item.cls, "reuse_dist_env", False)
+        write_to_log_with_lock(RUNNING_TEST_LOG_FILE, f"pytest_runtest_teardown,xdist={get_xdist_worker_id()}",
+                               f"reuse_dist_env={reuse_dist_env} nextitem={nextitem}")
+
+    if not nextitem and dist_test_class is not None and dist_test_class._pool_cache is not None:
+        for num_procs, pool in dist_test_class._pool_cache.items():
+            write_to_log_with_lock(RUNNING_TEST_LOG_FILE, f"pytest_runtest_teardown,xdist={get_xdist_worker_id()}",
+                                   f"closing pool num_procs={num_procs} nextitem={nextitem}")
+            dist_test_class._close_pool(pool, num_procs, force=True)
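With pytest-xdist, several worker processes append to the same RUNNING_TEST_LOG_FILE, so the exclusive `fcntl.flock` above keeps each record an intact line. The following is a self-contained demo of that locking pattern (demo code, not part of the PR) that hammers one file from multiple processes:

```python
# Standalone sketch of the locking pattern in write_to_log_with_lock: several
# processes append to one file under fcntl.flock, so records never interleave.
import fcntl
import os
from multiprocessing import Process

LOG = "/tmp/lock_demo.log"

def write_record(header: str, msg: str):
    with open(LOG, 'a+') as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX)  # block until we hold the exclusive lock
            f.write(f"{header} {msg}\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

if __name__ == "__main__":
    procs = [Process(target=write_record, args=(f"worker={i}", "teardown")) for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(open(LOG).read())  # 8 intact records, one per line
    os.remove(LOG)
```

The same reasoning explains the header string in the hook: tagging each record with the xdist worker id makes the interleaved teardown log attributable per worker when a pool hangs.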
tests/unit/comm/test_dist.py (16 changes: 4 additions & 12 deletions)

@@ -112,12 +112,7 @@ def test(self, distributed_fixture, class_tmpdir, val1, val2):
 
 class TestDistAllReduce(DistributedTest):
     device_count = get_accelerator().device_count()
-    if device_count >= 4:
-        world_size = [1, 2, 4]
-    elif device_count >= 2:
-        world_size = [1, 2]
-    else:
-        world_size = [1]
+    world_size = 2
 
     def test(self):
         x = torch.ones(1, 3).to(get_accelerator().device_name()) * (dist.get_rank() + 1)
@@ -130,20 +125,17 @@ def test(self):
 @pytest.mark.parametrize("dtype", [torch.float32, torch.bfloat16, torch.float16])
 class TestDistInferenceAllReduce(DistributedTest):
     device_count = get_accelerator().device_count()
-    if device_count >= 4:
-        world_size = [1, 2, 4]
-    elif device_count >= 2:
-        world_size = [1, 2]
-    else:
-        world_size = [1]
+    world_size = 2
 
     def test(self, dtype):
         x = torch.ones(1, 3).to(get_accelerator().device_name()) * (dist.get_rank() + 1)
         sum_of_ranks = (dist.get_world_size() * (dist.get_world_size() + 1)) // 2
         result = torch.ones(1, 3).to(get_accelerator().device_name()) * sum_of_ranks
         result = result.to(dtype)
         x = x.to(dtype)
+        print(f"Rank {dist.get_rank()} x: {x}")
         dist.inference_all_reduce(x)
+        print(f"AR Rank {dist.get_rank()} x: {x}")
         assert torch.all(x == result)
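The assertion encodes the all-reduce expectation directly: rank r contributes a tensor filled with r + 1, so summing over n ranks yields n(n+1)/2 in every element. A quick standalone check of that arithmetic for the pinned world size (and the larger size the deleted parametrization used to cover):

```python
# Sanity-check the expected all-reduce value used in the assertion above.
def expected_sum(world_size: int) -> int:
    # ranks contribute 1, 2, ..., world_size, so the sum is n(n+1)/2
    return world_size * (world_size + 1) // 2

assert expected_sum(2) == 1 + 2          # == 3, the pinned world_size in this PR
assert expected_sum(4) == 1 + 2 + 3 + 4  # == 10, covered by the old parametrization
```

Pinning `world_size = 2` trades that coverage for deterministic process counts, which is consistent with this PR's goal of making hangs and pool teardown reproducible enough to log.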