-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Build some infrastructure to locally run all nvFuser tests and parallalize across available devices. #3915
base: main
Are you sure you want to change the base?
Conversation
Review updated until commit 33e519a Description
Changes walkthrough 📝
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
Tagging a lot of reviewers to get input. This isn't an urgent PR but I'm finding this local test infrastructure particularly helpful. |
run_nvfuser_tests.py
Outdated
|
||
def main(): | ||
# Add argument parsing for dry run | ||
if len(sys.argv) > 1 and sys.argv[1] == "--dry-run": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider argparse to make it easier to add more arguments in the future.
run_nvfuser_tests.py
Outdated
try: | ||
# Run nvidia-smi to get GPU count | ||
result = subprocess.run( | ||
["nvidia-smi", "--query-gpu=gpu_name", "--format=csv,noheader"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider subprocess.check_output("nvidia-smi -L | wc -l", shell=True)
which is simpler.
run_nvfuser_tests.py
Outdated
except KeyboardInterrupt: | ||
# Kill any running processes | ||
for process in current_processes.values(): | ||
if process is not None: | ||
process.kill() | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except KeyboardInterrupt: | |
# Kill any running processes | |
for process in current_processes.values(): | |
if process is not None: | |
process.kill() | |
raise |
I'm not sure why this is needed given finally
has the same cleanup code.
run_nvfuser_tests.py
Outdated
multidevice_tests = [ | ||
test for test in all_tests if "multidevice" in os.path.basename(test).lower() | ||
] | ||
|
||
# Get non-multidevice tests | ||
single_device_tests = [ | ||
test | ||
for test in all_tests | ||
if "multidevice" not in os.path.basename(test).lower() | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multidevice_tests = [ | |
test for test in all_tests if "multidevice" in os.path.basename(test).lower() | |
] | |
# Get non-multidevice tests | |
single_device_tests = [ | |
test | |
for test in all_tests | |
if "multidevice" not in os.path.basename(test).lower() | |
] | |
multidevice_tests, singledevice_tests = [], [] | |
for test in all_tests: | |
(multidevice_tests, singledevice_tests)["multidevice" in os.path.basename(test).lower].append(test) |
run_nvfuser_tests.py
Outdated
other_tests = [ | ||
test | ||
for test in single_device_tests | ||
if os.path.basename(test) not in priority_tests | ||
] | ||
|
||
# Return multidevice tests separately, and ordered single device tests prioritizing long running tests first | ||
return multidevice_tests, priority_tests + other_tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other_tests = [ | |
test | |
for test in single_device_tests | |
if os.path.basename(test) not in priority_tests | |
] | |
# Return multidevice tests separately, and ordered single device tests prioritizing long running tests first | |
return multidevice_tests, priority_tests + other_tests | |
singledevice_tests.sort(key=lambda test: test in priority_tests, reverse=True) | |
return multidevice_tests, singledevice_tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this one. Should take/integrate.
run_nvfuser_tests.py
Outdated
|
||
# Separate multidevice tests | ||
multidevice_tests = [ | ||
test for test in all_tests if "multidevice" in os.path.basename(test).lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multidevice tests are scattered in multiple files, e.g., test_communication.py, test_dtensor.py, and test_transformer_engine.py. We can even mix singledevice and multidevice in one test file. Currently, pytest *.py
only runs single-device tests and mpirun *.py --only-mpi
only runs multi-device tests. This works without glob
ing.
Is that good enough for manual testing? If we have to glob
, I could put multidevice tests all in one folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need is a strong word, it seems a bit cleaner to me to separate them out into files to make it more obvious to developers how to run a test individually.
@wujingyue I appreciate the review; your python is much better than mine, thanks for the cleanup suggestions. Curious to get your take on if you think this utility would be helpful. |
I believe so! I often run tests on a machine with multiple GPUs, my workstation or dlcluster. This should speed up my local run significantly. The alternative would be to make the existing manual_ci.sh support parallelization. But for complicated logic like that Python is much easier to write and read. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually run the tests in a targeted way and just use !test
when ready to run them all once I don't know of any failing tests. But this could be useful on a multi-device machine. Just some stylistic comments.
run_nvfuser_tests.py
Outdated
return None | ||
|
||
|
||
def run_parallel_tests(log_dir, num_gpus, tests, run_func, dry_run=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use multiprocessing here? Something like
import multiprocessing
# ...
with multiprocessing.Pool(num_gpus) as pool:
pool.map(run_test, tests)
Then you could either try and map workers to GPUs or do something like share a Array of available GPUs. Each process would use the locking described there to take the first available, remove it, and return it to the array after execution.
Also, timeout can be given to subprocess.check_output, in which case you can catch TimeoutExpired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know much about multiprocessing.Pool
, if you want to take a stab at it I'd be happy with that. Does it use a worker queue or is it a fixed round robin mapping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The work queue is nice since our tests are very lopsided when it comes to runtimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does work-stealing with a Queue. I wouldn't say there's any real advantage now that you have already implemented this, but it is what I would usually reach for to launch concurrent tasks.
BTW, creating the summary is a job best done in Python, but for launching the tests I wonder how far one could get using GNU parallel... |
multidevice_tests, single_device_tests = [], [] | ||
for test in all_tests: | ||
(single_device_tests, multidevice_tests)[ | ||
"multidevice" in os.path.basename(test).lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a code comment here that this misses multi-GPU tests in files like test_communication.py?
No description provided.