Workflow for AutoTP #4961

delock · 2024-01-16T10:05:27Z

This PR adds a new extensible workflow for automatic tensor parallelism (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/). The workflow aims to provide a way to validate AutoTP for LLMs.

delock · 2024-01-17T03:48:38Z

The specific error below occurs because the container was not created with the CAP_SYS_NICE capability. I'll check the additional flags I use for my container and post them here.

set_mempolicy: Operation not permitted
setting membind: Operation not permitted

delock · 2024-01-17T12:19:51Z

On my system, the Docker container needs to be started with the SYS_NICE capability, using the following flag:

  --cap-add SYS_NICE

I'm not sure how to turn this on for the DeepSpeed runner.

Without this capability, we have to remove the --bind_cores_to_rank flag, but this would significantly slow down the test. @mrwyattii what do you think? We could remove --bind_cores_to_rank to let the workflow run first, then work out how to enable the SYS_NICE capability. Would that work?
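As a rough illustration of how one could tell from inside a container whether SYS_NICE was granted, here is a minimal sketch that decodes the effective capability mask from /proc/self/status. The helper names `has_cap` and `read_cap_eff` are hypothetical and not part of DeepSpeed; the bit index 23 for CAP_SYS_NICE comes from the Linux capability headers.

```python
CAP_SYS_NICE = 23  # bit index of CAP_SYS_NICE in linux/capability.h


def has_cap(cap_eff_hex: str, cap_bit: int = CAP_SYS_NICE) -> bool:
    """Return True if the given capability bit is set in a CapEff hex mask."""
    return bool((int(cap_eff_hex, 16) >> cap_bit) & 1)


def read_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Read the effective capability mask of the current process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff not found in " + status_path)


if __name__ == "__main__":
    # Inside the container this reports whether --cap-add SYS_NICE took effect.
    print("CAP_SYS_NICE present:", has_cap(read_cap_eff()))
```

The two example masks in a quick check would be `00000000a80425fb` (bit 23 clear) and `00000000a88425fb` (bit 23 set).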

delock · 2024-01-19T09:53:55Z

The proper behavior for DeepSpeed's --bind_cores_to_rank is to bind memory to a NUMA node only if the system allows it. This makes DeepSpeed behave more gracefully in a Docker environment. The latest fix in DeepSpeed has been verified on my own runner, with and without the SYS_NICE capability.
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581455004/job/20649083143
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581918228/job/20650446510
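The idea of binding memory only when it is permitted can be sketched roughly as follows. This is not the actual DeepSpeed change; `membind_allowed` and `numactl_prefix` are hypothetical helpers, and the probe assumes that numactl exits with a non-zero status when membind is refused. The `-C` (physcpubind) and `-m` (membind) options are standard numactl flags.

```python
import shutil
import subprocess
from typing import List


def membind_allowed(numa_node: int = 0) -> bool:
    """Probe whether 'numactl -m' works here (assumed to fail without SYS_NICE)."""
    if shutil.which("numactl") is None:
        return False
    result = subprocess.run(
        ["numactl", "-m", str(numa_node), "true"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def numactl_prefix(core_list: str, numa_node: int, can_membind: bool) -> List[str]:
    """Build a numactl prefix: always pin cores, bind memory only if allowed."""
    cmd = ["numactl", "-C", core_list]
    if can_membind:
        cmd += ["-m", str(numa_node)]
    return cmd
```

With this shape, core binding is kept in both cases and only the memory binding is dropped when the container lacks the capability.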

delock · 2024-01-22T02:59:29Z

Hi @loadams, the blocking issue for this PR has been resolved. Can you help restart the workflow? Thanks!

delock · 2024-01-22T07:46:04Z

@tjruwase Thanks! The autotp workflow now passes. One thing I'm not sure about is whether the downloaded checkpoint will be preserved across runs. Downloading it is the most time-consuming part of this workflow. I will need some guidance (e.g. which directory on the runner is preserved?) or will have to observe another run to see whether the checkpoint persists.

tjruwase · 2024-01-22T17:17:47Z

@delock, it is great to see the CI now passing.

I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

delock · 2024-01-24T01:29:58Z

@mrwyattii @loadams it would be great if there is a link showing how persistence is done on this runner.

loadams · 2024-01-24T16:31:26Z

I know @mrwyattii and I still need to leave feedback on this PR, but there is an example of where things are on the blob storage here. I'm not sure that's the best example, but it is one that shows persisting a larger download/install.

delock · 2024-01-29T05:00:21Z

Thanks for the suggestion @loadams. Looking at the usage of '/blob' in the DeepSpeed workflows, I found that I need to use the default value of TRANSFORMERS_CACHE. Let me make the change and see if it persists.

delock · 2024-01-31T03:09:30Z

Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.

loadams · 2024-02-05T18:25:07Z

Apologies, I was out but it should be running now.

delock · 2024-02-06T06:10:36Z

Thanks! The failure in the workflow is likely due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in the cpu-inference workflow is likely caused by the same issue. An upcoming release of Intel Extension for PyTorch should fix it. I'll ping you when the new version is released.
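A quick way to catch this kind of mismatch before running the tests could look like the sketch below. It assumes that the two packages are compatible when their major.minor versions agree, which is an assumption drawn from this failure rather than a documented rule; `major_minor` and `versions_match` are hypothetical helper names.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.2.0+cpu'."""
    core = version.split("+")[0]  # drop a local build suffix like '+cpu'
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def versions_match(torch_version: str, ipex_version: str) -> bool:
    """Assumed rule: PyTorch and the extension must share major.minor."""
    return major_minor(torch_version) == major_minor(ipex_version)
```

In a workflow step, the installed versions could be read with `importlib.metadata.version("torch")` and `importlib.metadata.version("intel-extension-for-pytorch")` and passed to `versions_match`.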

delock · 2024-02-06T14:50:03Z

@loadams Intel Extension for PyTorch 2.2 was released today. Restarting the workflow should resolve the failure.
https://pypi.org/project/intel-extension-for-pytorch/

delock · 2024-02-28T03:04:24Z

@loadams The Falcon 7b model is not yet supported by DeepSpeed AutoTP, so I updated the workflow to test Baichuan 7b instead. Can you help restart the workflow? Thanks!

delock · 2024-03-13T14:44:40Z

Hi @loadams, the command line for the Baichuan model has been changed to fix the test error. The Baichuan model contains remote code, so trust_remote_code needs to be set to true. Can you help restart the workflow? Thanks!

delock · 2024-03-16T02:20:33Z

Hi @loadams @tjruwase, can you help start this workflow? Thanks!

delock · 2024-03-28T03:05:56Z

Hi @loadams, it looks like the environment issue has been fixed. Can you help restart the workflow? Thanks!

loadams · 2024-03-28T16:03:44Z

@delock - yes, apologies that took so long.

delock · 2024-04-01T06:50:43Z

@loadams I ran these two tests in my local environment, and they didn't take that long. Can you help run this workflow again to see whether the issue is reproducible? Thanks!

loadams · 2024-04-01T16:06:12Z

Re-running now

delock · 2024-04-08T03:32:24Z

Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and switched to stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!

loadams · 2024-04-08T15:46:44Z

Done

delock · 2024-04-09T13:03:44Z

Regarding the Baichuan model failure: I'm seeing it pass in my local environment with exactly the same arguments. In the failed workflow log I see a 'file not found' error when acquiring a lock. I suspect this is because HF_HOME has not been properly set. I will point HF_HOME to /blob/ and try again.

runner/_work/DeepSpeed/DeepSpeed/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 43, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left", trust_remote_code=trust_remote_code)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 829, in from_pretrained
    return tokenizer_class.from_pretrained(
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2047, in from_pretrained
    resolved_vocab_files[file_id] = cached_file(
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1451, in hf_hub_download
    with WeakFileLock(lock_path):
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/huggingface_hub/utils/_fixes.py", line 83, in WeakFileLock
    lock.acquire()
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/filelock/_api.py", line 262, in acquire
    self._acquire()
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/filelock/_unix.py", line 44, in _acquire
    os.fchmod(fd, self._context.mode)
FileNotFoundError: [Errno 2] No such file or directory

delock · 2024-04-09T13:08:42Z

@loadams HF_HOME has been pointed to /blob/hf_home. Can you help start the workflow to see whether the 'lock file not found' issue has been fixed? Thanks!
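To check whether a candidate cache location behaves well with lock files, a small probe along these lines might help. It mirrors the create-then-fchmod sequence that fails in the traceback, but it is only a sketch: `cache_dir_supports_locks` is a hypothetical helper, not part of huggingface_hub or filelock, and passing this probe does not guarantee the real download will succeed.

```python
import os
import tempfile


def cache_dir_supports_locks(cache_dir: str) -> bool:
    """Create, chmod and remove a scratch lock file under cache_dir (Unix only)."""
    try:
        os.makedirs(cache_dir, exist_ok=True)
        fd, path = tempfile.mkstemp(suffix=".lock", dir=cache_dir)
        try:
            os.fchmod(fd, 0o644)  # the call that raised FileNotFoundError in the log
        finally:
            os.close(fd)
            os.remove(path)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # HF_HOME falls back to the value used in this workflow if it is unset.
    hf_home = os.environ.get("HF_HOME", "/blob/hf_home")
    print(hf_home, "supports lock files:", cache_dir_supports_locks(hf_home))
```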

delock · 2024-04-10T06:51:46Z

Hi @loadams, after reading the error log I suspect the Baichuan model under TRANSFORMERS_CACHE is corrupted. I unset TRANSFORMERS_CACHE since we set HF_HOME for this model. I also added a peek at TRANSFORMERS_CACHE and HF_HOME in case a manual cleanup is needed. Can you help start the workflow? Thanks!
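The peek mentioned above could be done with a depth-limited directory listing such as the sketch below. `peek_tree` is a hypothetical helper for illustration, not the step actually added to the workflow; it lists entries up to `max_depth` levels below a root and returns an empty list when the directory does not exist.

```python
import os
from typing import List


def peek_tree(root: str, max_depth: int = 2) -> List[str]:
    """List entries up to max_depth levels below root, as paths relative to root."""
    entries: List[str] = []
    if not os.path.isdir(root):
        return entries
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # stop os.walk from descending further
            continue
        dirnames.sort()
        for name in dirnames + sorted(filenames):
            entries.append(os.path.normpath(os.path.join(rel, name)))
    return entries


if __name__ == "__main__":
    for var in ("TRANSFORMERS_CACHE", "HF_HOME"):
        path = os.environ.get(var)
        print(var, "=", path)
        for entry in peek_tree(path) if path else []:
            print("  ", entry)
```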

delock · 2024-04-11T02:46:28Z

@loadams @tjruwase the latest error in the Baichuan model AutoTP test is very weird. It complains about a lock file not found or an attribute not found, which I cannot reproduce locally. This probably indicates some corrupted state in the data downloaded by hf_hub.

Currently, AutoTP for the bloom and opt models is running consistently well. Can we merge this baseline first and then add new AutoTP model validation in a follow-up PR? It might take a while to debug this issue. I'll submit a commit to disable the Baichuan test first.

update trigger condition

8ff1d07

delock requested review from mrwyattii and loadams as code owners January 16, 2024 10:05

change data type to bfloat16

35ce89b

delock mentioned this pull request Jan 16, 2024

[TASK] Seperate AutoTP workflow #4894

Open

add probe of runner memory capacity

f11a434

loadams and others added 3 commits January 18, 2024 11:47

Merge branch 'master' into gma/add_autotp_workflow

0ae3bdb

Merge branch 'up-master' into gma/add_autotp_workflow

643161a

skip numa membind when this capability is not available

f8b5a53

delock requested review from tjruwase and awan-10 as code owners January 19, 2024 09:45

Merge branch 'master' into gma/add_autotp_workflow

8add759

Merge branch 'master' into gma/add_autotp_workflow

2c21865

remove TRANSFORMERS_CACHE_OVERRIDING

33caddd

Merge branch 'master' into gma/add_autotp_workflow

552aa5b

tjruwase and others added 2 commits February 6, 2024 06:50

Merge branch 'master' into gma/add_autotp_workflow

0961d5f

Use official DeepSpeedExamples

092d007

loadams and others added 4 commits February 28, 2024 07:12

Merge branch 'master' into gma/add_autotp_workflow

a07f3a7

Merge branch 'master' into gma/add_autotp_workflow

b366429

put oneCCL binding for PyTorch into CPU-inference workflow

4e26186

trust remote code and don't use meta tensor

42211a4

Merge branch 'master' into gma/add_autotp_workflow

68fb2ec

Merge branch 'master' into gma/add_autotp_workflow

8df7f87

Merge branch 'master' into gma/add_autotp_workflow

4d6107c

Merge branch 'master' into gma/add_autotp_workflow

934f9fd

delock added 2 commits April 7, 2024 23:24

remove unit test from cpu_inference workflow

44b8850

Use stock PyTorch to test AutoTP

147492d

Merge branch 'master' into gma/add_autotp_workflow

0de43f0

add env HF_HOME to point to /blob/hf_home

70be14a

loadams and others added 2 commits April 9, 2024 08:54

Merge branch 'master' into gma/add_autotp_workflow

de8de90

unset TRANSFORMERS_CACHE and peek tree structure

d99015f

Merge branch 'master' into gma/add_autotp_workflow

4ce4729

delock and others added 2 commits April 11, 2024 10:47

disable baichuan autotp test

6fcb337

Merge branch 'master' into gma/add_autotp_workflow

b679f38
