Allow running multiple nemo run tasks in parallel with DockerExecutor #57

Open
Kipok opened this issue Sep 19, 2024 · 0 comments
Collaborator

Kipok commented Sep 19, 2024

By this I mean launching multiple isolated scripts at the same time, for example:

python script1.py &
python script2.py &
...
wait

Currently, when I try this, a launch fails because the container name nemo-run-0 is already in use. The error looks like this:

──── Entering Experiment llm-math-judge with id: llm-math-judge_1726789456 ────
[16:44:16] Launching task nemo-run for experiment llm-math-judge    experiment.py:601
[16:44:21] Error running task nemo-run: 409 Client Error for http+docker://localhost/v1.46/containers/create?name=nemo-run-0: Conflict ("Conflict. The container name "/nemo-run-0" is already in use by container "7591568f4b184e6134be9b92f4434c06242ca96d86654346854feb627028686a". You have to remove (or rename) that container to be able to reuse that name.")    experiment.py:622
Traceback (most recent call last):    experiment.py:623
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/client.py", line 275, in _raise_for_status
    response.raise_for_status()
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.46/containers/create?name=nemo-run-0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/experiment.py", line 616, in run
    job.launch(wait=wait, runner=self._runner)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/job.py", line 340, in launch
    handle, status = launch(
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/launcher.py", line 99, in launch
    app_handle = runner.run(
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/runner.py", line 87, in run
    handle = self.schedule(dryrun_info)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/runner.py", line 102, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/run/torchx_backend/schedulers/docker.py", line 109, in schedule
    req.run(client=client)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/core/execution/docker.py", line 328, in run
    container_details.append(container.run(client=client, id=self.id))
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/nemo_run/core/execution/docker.py", line 269, in run
    return client.containers.run(
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/models/containers.py", line 876, in run
    container = self.create(image=image, command=command,
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/models/containers.py", line 935, in create
    resp = self.client.api.create_container(**create_kwargs)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/container.py", line 440, in create_container
    return self.create_container_from_config(config, name, platform)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/container.py", line 457, in create_container_from_config
    return self._result(res, True)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/client.py", line 281, in _result
    self._raise_for_status(response)
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/api/client.py", line 277, in _raise_for_status
    raise create_api_error_from_http_exception(e) from e
  File "/home/igitman/anaconda3/envs/base-env/lib/python3.10/site-packages/docker/errors.py", line 39, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation) from e
docker.errors.APIError: 409 Client Error for http+docker://localhost/v1.46/containers/create?name=nemo-run-0: Conflict ("Conflict. The container name "/nemo-run-0" is already in use by container "7591568f4b184e6134be9b92f4434c06242ca96d86654346854feb627028686a". You have to remove (or rename) that container to be able to reuse that name.")
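
The 409 comes from Docker refusing to create a second container with the fixed name nemo-run-0. One possible direction, sketched here purely as an illustration and not as how nemo_run builds its names, would be to add a per-process unique part to the container name. The helper unique_container_name and its naming scheme are hypothetical; only the base name nemo-run and the trailing index 0 are taken from the error message.

```python
import os
import uuid


def unique_container_name(base: str, index: int = 0) -> str:
    """Build a container name that differs between concurrent launches.

    Hypothetical scheme: <base>-<pid>-<8 random hex chars>-<index>.
    The pid separates processes started at the same time; the random
    part separates repeated launches from within one process.
    """
    suffix = uuid.uuid4().hex[:8]
    return f"{base}-{os.getpid()}-{suffix}-{index}"


if __name__ == "__main__":
    # Two calls give two different names, so Docker would not see a
    # name conflict between them.
    print(unique_container_name("nemo-run"))
    print(unique_container_name("nemo-run"))
```

With names like these, script1.py and script2.py would each request a different container name, so neither create call would hit the conflict. Whether the fix belongs in the container name, the experiment id, or somewhere else is up to the maintainers.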
                                                                                                                                                                                                                  
