[BUG] Throughput vs Gym AsyncVectorEnv #1325

Closed
3 tasks done
ShaneFlandermeyer opened this issue Jun 27, 2023 · 32 comments · Fixed by #1519
Labels: bug (Something isn't working)

Comments


ShaneFlandermeyer commented Jun 27, 2023

Describe the bug

Hello, I'm performing experiments that use a relatively small number of parallel environments (8-16). Using the PongNoFrameskip-v4 environment with no wrappers, it seems that TorchRL is 4-5x slower than Gym's AsyncVectorEnv (2600 vs 11000 FPS) with a random policy. Given the throughput results in Table 2 of the paper, I would expect comparable performance. Am I setting up the environments incorrectly?

To Reproduce

This is a very simple adaptation of the script in examples/distributed/single_machine/generic.py. Although it's not shown here, I observe similar performance with ParallelEnv and a synchronous collector.

import time
from argparse import ArgumentParser

import torch
import tqdm

from torchrl.collectors.collectors import (
    MultiaSyncDataCollector,
    MultiSyncDataCollector,
    RandomPolicy,
)
from torchrl.envs import EnvCreator
from torchrl.envs.libs.gym import GymEnv
import gymnasium as gym

parser = ArgumentParser()
parser.add_argument(
    "--num_workers", default=8, type=int, help="Number of workers in each node."
)
parser.add_argument(
    "--total_frames",
    default=500_000,
    type=int,
    help="Total number of frames collected by the collector. Must be "
    "divisible by the product of nodes and workers.",
)
parser.add_argument(
    "--env",
    default="PongNoFrameskip-v4",
    help="Gym environment to be run.",
)
if __name__ == "__main__":
    args = parser.parse_args()
    num_workers = args.num_workers
    frames_per_batch = 10 * args.num_workers

    # Test asynchronous gym collector
    env = gym.vector.AsyncVectorEnv([lambda: gym.make(args.env) for _ in range(num_workers)])
    env.reset()
    global_step = 0
    start = time.time()
    for _ in range(args.total_frames//num_workers):
        global_step += num_workers
        env.step(env.action_space.sample())
        stop = time.time()
        if global_step % int(num_workers*1_000) == 0:
            print('FPS:', global_step / (stop - start))
    env.close()

    # Test multiprocess TorchRL collector
    device = 'cuda:0'
    make_env = EnvCreator(lambda: GymEnv(args.env, device=device))
    action_spec = make_env().action_spec
    collector = MultiaSyncDataCollector(
        [make_env] * num_workers,
        policy=RandomPolicy(action_spec),
        total_frames=args.total_frames,
        frames_per_batch=frames_per_batch,
        devices=device,
        storing_devices=device,
    )
    counter = 0
    for i, data in enumerate(collector):
        if i == 10:
            pbar = tqdm.tqdm(total=collector.total_frames)
            t0 = time.time()
        if i >= 10:
            counter += data.numel()
            pbar.update(data.numel())
            pbar.set_description(f"data shape: {data.shape}, data device: {data.device}")
    t1 = time.time()  # stop the clock before shutdown so teardown time is excluded
    collector.shutdown()
    print(f"time elapsed: {t1 - t0}s, rate: {counter / (t1 - t0)} fps")
    exit()

System info

TorchRL installed via pip (v0.1.1)

import torchrl, numpy, sys
print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)

None 1.22.0 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] linux

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
ShaneFlandermeyer added the "bug" label Jun 27, 2023
vmoens (Contributor) commented Jun 27, 2023

Thanks for reporting this! We appreciate the feedback.

The results of the paper were obtained using collectors, not parallel envs. I will be posting the code shortly for reproducibility.
See this discussion for more context.

Also, I see that you're using the latest stable version of the library (which is good!). You'll be happy to know that we've sped up a bunch of operations in tensordict and vectorized envs, so the nightly releases of tensordict and torchrl should give you better results. There are other optimizations we can make, so I'm confident we can accelerate things even more.

Executing a slightly modified version of the code above on my MacBook, TorchRL comes out about 2x slower than the Gym version. The overhead is mainly caused by tensordict operations.

However, using the config we had in the paper (notably 32 procs instead of 8, and more CUDA devices used for passing data between procs), I get the 8k FPS reported in the paper. With 4 parallel envs per collector and 8 collectors, I reach 16k FPS.

I will keep updating this post and the related threads as we optimize things further.
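The batch-size and worker-count effects discussed here can be illustrated with a toy amortization model (a sketch: `t_env` and `t_over` are made-up constants, not measurements from any of these runs):

```python
# Toy throughput model: each env step costs t_env seconds of simulation,
# and each collected batch pays a fixed overhead t_over (IPC, stacking
# tensors into the batch, etc.). Both constants are illustrative guesses.
def fps(n_envs, frames_per_batch, t_env=1e-3, t_over=0.05):
    # n_envs step in parallel, so a batch needs frames_per_batch / n_envs
    # sequential env steps, plus one fixed per-batch overhead.
    time_per_batch = (frames_per_batch / n_envs) * t_env + t_over
    return frames_per_batch / time_per_batch

for n, B in [(8, 80), (8, 2_000), (32, 2_000)]:
    print(f"{n} envs, {B} frames/batch -> {fps(n, B):.0f} FPS")
```

Under this model, larger batches amortize the fixed per-batch overhead and more workers shrink the per-batch simulation time, which is qualitatively consistent with the gap between the 8-proc runs in this thread and the 32-proc config used for the paper.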

skandermoalla (Contributor) commented

Hey!

I adapted the benchmark for a more straightforward comparison and ran it with all the relevant parallel collectors (specs, code, and output below).
This is what I got with 8 envs in parallel and 2000 vs 80 frames per batch:

Summary (mean FPS; 8 envs, 2000 vs 80 frames per batch):

| Collector | 2000 frames/batch (250/env) | 80 frames/batch (10/env) |
| --- | --- | --- |
| Gym AsyncVectorEnv | 11587.7 | 11317.1 |
| MultiSyncDataCollector (cpu) | 2866.5 | 979.1 |
| MultiSyncDataCollector (cuda:0) | 5108.9 | 3225.9 |
| MultiaSyncDataCollector (cpu) | 4349.7 | 1700.3 |
| MultiaSyncDataCollector (cuda:0) | 4745.5 | 5223.3 |
| SyncDataCollector (cpu) | 587.2 | 547.1 |
| SyncDataCollector (cuda:0) | 3171.9 | 2424.7 |

(SyncDataCollector is running a ParallelEnv.)

As @ShaneFlandermeyer observed, Gym's AsyncVectorEnv runs at ~11k FPS with both batch sizes.
For TorchRL, the highest FPS I got was ~5k, on CUDA: with MultiaSyncDataCollector at both batch sizes, and with MultiSyncDataCollector at the larger one.

In general there's a big difference between running large and small batches with the torchrl collectors, and some numbers are a bit unexpected:

  • Running with CUDA is always better than running with CPU, which is quite unexpected to me: the environments run on CPU and no image transformation is happening that could benefit from CUDA, so why would storing the tensors on CUDA be faster? @vmoens, is the TensorDict overhead much lower on CUDA?
  • Decreasing the batch size almost always decreases the FPS (up to 3x slower). In Gym the generated observations are thrown away immediately, whereas the collectors store them, initialize the storing TensorDict, etc., so the benchmark isn't entirely fair (the Gym part doesn't include preallocating space or moving tensors to CUDA). Still, this says a lot about the overhead incurred by the collector. Maybe something can be further improved here?
  • @vmoens, in the complete output with per-step FPS, MultiaSyncDataCollector's FPS varies wildly (especially on CUDA). Am I doing something wrong, or does that come from the async nature of the collector?
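On the last point, part of the volatility is likely a measurement artifact: the benchmark times the gap between consecutive batch deliveries, and an asynchronous collector can hand over several buffered batches back-to-back, then stall while workers catch up. A small simulation with hypothetical arrival gaps (the numbers below are invented, not measured) shows how per-batch FPS can swing over several orders of magnitude while the overall mean stays modest:

```python
# Hypothetical inter-arrival gaps (seconds) between batches from an async
# collector: bursts of nearly-instant deliveries alternating with stalls.
frames_per_batch = 2_000
gaps = [0.003, 0.9, 0.004, 0.8, 0.002, 1.1, 0.005, 0.7, 0.003, 1.0]

# Per-batch FPS, as the benchmark computes it: frames / gap since last batch.
per_batch_fps = [frames_per_batch / g for g in gaps]
# Overall FPS: total frames / total wall-clock time.
mean_fps = frames_per_batch * len(gaps) / sum(gaps)

print(f"per-batch FPS range: {min(per_batch_fps):.0f} .. {max(per_batch_fps):.0f}")
print(f"overall FPS: {mean_fps:.0f}")
```

This reproduces the pattern in the output above: individual readings in the hundreds of thousands next to readings under a thousand, with a mean of only a few thousand FPS.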

Specs:

print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
2023.7.18 1.25.0 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] linux
# torchrl at commit @886635e
# Hardware
GPU 0: Tesla V100-SXM2-32GB
2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz

❯ mamba list | grep torch
pytorch                   2.0.1           py3.10_cuda11.7_cudnn8.5.0_0    pytorch
pytorch-cuda              11.7                 h778d358_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchrl-nightly           2023.7.18                pypi_0    pypi
torchtriton               2.0.0                     py310    pytorch
torchvision               0.15.2              py310_cu117    pytorch

Code

import time
import warnings
from argparse import ArgumentParser

from torchrl.collectors.collectors import (
    MultiaSyncDataCollector,
    MultiSyncDataCollector,
    RandomPolicy,
    SyncDataCollector,
)
from torchrl.envs import EnvCreator, ParallelEnv
from torchrl.envs.libs.gym import GymEnv
import gymnasium as gym

warnings.filterwarnings("ignore", category=UserWarning)

parser = ArgumentParser()
parser.add_argument("--num_workers", default=8, type=int, help="Number of workers.")
parser.add_argument(
    "--frames_per_batch",
    default=10_000,
    type=int,
    help="Number of frames collected in a batch. Must be "
    "divisible by the number of workers.",
)
parser.add_argument(
    "--total_frames",
    default=100_000,
    type=int,
    help="Total number of frames collected by the collector. Must be "
    "divisible by the number of frames per batch.",
)
parser.add_argument(
    "--log_every",
    default=10_000,
    type=int,
    help="Number of frames between each log.",
)
parser.add_argument(
    "--env",
    default="PongNoFrameskip-v4",
    help="Gym environment to be run.",
)

if __name__ == "__main__":
    args = parser.parse_args()
    num_workers = args.num_workers
    frames_per_batch = args.frames_per_batch

    print(
        f"Running {num_workers} envs with {frames_per_batch} frames per batch"
        f" (i.e. {frames_per_batch / num_workers} frames per env)."
    )

    # Test asynchronous gym collector
    def test_gym():
        env = gym.vector.AsyncVectorEnv(
            [lambda: gym.make(args.env) for _ in range(num_workers)]
        )
        env.reset()
        global_step = 0
        times = []
        start = time.time()
        print("Timer started.")
        for _ in range(args.total_frames // num_workers):
            env.step(env.action_space.sample())
            global_step += num_workers
            if global_step % int(frames_per_batch) == 0:
                times.append(time.time() - start)
                fps = frames_per_batch / times[-1]
                if global_step % args.log_every == 0:
                    print(f"FPS Gym AsyncVectorEnv at step {global_step}:", fps)
                start = time.time()
        env.close()
        print("FPS Gym AsyncVectorEnv mean:", args.total_frames / sum(times))

    # Test multiprocess TorchRL collector
    def test_torch_rl(collector_class, device):
        make_env = EnvCreator(lambda: GymEnv(args.env, device=device))
        if collector_class in [MultiSyncDataCollector, MultiaSyncDataCollector]:
            mock_env = make_env()
            collector = collector_class(
                [make_env] * num_workers,
                policy=RandomPolicy(mock_env.action_spec),
                total_frames=args.total_frames,
                frames_per_batch=frames_per_batch,
                device=device,
                storing_device=device,
            )
        elif collector_class in [SyncDataCollector]:
            parallel_env = ParallelEnv(args.num_workers, make_env)
            collector = SyncDataCollector(
                parallel_env,
                policy=RandomPolicy(parallel_env.action_spec),
                total_frames=args.total_frames,
                frames_per_batch=frames_per_batch,
                device=device,
                storing_device=device,
            )
        global_step = 0
        times = []
        start = time.time()
        print("Timer started.")
        for i, data in enumerate(collector):
            global_step += data.numel()
            times.append(time.time() - start)
            fps = frames_per_batch / times[-1]
            if global_step % args.log_every == 0:
                print(
                    f"FPS TorchRL with {collector_class.__name__} on {device} at step {global_step}:",
                    fps,
                )
            start = time.time()
        collector.shutdown()
        print(
            "FPS TorchRL with",
            collector_class.__name__,
            "on",
            device,
            "mean:",
            args.total_frames / sum(times),
        )

    test_gym()
    for collector_class in [MultiSyncDataCollector, MultiaSyncDataCollector, SyncDataCollector]:
        for device in ["cpu", "cuda:0"]:
            test_torch_rl(collector_class, device)

    exit()

Output

❯ python benchmarks/test_torchrl_vs_gym.py --num_workers=8 --frames_per_batch=2_000                                             
                                                                                                                                            
Running 8 envs with 2000 frames per batch (i.e. 250.0 frames per env).                                                                      
...                                                                                                                
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                                                  
[Powered by Stella]                                                                                                                         
Timer started.                                                                                                                              
FPS Gym AsyncVectorEnv at step 10000: 11880.248747693306                                                                                    
FPS Gym AsyncVectorEnv at step 20000: 14411.756722164288                                                                                    
FPS Gym AsyncVectorEnv at step 30000: 11630.216962415272                                                                                    
FPS Gym AsyncVectorEnv at step 40000: 10516.125311837932                                                                                    
FPS Gym AsyncVectorEnv at step 50000: 9574.003693279737                                                                                     
FPS Gym AsyncVectorEnv at step 60000: 9945.684469115837                                                                                     
FPS Gym AsyncVectorEnv at step 70000: 11437.316106795084                                                                                    
FPS Gym AsyncVectorEnv at step 80000: 8848.28526795556                                                                                      
FPS Gym AsyncVectorEnv at step 90000: 11634.265615941727                                                                                    
FPS Gym AsyncVectorEnv at step 100000: 10506.484039768444                                                                                   
FPS Gym AsyncVectorEnv mean: 11587.749470459721                                                                                             
...                     
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                   
[Powered by Stella]                           
Timer started.                                
FPS TorchRL with MultiSyncDataCollector on cpu at step 10000: 2834.0370614368485                                                                                                           
FPS TorchRL with MultiSyncDataCollector on cpu at step 20000: 3115.0810538222336                                                                                                           
FPS TorchRL with MultiSyncDataCollector on cpu at step 30000: 2827.5250466248276                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cpu at step 40000: 3015.099926353219                                                                                                                         
FPS TorchRL with MultiSyncDataCollector on cpu at step 50000: 2788.602794517342                                                                                                                         
FPS TorchRL with MultiSyncDataCollector on cpu at step 60000: 3205.3090866295415                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cpu at step 70000: 2482.831288036687                                                                                                                                         
FPS TorchRL with MultiSyncDataCollector on cpu at step 80000: 2844.3913975942423                                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cpu at step 90000: 3150.4935152535104                                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cpu at step 100000: 3045.399594411799                                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cpu mean: 2866.4653702739765                                                                                                                                                                   
...                            
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                    
[Powered by Stella]                                                           
Timer started.                                                                
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 10000: 5257.937418281604                                                                          
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 20000: 5208.714091009286                                                                          
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 30000: 5117.738499344772                                                                                             
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 40000: 5137.022029172594                                                                                             
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 50000: 5189.274791404789                                                                                             
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 60000: 5110.6078855326305                                                                                            
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 70000: 3927.3263204022173                                                                                            
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 80000: 5181.009037697941                                                                                             
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 90000: 5198.836110327817                                                                                             
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 100000: 5182.798409438583                                                                                            
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 5108.929550860561                                                                                                      
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                          
[Powered by Stella]                                                                                 
Timer started.                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cpu at step 10000: 5337.133782600729                                                                                                                        
FPS TorchRL with MultiaSyncDataCollector on cpu at step 20000: 1227.4027505992447                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cpu at step 30000: 4303.450257146888                                                                                                                        
FPS TorchRL with MultiaSyncDataCollector on cpu at step 40000: 14703.854837126471                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cpu at step 50000: 1778.4022887743843                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cpu at step 60000: 6623.503838558663                                                                                                                        
FPS TorchRL with MultiaSyncDataCollector on cpu at step 70000: 4115.28234834704                                                                                                                         
FPS TorchRL with MultiaSyncDataCollector on cpu at step 80000: 4658.867646209418                                                                                                                        
FPS TorchRL with MultiaSyncDataCollector on cpu at step 90000: 3634.1222290093533                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cpu at step 100000: 6693.456638912273                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 4349.725160643495 
...                                                                                                                                                                                                                                                                                                                                                                
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                                                                                                                                                                                                                                                                                                                                                                                          
[Powered by Stella]                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
Timer started.                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 10000: 689399.0795529257                                                                                                                                                                                                                                                                                                                                                                                                 
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 20000: 18005.632254962566                                                                                                                                                                                                                                                                                                                                                                                                
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 30000: 264391.32627332327                                                                                                                                                                                                                                                                                                                                                                                                
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 40000: 130747.17498714132                                                                                                                                                                                                                                                                                                                                                                                                
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 50000: 776.6116486266598                                                                                                                                                                                                                                                                                                                                                                                                 
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 60000: 41273.37941892789                                                                                                                                                                                                                                                                                                                                                                                                 
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 70000: 49679.94646230745                                                                                                                                                                                                                                                                                                                                                                                                 
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 80000: 495107.59605736885                                                                                                                                                                                                                                                                                                                                                                                                
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 90000: 113989.59111847916                                                                                                                                                                                                                                                                                                                                                                                                
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 100000: 5840.644401694138                                                                                                                                                                                                                                                                                                                                                                                                
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 4745.486839045239                                                                                                                                                                
...                            
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS TorchRL with SyncDataCollector on cpu at step 10000: 689.0744194898969                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 20000: 548.245757072972                                                                                                                                                                 
FPS TorchRL with SyncDataCollector on cpu at step 30000: 622.7927557851568                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 40000: 565.5496544653071                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 50000: 549.4385496083208                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 60000: 589.2434754865648                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 70000: 630.9949199112759                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 80000: 624.1467153348757                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 90000: 497.86883899324306                                                                                                                                                               
FPS TorchRL with SyncDataCollector on cpu at step 100000: 504.7398666134446                                                                                                                                                               
FPS TorchRL with SyncDataCollector on cpu mean: 587.1912214549519                                                    
...                                                                    
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS TorchRL with SyncDataCollector on cuda:0 at step 10000: 3904.9437622718906                                                                                                                                                            
FPS TorchRL with SyncDataCollector on cuda:0 at step 20000: 3288.8724744697624                                                                                                                                                            
FPS TorchRL with SyncDataCollector on cuda:0 at step 30000: 3437.67936578839                                                                                                                                                              
FPS TorchRL with SyncDataCollector on cuda:0 at step 40000: 3400.23242154962                                                                                                                                                              
FPS TorchRL with SyncDataCollector on cuda:0 at step 50000: 3167.2741743898305                                                                                                                                                            
FPS TorchRL with SyncDataCollector on cuda:0 at step 60000: 3348.8633532834688                                                                                                                                                            
FPS TorchRL with SyncDataCollector on cuda:0 at step 70000: 3145.3018116045278                                                                                                                                                            
FPS TorchRL with SyncDataCollector on cuda:0 at step 80000: 3288.962738376645                                                                                                                                                             
FPS TorchRL with SyncDataCollector on cuda:0 at step 90000: 3215.286298039423                                                                                                                                                             
FPS TorchRL with SyncDataCollector on cuda:0 at step 100000: 1305.8973618519733                                                                                                                                                           
FPS TorchRL with SyncDataCollector on cuda:0 mean: 3171.8938244285127                                                                                      

------------------------------------------------------------------------------------------

❯ python benchmarks/test_torchrl_vs_gym.py --num_workers=8 --frames_per_batch=80                                                                                                                                              

Running 8 envs with 80 frames per batch (i.e. 10.0 frames per env).                                                  

...                                                                                                 
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS Gym AsyncVectorEnv at step 10000: 14207.745268238981                                                             
FPS Gym AsyncVectorEnv at step 20000: 13789.681502486335                                                             
FPS Gym AsyncVectorEnv at step 30000: 9946.76943143416                                                               
FPS Gym AsyncVectorEnv at step 40000: 8407.314274260229                                                              
FPS Gym AsyncVectorEnv at step 50000: 12323.95489771183                                                              
FPS Gym AsyncVectorEnv at step 60000: 8352.275601135063                                                              
FPS Gym AsyncVectorEnv at step 70000: 12466.351612423838                                                             
FPS Gym AsyncVectorEnv at step 80000: 10677.283777763636                                                             
FPS Gym AsyncVectorEnv at step 90000: 13954.267653663812                                                             
FPS Gym AsyncVectorEnv at step 100000: 7710.826362717162                                                             
FPS Gym AsyncVectorEnv mean: 11317.07864610709                                                                       
...                                                                     
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS TorchRL with MultiSyncDataCollector on cpu at step 10000: 1075.6177012710166                                                                                                                                                          
FPS TorchRL with MultiSyncDataCollector on cpu at step 20000: 1057.8420350759939                                                                                                                                                          
FPS TorchRL with MultiSyncDataCollector on cpu at step 30000: 943.7968531134151                                                                                                                                                           
FPS TorchRL with MultiSyncDataCollector on cpu at step 40000: 1059.3114595731113                                                                                                                                                          
FPS TorchRL with MultiSyncDataCollector on cpu at step 50000: 880.3610184077409                                                                                                                                                           
FPS TorchRL with MultiSyncDataCollector on cpu at step 60000: 1175.6983882270497                                                                                                                                                          
FPS TorchRL with MultiSyncDataCollector on cpu at step 70000: 1007.8040282808623                                                                                                                                                                                                                                                                                                               
FPS TorchRL with MultiSyncDataCollector on cpu at step 80000: 951.8447747645523                                                                                                                                                           
FPS TorchRL with MultiSyncDataCollector on cpu at step 90000: 1462.7104738904702                                                                                                                                                          
FPS TorchRL with MultiSyncDataCollector on cpu at step 100000: 1253.7666695313287                                                                                                                                                         
FPS TorchRL with MultiSyncDataCollector on cpu mean: 979.1281444006343                                                                                                                                                                    
...                                                                                       
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 10000: 3048.4352826811787                                                                                                                                                       
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 20000: 3262.14582928252                                                                                                                                                         
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 30000: 3217.5086060582817                                                                                                                                                       
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 40000: 3344.70669152022                                                                                                                                                         
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 50000: 3387.694047330587                                                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 60000: 3303.544515658997                                                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 70000: 3303.2518212246505                                                                                                                                                       
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 80000: 3550.6959714712016                                                                                                                                                       
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 90000: 3221.864689954487                                                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 100000: 2741.64395202144                                                                                                                                                        
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 3225.940279205892                                                                                                                                                                 
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cpu at step 10000: 5388.365878725591                                                                                                                                                          
FPS TorchRL with MultiaSyncDataCollector on cpu at step 20000: 27384.66661225822                                                                                                                                                          
FPS TorchRL with MultiaSyncDataCollector on cpu at step 30000: 2353.42530702708                                                                                                                                                           
FPS TorchRL with MultiaSyncDataCollector on cpu at step 40000: 657.7791369430935                                                                                                                                                          
FPS TorchRL with MultiaSyncDataCollector on cpu at step 50000: 43965.45073375262                                                                                                                                                          
FPS TorchRL with MultiaSyncDataCollector on cpu at step 60000: 2589.4961374914146                                                                                                                                                         
FPS TorchRL with MultiaSyncDataCollector on cpu at step 70000: 1340.4240054009356                                                                                                                                                         
FPS TorchRL with MultiaSyncDataCollector on cpu at step 80000: 47153.50196739742                                                                                                                                                          
FPS TorchRL with MultiaSyncDataCollector on cpu at step 90000: 40790.702650133724                                                                                                                                                         
FPS TorchRL with MultiaSyncDataCollector on cpu at step 100000: 9932.636315197442                                                                                                                                                         
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 1700.2908940649854                                                                                                                                                                  
...                       
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 10000: 27906.214238190285                                                                                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 20000: 5383.3518369966305                                                                                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 30000: 19511.793917543757                                                                                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 40000: 75863.51345240787                                                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 50000: 3048.1029768447443                                                                                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 60000: 73875.89608102158                                                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 70000: 13952.526924196432                                                                                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 80000: 3091.463160707211                                                                                                                                                       
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 90000: 124367.79836916234                                                                                                                                                      
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 100000: 2116.4113432233326                                                                                                                                                     
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 5223.255705987036                                                                                                                                                                
...                        
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)                                                           
[Powered by Stella]                                                                                                  
Timer started.                                                                                                       
FPS TorchRL with SyncDataCollector on cpu at step 10000: 533.9423450260014                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 20000: 618.3473233980839                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 30000: 633.7098955225008                                                                                                                                                                
FPS TorchRL with SyncDataCollector on cpu at step 40000: 629.6678307571342                     
FPS TorchRL with SyncDataCollector on cpu at step 50000: 591.178977598069
FPS TorchRL with SyncDataCollector on cpu at step 60000: 410.9202417924778
FPS TorchRL with SyncDataCollector on cpu at step 70000: 488.02613322008165
FPS TorchRL with SyncDataCollector on cpu at step 80000: 605.4527213805746
FPS TorchRL with SyncDataCollector on cpu at step 90000: 695.1189003863564
FPS TorchRL with SyncDataCollector on cpu at step 100000: 81.004056910206
FPS TorchRL with SyncDataCollector on cpu mean: 547.1234236961403
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cuda:0 at step 10000: 3152.338058867187
FPS TorchRL with SyncDataCollector on cuda:0 at step 20000: 2331.9178272593335
FPS TorchRL with SyncDataCollector on cuda:0 at step 30000: 3397.9515741931564
FPS TorchRL with SyncDataCollector on cuda:0 at step 40000: 1760.971529035136
FPS TorchRL with SyncDataCollector on cuda:0 at step 50000: 2330.816337871631
FPS TorchRL with SyncDataCollector on cuda:0 at step 60000: 2617.146244442711
FPS TorchRL with SyncDataCollector on cuda:0 at step 70000: 2050.628368881012
FPS TorchRL with SyncDataCollector on cuda:0 at step 80000: 2597.574781693194
FPS TorchRL with SyncDataCollector on cuda:0 at step 90000: 2078.3301228251644
FPS TorchRL with SyncDataCollector on cuda:0 at step 100000: 90.8393065937377
FPS TorchRL with SyncDataCollector on cuda:0 mean: 2424.713380369402

skandermoalla added a commit to skandermoalla/rl that referenced this issue Jul 19, 2023
@vmoens
Contributor

vmoens commented Aug 7, 2023

Thanks guys for looking into this.
Here are a few things about the current state of collection speed in torchrl:

  • As noted by @skandermoalla, the throughput with cuda is superior to that with cpu. The explanation is simply that it's faster to write a tensor from RAM to cuda than from RAM to shared storage (in physical memory). Simply put, it's better to use cuda whenever you can; you also benefit from the speedup of executing your model on device.
  • There is also some overhead caused by tensordict. We've been optimizing tensordict to the core so I'm not very optimistic regarding making the instantiation much faster but there are some tricks we can use. In a not so distant future we may look at a c++ backend for tensordict which would speed things up drastically, but the best we can do for now is to be patient regarding that.
  • When executing a rollout with a ParallelEnv on CartPole (which resets often), we spend as much time calling reset as we do calling step. In other words, we could make that env roughly 2x faster if we could automatically reset the env locally on the remote process, rather than gathering data on the main process, reading the done state, sending a reset signal and waiting for it to complete. This would be a new feature that is not immediate to code. I will have a look into this in the upcoming days, but it looks like it could speed things up drastically for both parallel envs and collectors.
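The reset-overhead argument can be made concrete with a toy cost model (pure Python, not TorchRL code; the costs and episode length below are made-up numbers, not measurements):

```python
# Hypothetical cost model: compare a collection loop that pays a
# main-process round-trip for every reset against one where each
# worker resets locally. Costs and episode length are made-up numbers.

STEP_COST = 1.0       # arbitrary units per env.step
ROUNDTRIP_COST = 1.0  # gather / read done / send reset / wait, per reset
EPISODE_LEN = 20      # hypothetical average episode length (CartPole-like)

def rollout_cost(num_steps, local_reset):
    cost, t = 0.0, 0
    for _ in range(num_steps):
        cost += STEP_COST
        t += 1
        if t == EPISODE_LEN:  # episode done, a reset is needed
            t = 0
            if not local_reset:
                cost += ROUNDTRIP_COST  # central reset pays the round-trip
    return cost

central = rollout_cost(10_000, local_reset=False)  # 10500.0
local = rollout_cost(10_000, local_reset=True)     # 10000.0
```

The shorter the episodes, the larger the relative saving of resetting locally.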

Work plan

Here is what I'm envisioning for this:

  • Currently, when encountering a "done" state, what happens is that the ("next", "done") is set to True. This is read as a signal that reset should be called: when called, the root "obs" is rewritten with a new value.
  • I first thought about this: when creating an env, you have the option of saying Env(..., auto_reset=True). If so, when an env encounters ("next", "done") == True (forgive me for the messy syntax), we deliver all the new data in the "next" key and update the root with the result of "reset". We must be careful when doing this because the root of the tensordict is then at episode e+1 and step 0 while the "next" nested tensordict is at step T and episode e. This is a bit annoying as we'll need to do some gymnastics to make sure that we still have data[..., 1:] == data["next"][..., :-1]. This would be a very heavy weight to carry in the code base in the long term.
  • My current preferred option (although it's gonna be a massive rework) is this: we change the signature of env.step to output 2 distinct tensordicts, and get a rollout that works like this:
def rollout_autoreset(self):
    result = []
    cur_data = env.reset()
    for i in range(T):
        _cur_data, next_data = env.step(cur_data)

        # cur_data and next_data are well synced
        result.append((cur_data, next_data))

        # step_mdp chooses between _cur_data and next_data based on the done state;
        # with envs that have a non-empty batch size, it can mix them together
        cur_data = step_mdp(_cur_data, next_data)

    result, next_result = [torch.stack(r) for r in zip(*result)]
    result.set("next", next_result)
    return result

Why is this better?
For ParallelEnv, this would mean that we can just call env.step on each sub-env. The _cur_data buffer is synced between processes only if needed, and we never call _reset, as the individual processes take care of that. Same for the collectors. Another plus is that we create fewer tensordicts during a rollout, which will further speed things up.
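A pure-Python sketch of the per-env selection that step_mdp would perform in this scheme (dicts stand in for tensordicts; all names are illustrative, not TorchRL's API):

```python
# Pure-Python sketch of the proposed selection step: for each sub-env,
# the next input is next_data when the env is still running, and
# _cur_data (the freshly reset state) when the env was done.
# Dicts stand in for tensordicts; this is not TorchRL's actual API.

def step_mdp_sketch(_cur_data, next_data, done):
    """Mix reset data and step data per sub-env based on the done flags."""
    return [
        reset_entry if d else step_entry
        for reset_entry, step_entry, d in zip(_cur_data, next_data, done)
    ]

# Batch of 3 sub-envs; env 1 just finished its episode.
_cur_data = [{"obs": "reset_0"}, {"obs": "reset_1"}, {"obs": "reset_2"}]
next_data = [{"obs": "step_0"}, {"obs": "step_1"}, {"obs": "step_2"}]
done = [False, True, False]

cur = step_mdp_sketch(_cur_data, next_data, done)
# -> [{'obs': 'step_0'}, {'obs': 'reset_1'}, {'obs': 'step_2'}]
```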

How do we get there?

This is going to be hugely bc-breaking, so it will have to go through prototyping + deprecation message (0.2.0, a bit later this year) -> deprecation + possibility of using the old feature (0.3.0, early 2024) -> total deprecation (0.4.0, somewhere in 2024). I expect the speedup to bring the envs closer to par with gym in terms of rollout, and to bring data collection using collectors to a superior speed than all other regular loops when executed on device (across sync and async).

I will open a PoC soon, hoping to get some feedback!

cc @smorad @matteobettini @shagunsodhani

@smorad
Contributor

smorad commented Aug 7, 2023

This sounds like a lot of work. Would it be easier to just wrap the gym env in the Autoreset wrapper?
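For context, the semantics of a gym-style autoreset wrapper can be sketched in a few lines (the toy env and class names here are illustrative, not Gym's actual implementation):

```python
# Toy sketch of gym-style autoreset semantics (not Gym's actual code):
# when the wrapped env reports done, the wrapper resets it immediately
# and returns the fresh observation, so the caller never calls reset.

class ToyEnv:
    """Counts up to 3 and terminates; stands in for a real env."""
    def reset(self):
        self.t = 0
        return self.t  # observation

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return self.t, 0.0, done  # obs, reward, done

class AutoResetSketch:
    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        if done:
            obs = self.env.reset()  # reset locally, return fresh obs
        return obs, reward, done

env = AutoResetSketch(ToyEnv())
obs = env.reset()
trace = [obs]
for _ in range(5):
    obs, _, done = env.step(None)
    trace.append(obs)
# trace == [0, 1, 2, 0, 1, 2]  (obs jumps back to 0 right after done)
```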

@matteobettini
Contributor

matteobettini commented Aug 14, 2023

It seems quite a complex solution.

  • The use of a tuple output in the step_and_reset function kind of goes against the tensordict principle of signatures, no? If I follow the tensordict philosophy, outputting one tensordict with a reset sub-tensordict would be more aligned with our tenets.
  • I do not understand why we need this new fuse_tensordict function. This is basically doing what step_mdp does. The simpler solution is then just to have step_mdp inside step_and_maybe_reset:
def step_and_maybe_reset(self, tensordict):

    tensordict = self._step(tensordict)  # the current step in main: tensordict will have both root and "next"
    _reset = tensordict.get(("next", self.done_key))

    next_td = step_mdp(tensordict)
    reset_td = next_td.clone()
    reset_td.set("_reset", _reset)
    self.reset(reset_td)

    return reset_td, next_td  # both contain data only from the next step, but one has been reset and the other has not

Alternatively, you could also return the data passed as input as a third output.
Although, as I suggested in my first point, I think the output td should be a single one with multiple keys.
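The step_mdp-inside-step_and_maybe_reset idea above can be sketched in plain Python (dicts stand in for tensordicts; the toy env and keys are illustrative, not TorchRL's API):

```python
# Pure-Python sketch of the single-function variant suggested above.
# Dicts stand in for tensordicts; ToyEnv and its keys are made up.

class ToyEnv:
    """Terminates after 2 steps; reset returns obs 0."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return {"obs": 0}

    def step(self, data):
        self.t += 1
        return {**data, "next": {"obs": self.t, "done": self.t >= 2}}

def step_and_maybe_reset(env, data):
    data = env.step(data)             # fills the "next" entry
    next_td = dict(data["next"])      # goes into the stored trajectory
    reset_td = dict(next_td)          # candidate root for the next step
    if reset_td["done"]:
        reset_td.update(env.reset())  # overwrite with freshly reset data
    return reset_td, next_td

env = ToyEnv()
cur = env.reset()
reset_td, next_td = step_and_maybe_reset(env, cur)       # step 1: not done
reset_td, next_td = step_and_maybe_reset(env, reset_td)  # step 2: done
# next_td keeps the terminal transition (obs == 2, done == True),
# while reset_td already holds the freshly reset obs (obs == 0).
```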

@vmoens vmoens linked a pull request Sep 12, 2023 that will close this issue
@skandermoalla
Contributor

Something really strange is happening with the current torchrl version compared to when I ran this benchmark 3 months ago.
Running any of the TorchRL collectors with 8 envs, I get 72 CPUs maxed at 100%, bottlenecking everything else.
This does not happen with the Gymnasium "vectorized" envs.

@vmoens any quick intuition? I tried running as last time with a docker setup on the same cluster, but also natively on a GCP machine, and the maxed-out CPU problem is still present. Can you reproduce this?
This does not happen on my M1 macOS.

Some logs of the new performance.

# python benchmark.py --num_workers=8 --frames_per_batch=2_000

FPS Gym AsyncVectorEnv mean: 5299.461355240903

FPS TorchRL with MultiSyncDataCollector on cpu mean: 1409.2643700762942
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 1434.0322984191869

FPS TorchRL with MultiaSyncDataCollector on cpu mean: 1539.6185885727893
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 1473.2058280519127

FPS TorchRL with SyncDataCollector on cpu mean: 369.65728004849865
FPS TorchRL with SyncDataCollector on cuda:0 mean: 463.8032136205446


/mloraw1/moalla/open-source/torchrl/rl/benchmarks main*
(implicit-pg) python skander.py --num_workers=8 --frames_per_batch=2_000
Running 8 envs with 2000 frames per batch (i.e. 250.0 frames per env).
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS Gym AsyncVectorEnv at step 10000: 4331.723599469573
FPS Gym AsyncVectorEnv at step 20000: 5204.412883276483
FPS Gym AsyncVectorEnv at step 30000: 4185.867384286969
FPS Gym AsyncVectorEnv at step 40000: 5084.152751110335
FPS Gym AsyncVectorEnv at step 50000: 4690.31147382411
FPS Gym AsyncVectorEnv at step 60000: 6040.902036892344
FPS Gym AsyncVectorEnv at step 70000: 6695.075310147556
FPS Gym AsyncVectorEnv at step 80000: 6469.561408176754
FPS Gym AsyncVectorEnv at step 90000: 7322.151306509504
FPS Gym AsyncVectorEnv at step 100000: 5166.639771866397
FPS Gym AsyncVectorEnv mean: 5299.461355240903
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cpu at step 10000: 1320.6028552739604
FPS TorchRL with MultiSyncDataCollector on cpu at step 20000: 1446.4029971546481
FPS TorchRL with MultiSyncDataCollector on cpu at step 30000: 1430.5873492922874
FPS TorchRL with MultiSyncDataCollector on cpu at step 40000: 1550.6232173654255
FPS TorchRL with MultiSyncDataCollector on cpu at step 50000: 1319.6976557084743
FPS TorchRL with MultiSyncDataCollector on cpu at step 60000: 1365.2251196885925
FPS TorchRL with MultiSyncDataCollector on cpu at step 70000: 1430.506355384124
FPS TorchRL with MultiSyncDataCollector on cpu at step 80000: 1391.8008827041454
FPS TorchRL with MultiSyncDataCollector on cpu at step 90000: 1450.1957484061622
FPS TorchRL with MultiSyncDataCollector on cpu at step 100000: 1506.2325100349867
FPS TorchRL with MultiSyncDataCollector on cpu mean: 1409.2643700762942
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 10000: 1511.1733695354542
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 20000: 1394.4858962330331
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 30000: 1495.3765765560202
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 40000: 1542.618458368687
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 50000: 1522.7536971932934
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 60000: 1519.255640425426
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 70000: 1313.5330130432453
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 80000: 1442.4469610732624
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 90000: 1369.7093138217786
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 100000: 1410.1916111111952
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 1434.0322984191869
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cpu at step 10000: 8251.768916064895
FPS TorchRL with MultiaSyncDataCollector on cpu at step 20000: 3297.1962499272845
FPS TorchRL with MultiaSyncDataCollector on cpu at step 30000: 2217.987399000709
FPS TorchRL with MultiaSyncDataCollector on cpu at step 40000: 8748.476584294975
FPS TorchRL with MultiaSyncDataCollector on cpu at step 50000: 586.1939663853168
FPS TorchRL with MultiaSyncDataCollector on cpu at step 60000: 2189.470475220034
FPS TorchRL with MultiaSyncDataCollector on cpu at step 70000: 953.2261035479872
FPS TorchRL with MultiaSyncDataCollector on cpu at step 80000: 3426.168919777176
FPS TorchRL with MultiaSyncDataCollector on cpu at step 90000: 4175.730351396986
FPS TorchRL with MultiaSyncDataCollector on cpu at step 100000: 2373.595130464702
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 1539.6185885727893
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 10000: 16983.39440811451
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 20000: 8616.284707100422
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 30000: 10179.842993249122
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 40000: 30913.891079549223
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 50000: 295.2667884207772
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 60000: 3726.762953649857
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 70000: 3346.7536887226006
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 80000: 3940.057969326326
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 90000: 4978.213290415082
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 100000: 14703.31361465317
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 1473.2058280519127
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cpu at step 10000: 280.5230834569255
FPS TorchRL with SyncDataCollector on cpu at step 20000: 291.6941486528101
FPS TorchRL with SyncDataCollector on cpu at step 30000: 487.1544962849032
FPS TorchRL with SyncDataCollector on cpu at step 40000: 356.98548206803537
FPS TorchRL with SyncDataCollector on cpu at step 50000: 494.37040263102915
FPS TorchRL with SyncDataCollector on cpu at step 60000: 483.80378322362986
FPS TorchRL with SyncDataCollector on cpu at step 70000: 555.6786701032164
FPS TorchRL with SyncDataCollector on cpu at step 80000: 475.9004904517329
FPS TorchRL with SyncDataCollector on cpu at step 90000: 352.64654916557294
FPS TorchRL with SyncDataCollector on cpu at step 100000: 289.18409560006296
FPS TorchRL with SyncDataCollector on cpu mean: 369.65728004849865
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cuda:0 at step 10000: 356.0708200186096
FPS TorchRL with SyncDataCollector on cuda:0 at step 20000: 416.6240826727289
FPS TorchRL with SyncDataCollector on cuda:0 at step 30000: 411.62237630608746
FPS TorchRL with SyncDataCollector on cuda:0 at step 40000: 448.2072555049812
FPS TorchRL with SyncDataCollector on cuda:0 at step 50000: 451.1781603430556
FPS TorchRL with SyncDataCollector on cuda:0 at step 60000: 505.79521035558116
FPS TorchRL with SyncDataCollector on cuda:0 at step 70000: 529.77273714769
FPS TorchRL with SyncDataCollector on cuda:0 at step 80000: 569.5611329452704
FPS TorchRL with SyncDataCollector on cuda:0 at step 90000: 432.81666192415577
FPS TorchRL with SyncDataCollector on cuda:0 at step 100000: 373.8837006455133
FPS TorchRL with SyncDataCollector on cuda:0 mean: 463.8032136205446

@vmoens
Copy link
Contributor

vmoens commented Sep 20, 2023

Weird, let me look into it.
Does it happen with both collectors and parallel envs?

@skandermoalla
Copy link
Contributor

Yes, the SyncDataCollector is using ParallelEnv inside.

@skandermoalla
Copy link
Contributor

I'm trying to revert to the previous version to see if it's my environment that changed or something in TorchRL.

@vmoens
Copy link
Contributor

vmoens commented Sep 20, 2023

Could be due to #1532 or something like that...

@vmoens
Copy link
Contributor

vmoens commented Sep 20, 2023

Oh have you tried torch.set_num_threads(1)?
Collection should be faster and you should also reduce the load on your CPUs.

If that solves your problem, I can write a set_num_threads decorator to be used with torchrl (bc I don't think that changing that in the main process every time you load torchrl is very wise :) )
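For what it's worth, a minimal stdlib-only sketch of the thread-capping idea (the exact set of variables depends on which BLAS/OpenMP backends your build links against, and they must be set before numpy/torch are imported):

```python
import os

# Cap every common BLAS/OpenMP thread pool to one thread per process.
# torch.set_num_threads(1), called after importing torch, does the same
# for torch's own intra-op pool.
for var in (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
):
    os.environ[var] = "1"

print(os.environ["OMP_NUM_THREADS"])  # -> 1
```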

@skandermoalla
Copy link
Contributor

I think you saved me hours/days of debugging and research time 🥹🥹

@skandermoalla
Copy link
Contributor

torch.set_num_threads(1) solved the issue! I hope #1559 will get a better tradeoff!

@vmoens
Copy link
Contributor

vmoens commented Sep 21, 2023

Cool! So that corroborates what I thought:

  • By default, torch takes num_threads = num_cpus
  • This is (arguably) ok on one worker, but if you add another it's trickier: both "see" the same number of cpus, and both allocate a number of threads equal to the number of cpus.
  • That quickly scales up and you end up in the situation you were referring to. It also explains why performance does not plateau when you add workers but actually decreases!
  • Note that, as pointed out by @matteobettini, this number of threads only applies to torch operations (reading and writing tensors, etc.), so it should not impact what happens anywhere else
  • [Feature] Threaded collection and parallel envs #1559 will hopefully cope with that at the cost of requiring users to tell how many threads they want at each level if it does not match the default. For instance, if you need 8 workers for the collector and 4 workers for something else, you need to set the number of threads to 12 and do so when you create the collector. Not sure it covers all use cases but at least it solves a big issue in torchrl!
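The arithmetic in the bullets above can be sketched as follows (worker and CPU counts are illustrative, not measured):

```python
import os

num_cpus = os.cpu_count() or 8  # what each worker process "sees"
num_workers = 8

# Default: every worker sizes its intra-op pool to num_cpus, so the total
# thread count grows multiplicatively with the number of workers.
threads_oversubscribed = num_workers * num_cpus

# With torch.set_num_threads(1) in each worker: one thread per worker.
threads_capped = num_workers * 1

print(threads_oversubscribed, threads_capped)
```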

@skandermoalla
Copy link
Contributor

skandermoalla commented Sep 21, 2023

Btw, I reran the benchmark with 3-month-old TorchRL/Tensordict versions and got the same issue when not specifying the number of threads. I think it's probably a change in the cluster I'm using.
With the num threads specified, I get a 2x improvement on the current version of TorchRL/Tensordict. A good reminder of the improvements made since then!

@vmoens
Copy link
Contributor

vmoens commented Oct 4, 2023

Hi guys, #1602 contains a simple script we used for the paper to benchmark against gym async collector. Feel free to run it and give us feedback.
I merged it quickly to have it available to everyone, but it needs a bit of polishing!
We have quite good numbers on CPU when the env has few resets. If resets are frequent (e.g., CartPole) we are still underperforming a bit. This is something we'll be tackling in v0.3!
Performance on CUDA has deteriorated recently and I can't really figure out why at the moment, stay tuned for more info!

@duburcqa
Copy link
Contributor

duburcqa commented Oct 8, 2023

Hi,

I was trying to find a way to maximize the throughput in TorchRL. After spending a lot of time trying many different configurations, for CPU only on a laptop with 8 physical cores (Intel i9-11900H), I had the best result using MultiSyncDataCollector with as many collectors as workers, each of them managing a single env. Here is the pseudo-code:

import os

# Cap BLAS/OpenMP thread pools before importing numerical libraries
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['VECLIB_MAXIMUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'

import gymnasium as gym
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

from torchrl.collectors.collectors import MultiSyncDataCollector, RandomPolicy
from torchrl.envs import EnvCreator
from torchrl.envs.libs.gym import GymWrapper

N_WORKERS = 8  # one collector process per physical core

gym_make = lambda: gym.make(...)

train_env_creator = EnvCreator(lambda: GymWrapper(gym_make(), device="cpu"))

# Instantiate one env in the main process to read the action spec
dummy_env = train_env_creator()

collector = MultiSyncDataCollector(
    N_WORKERS * (train_env_creator,),
    RandomPolicy(dummy_env.action_spec),
    frames_per_batch=80000,
    total_frames=40000000,
    preemptive_threshold=0.8,
    device="cpu",
    storing_device="cpu",
    num_threads=N_WORKERS,
    num_sub_threads=1,
)

It is running at about 50% of the maximum theoretical speed (8 completely independent processes collecting samples on their own). I did some profiling with py-spy top (using a benchmark similar to the one proposed in this thread); it shows where about 20% of the total computation time is going:

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                  
 51.00%  75.00%   220.0s    302.2s   step (jiminy_py/simulator.py)
  8.00%  10.00%   18.86s    20.93s   compute_command (common/blocks/proportional_derivative_controller.py)
  1.00%   1.00%   13.40s    13.41s   refresh_observation (common/blocks/mahony_filter.py)
  4.00%   5.00%   12.70s    31.29s   refresh_observation (common/bases/pipeline_bases.py)
  5.00%  17.00%   11.73s    45.57s   compute_command (common/bases/pipeline_bases.py)
  3.00%   3.00%   11.15s    11.20s   encode (torchrl/data/tensor_specs.py)
  2.00%   2.00%   10.35s    12.31s   inner_terminated_or_truncated (torchrl/envs/utils.py)
  2.00%   7.00%    9.42s    40.71s   _observer_handle (common/bases/generic_bases.py)
  2.00%   2.00%    8.82s     9.33s   compute_command (common/blocks/motor_safety_limit.py)
  2.00%  78.00%    6.87s    321.8s   step (common/envs/env_generic.py)
  1.00%   1.00%    5.99s     5.99s   __new__ (tensordict/tensordict.py)
  3.00%   6.00%    4.66s    13.74s   __init__ (tensordict/tensordict.py)
  2.00%   2.00%    4.33s     4.35s   _convert_to_tensor (tensordict/tensordict.py)
  0.00%   0.00%    3.87s     3.87s   refresh_observation (common/envs/env_generic.py)
  0.00%   0.00%    3.54s     3.54s   _reduce (common/utils/spaces.py)
  0.00%   0.00%    3.43s     3.43s   _forward (common/utils/spaces.py)
  1.00%   3.00%    3.23s     8.48s   _validate_value (tensordict/tensordict.py)
  0.00%   1.00%    2.86s     7.94s   _set (torchrl/envs/utils.py)
  0.00%   0.00%    2.71s     2.73s   __contains__ (tensordict/utils.py)
  0.00%  92.00%    2.67s    374.1s   _step (torchrl/envs/gym_like.py)
  0.00%   4.00%    2.63s    11.59s   _set_str (tensordict/tensordict.py)
  1.00%   1.00%    2.56s     4.44s   get (tensordict/tensordict.py)
  1.00%   5.00%    2.51s    10.87s   to (tensordict/tensordict.py)
  2.00%  80.00%    2.32s    325.8s   step (common/bases/pipeline_bases.py)
  0.00%   0.00%    2.29s     3.63s   __iter__ (torchrl/data/tensor_specs.py)
  1.00%   1.00%    2.09s     2.09s   to_numpy (torchrl/data/tensor_specs.py)
  1.00%   1.00%    2.07s     2.07s   _numba_unpickle (numba/core/serialize.py)
  0.00%   0.00%    2.02s     2.02s   zero (torchrl/data/tensor_specs.py)
  1.00%  24.00%    1.88s    82.43s   _controller_handle (common/bases/generic_bases.py)
  0.00%   5.00%    1.82s     8.99s   apply (tensordict/tensordict.py)
  0.00%   0.00%    1.69s     1.69s   items (torchrl/data/tensor_specs.py)
  0.00%   0.00%    1.63s     1.63s   __getitem__ (torchrl/data/tensor_specs.py)
  0.00%   0.00%    1.49s     1.55s   _get_str (tensordict/tensordict.py)
  0.00% 100.00%    1.48s    433.9s   rollout (torchrl/collectors/collectors.py)
  1.00%   3.00%    1.45s    11.10s   set (tensordict/tensordict.py)
  2.00%   2.00%    1.42s     6.45s   clone (tensordict/tensordict.py)
  0.00%   2.00%    1.41s    12.43s   step_mdp (torchrl/envs/utils.py)
  0.00%   2.00%    1.29s     5.25s   _step_proc_data (torchrl/envs/common.py)
  0.00%   0.00%    1.26s     1.28s   is_shared (tensordict/tensordict.py)
  0.00%  94.00%    1.22s    382.1s   step (torchrl/envs/common.py)
  0.00%   1.00%    1.20s     2.25s   _complete_done (torchrl/envs/common.py)
  0.00%   0.00%    1.18s     5.69s   reward_keys (torchrl/envs/common.py)
  0.00%   2.00%    1.18s    18.28s   terminated_or_truncated (torchrl/envs/utils.py)
  0.00%   0.00%    1.15s     1.32s   start (jiminy_py/simulator.py)
  0.00%   6.00%    1.14s    44.11s   _step_and_maybe_reset (torchrl/collectors/collectors.py)

Is there anything I could do to spend less time on processing in TorchRL? Especially on these bottlenecks:

  3.00%   3.00%   11.15s    11.20s   encode (torchrl/data/tensor_specs.py)
  2.00%   2.00%   10.35s    12.31s   inner_terminated_or_truncated (torchrl/envs/utils.py)
  1.00%   1.00%    5.99s     5.99s   __new__ (tensordict/tensordict.py)
  3.00%   6.00%    4.66s    13.74s   __init__ (tensordict/tensordict.py)
  2.00%   2.00%    4.33s     4.35s   _convert_to_tensor (tensordict/tensordict.py)

Since I have implemented the gym environment from scratch, I can make whatever modifications are necessary, for instance returning TensorDict objects directly if this would help. I could also implement a dedicated interface with TorchRL to avoid copies and such whenever possible. Typically, my env only allocates memory at init and then only sets values later on.

Still, about 30% of the computation time is spent somewhere else, but I still need to understand where. Actually, it seems to be stuck waiting for something about half of the time, and it is not clear to me why. It seems to be related to the use of multiprocessing.Pipe for sending/receiving messages between the main thread and subprocesses. So it is probably due to the time required to send the observations and such back to the main process:
[py-spy profile screenshot]

For completeness, here are the complete profiling with 'file_descriptor' and 'file_system' multiprocessing sharing strategies respectively:
file_descriptor
file_system

Still, it is better than plain gym.vector.AsyncVectorEnv, for which I'm seeing a 65% slow-down w.r.t. the maximum theoretical throughput. So maybe it is just due to inter-process communication and there is not much we can do about it. By the way, I'm getting similar results with RayCollector.

@vmoens
Copy link
Contributor

vmoens commented Oct 10, 2023

It's a rather long and thorough thread (thanks for it!) so I'll answer piece by piece.

It is running at about 50% of the maximum theoretical speed (8 completely independent processes collecting samples on their own).

That doesn't surprise me (unfortunately).
The "bet" that TorchRL is implicitly making is that truly vectorized environments (Isaac, VMAS, Brax etc) will eventually be the real thing and overhead caused by the various methods you're pointing at here will become less relevant in the grand scheme of things. Besides, one day we could get torch.compile to reduce this even further.

Since I have implemented the gym environment from scratch, I can make whatever modifications are necessary, for instance returning TensorDict objects directly if this would help. I could also implement a dedicated interface with TorchRL to avoid copies and such whenever possible. Typically, my env only allocates memory at init and then only sets values later on.

In this case I think the best option could be to simply code your environment from EnvBase. It could be simpler.
The interface with gym can be complex to handle. Here are some anecdotal examples:

  • (decode) During rendering, some np.ndarray objects have a negative stride and must be copied before being transformed into a tensor. Hence we have to check the stride of each array to make sure this does not happen.
  • (encode) Some actions are np.ndarray, some are simple integers... We have to do a bunch of checks to cover all use cases when we see an action coming, there's no way to tell what it's gonna be in advance.
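The negative-stride case from the first bullet can be reproduced with plain numpy; torch.from_numpy rejects arrays with negative strides, which is why a wrapper has to check and copy:

```python
import numpy as np

frame = np.arange(12, dtype=np.uint8).reshape(3, 4)
flipped = frame[::-1]  # a vertically flipped *view*: no copy is made

# The view walks the underlying buffer backwards along axis 0, so its
# first stride is negative and it cannot be wrapped as a tensor as-is.
print(flipped.strides[0] < 0)  # -> True

# Copying restores positive, contiguous strides, making it safe to wrap.
contiguous = np.ascontiguousarray(flipped)
print(contiguous.strides[0] > 0)  # -> True
```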

There's a lot more there we need to account for and every time someone comes with an env-specific bug we need a patch that eats up a bit of compute time...

Something odd (IMO) is that many users, even close collaborators, only see our envs as "wrappers" when "wrappers" are only a fraction of what envs are. I'm super biased, so take this with a massive pinch of salt, but in some way I think EnvBase could serve as a base for pytorch-based envs like gym.Env historically did for numpy-based envs.

Regarding inner_terminated_or_truncated, we can speed it up too for sure.

Still, about 30% of the computation time is spent somewhere else, but I still need to understand where. Actually, it seems to be stuck waiting for something about half of the time, and it is not clear to me why. It seems to be related to the use of multiprocessing.Pipe for sending/receiving messages between the main thread and subprocesses. So it is probably due to the time required to send the observations and such back to the main process:

That's communication overhead I think. Simply writing and reading tensors... We made good progress speeding this up (eg. using mp.Event in parallel processes) but there's always more to do!

If you don't need sync data (eg, off-policy) MultiaSync is usually faster than MultiSync.

Back in the day, sharing tensors via CUDA devices was way faster than via CPU shared memory, but for some reason the balance has now shifted to cpu. I have no idea why!

Is there any way you could share the script and the method you're using for benchmarking, for reproducibility?

Thanks!

@duburcqa
Copy link
Contributor

duburcqa commented Jan 6, 2024

Here is a complete reproduction script. It requires installing gym_jiminy[all] via pip first:

import os

os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['VECLIB_MAXIMUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'

from functools import reduce, partial

import numpy as np
from tqdm import tqdm

from gym_jiminy.envs import AtlasPDControlJiminyEnv
from gym_jiminy.common.wrappers import (FilterObservation,
                                        NormalizeAction,
                                        NormalizeObservation,
                                        FlattenAction,
                                        FlattenObservation)

from torchrl.collectors import MultiSyncDataCollector
from torchrl.envs.libs.gym import GymWrapper
from torchrl.envs import EnvCreator

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')


N_ENVS = 16
N_WORKERS = 16


class ZeroPolicy:
    def __init__(self, action_spec, action_key = "action"):
        self.action_spec = action_spec
        self.action_key = action_key

    def __call__(self, td):
        return td.set(self.action_key, self.action_spec.zero())

if __name__ == '__main__':
    # Fix weird issue with multiprocessing
    __spec__ = None

    # Define the learning environment
    gym_make = lambda: reduce(
        lambda env, wrapper: wrapper(env),
        (
            partial(FlattenObservation, dtype=np.float32),
            NormalizeObservation,
            partial(FlattenAction, dtype=np.float32),
            NormalizeAction,
        ),
        FilterObservation(
            AtlasPDControlJiminyEnv(),
            nested_filter_keys=(
                ('states', 'pd_controller'),
                ('measurements', 'EncoderSensor'),
                ('features', 'mahony_filter'),
            )
        )
    )
    env_creator = EnvCreator(
        lambda: GymWrapper(gym_make(), device="cpu"))

    # Instantiate a dummy environment
    dummy_env = env_creator()

    # Instantiate and configure the data collector
    collector = MultiSyncDataCollector(
        N_WORKERS * (env_creator,),
        ZeroPolicy(dummy_env.action_spec),
        frames_per_batch=80000,
        total_frames=4000000,
        # preemptive_threshold=0.8,
        device="cpu",
        storing_device="cpu",
        # num_threads=1,
        num_sub_threads=1
    )
    frames_per_batch = collector.frames_per_batch_worker * collector.num_workers

    # Collect data
    pbar = tqdm(total=collector.total_frames, unit=" frames")
    for data_splitted in collector:
        pbar.update(frames_per_batch)

    # Stop the data collector
    collector.shutdown()

I ran it on a new machine (Apple M3 Max) with the latest torchrl release available on PyPI and the results are encouraging compared to a few months ago. This time, I'm seeing a slowdown of about 25%. As you suggested, it may not be possible to get better performance because of interprocess communication:

profile_torchrl

@duburcqa
Copy link
Contributor

duburcqa commented Jan 6, 2024

After looking more closely, I realised that most of the slowdown was coming from the generic wrapper around gymnasium environments rather than from interprocess communication:

profile_torchrl_single

If I run the same benchmark with a single EnvCreator(GymWrapper(...)) torchrl environment without a collector, it already runs 18% slower than the plain gym.Env environment returned by gymnasium.make. The slowdown reaches 25% if I pipe the torchrl environment into a SyncDataCollector to collect samples:

    %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                                                                                             
    70.00%  82.00%   172.6s    222.0s   step (jiminy_py/simulator.py)
    3.00%   3.00%   11.23s    12.61s   compute_command (common/blocks/proportional_derivative_controller.py)
    2.00%   3.00%    8.02s    18.55s   refresh_observation (common/bases/pipeline_bases.py)
    2.00%   6.00%    7.32s    28.18s   compute_command (common/bases/pipeline_bases.py)
->  0.00%   4.00%    6.47s    10.78s   __init__ (tensordict/tensordict.py)
    1.00%   1.00%    6.24s     6.55s   compute_command (common/blocks/motor_safety_limit.py)
    1.00%   1.00%    6.07s     6.09s   refresh_observation (common/blocks/mahony_filter.py)
    1.00%   4.00%    5.25s    23.80s   _observer_handle (common/bases/generic_bases.py)
->  1.00%   2.00%    3.93s     5.29s   inner_terminated_or_truncated (torchrl/envs/utils.py)
    0.00%  83.00%    3.56s    232.9s   step (common/envs/env_generic.py)
    0.00%   0.00%    3.47s     3.47s   refresh_observation (common/envs/env_generic.py)
->  1.00%   1.00%    2.75s     2.81s   encode (torchrl/data/tensor_specs.py)
->  0.00%   0.00%    2.56s     2.56s   __new__ (tensordict/tensordict.py)
    0.00%   0.00%    2.23s     3.09s   _reduce (common/utils/spaces.py)
    0.00%  83.00%    2.22s    236.3s   step (common/bases/pipeline_bases.py)
->  1.00%   2.00%    2.02s    10.65s   _set (torchrl/envs/utils.py)
->  3.00%   7.00%    1.90s     6.39s   _set_str (tensordict/tensordict.py)
->  2.00%   4.00%    1.84s     4.08s   _validate_value (tensordict/tensordict.py)
->  0.00%   0.00%    1.79s     3.33s   get (tensordict/tensordict.py)
->  2.00%   2.00%    1.61s     1.68s   _convert_to_tensor (tensordict/tensordict.py)
    0.00%   0.00%    1.60s     1.60s   _forward (common/utils/spaces.py)
->  0.00%   0.00%    1.44s     1.44s   __getitem__ (torchrl/data/tensor_specs.py)
    2.00%  12.00%    1.44s    49.58s   _controller_handle (common/bases/generic_bases.py)
    0.00%   0.00%    1.33s     1.47s   start (jiminy_py/simulator.py)
->  0.00%   0.00%    1.27s     1.49s   _aggregate_end_of_traj (torchrl/envs/utils.py)
->  0.00% 100.00%    1.11s    290.6s   decorate_context (torch/utils/_contextlib.py)
->  1.00%  91.00%    1.08s    250.6s   _step (torchrl/envs/gym_like.py)
    0.00%   0.00%    1.03s     1.03s   _numba_unpickle (numba/core/serialize.py)
->  0.00%   0.00%   0.870s     1.61s   _complete_done (torchrl/envs/common.py)
->  0.00%   0.00%   0.870s    0.950s   _get_str (tensordict/tensordict.py)
->  0.00%   3.00%   0.810s    13.71s   step_mdp (torchrl/envs/utils.py)
->  0.00%   0.00%   0.780s     1.54s   _get_tuple (tensordict/tensordict.py)
->  0.00%   7.00%   0.770s     6.68s   set (tensordict/tensordict.py)
->  0.00%   0.00%   0.700s     2.38s   _update_traj_ids (torchrl/collectors/collectors.py)
    0.00%   0.00%   0.670s    0.680s   __call__ (llvmlite/binding/ffi.py)
->  0.00%   0.00%   0.640s     2.04s   clone (tensordict/tensordict.py)
->  0.00%   0.00%   0.600s    0.600s   to_numpy (torchrl/data/tensor_specs.py)
->  1.00%   2.00%   0.590s     3.09s   read_obs (torchrl/envs/gym_like.py)
->  1.00%   1.00%   0.560s     7.28s   select (tensordict/tensordict.py)
->  1.00% 100.00%   0.550s    289.3s   rollout (torchrl/collectors/collectors.py)
->  0.00%   0.00%   0.540s     1.81s   _stack_onto_ (tensordict/tensordict.py)
    0.00%   0.00%   0.520s    0.520s   unwrapped (common/bases/pipeline_bases.py)
->  0.00%   1.00%   0.510s     2.85s   _step_proc_data (torchrl/envs/common.py)
->  0.00%   0.00%   0.490s    0.490s   zero (torchrl/data/tensor_specs.py)
    0.00%   0.00%   0.490s    0.920s   _setup (common/envs/env_locomotion.py)
->  0.00%   0.00%   0.480s    0.480s   items (torchrl/data/tensor_specs.py)
->  0.00%  98.00%   0.440s    279.8s   step_and_maybe_reset (torchrl/envs/common.py)
    0.00%   0.00%   0.440s     1.33s   _call_with_frames_removed (<frozen importlib._bootstrap>)
->  0.00%   0.00%   0.430s    0.840s   _check_keys (tensordict/tensordict.py)
    0.00%   0.00%   0.430s    0.990s   deepcopy (copy.py)
    0.00%   0.00%   0.420s    0.420s   reset (jiminy_py/simulator.py)
->  0.00%   1.00%   0.420s    0.540s   is_tensor_collection (tensordict/tensordict.py)
    0.00%   0.00%   0.400s    0.400s   compute_reward (common/bases/generic_bases.py)
->  0.00%   0.00%   0.390s    0.460s   keys (tensordict/utils.py)
    0.00%   0.00%   0.380s    0.680s   is_training (common/bases/pipeline_bases.py)
    0.00%   0.00%   0.380s    0.380s   __iter__ (<frozen _collections_abc>)
->  0.00%   7.00%   0.380s     5.91s   _set_tuple (tensordict/tensordict.py)
->  0.00%  93.00%   0.370s    254.7s   step (torchrl/envs/common.py)
->  0.00%   0.00%   0.370s    0.560s   <listcomp> (tensordict/tensordict.py)
->  0.00%   0.00%   0.340s    0.340s   del_ (tensordict/tensordict.py)
    0.00%   0.00%   0.330s    0.330s   compute_command (common/envs/env_generic.py)
    1.00%   1.00%   0.310s     2.94s   has_terminated (common/envs/env_locomotion.py)
->  1.00%   1.00%   0.310s    0.310s   _is_tensor_collection (tensordict/tensordict.py)
    0.00%   0.00%   0.300s    0.300s   is_training (common/envs/env_generic.py)
    0.00%   0.00%   0.290s    0.290s   _keep_alive (copy.py)
->  0.00%   0.00%   0.280s    0.420s   __iter__ (torchrl/data/tensor_specs.py)

It shows that at least 43.26s is spent in torchrl out of a total running time of 290.6s, i.e. about 15%. I guess it does not add up to the 30% I mentioned earlier because the profiling itself distorts the statistics.

From this standpoint, it is not clear to me if anything can be done to speed things up.

@vmoens
Copy link
Contributor

vmoens commented Jan 15, 2024

Good to know thanks for investigating this!
We should defo make these wrappers more efficient!
To be open about it, even if I acknowledge that there is room for improvement, I wonder if our time would not be better spent making the next gen of simulators (that are truly vectorized, like mujoco-mjx and isaac) work without effort with torchrl rather than optimizing for envs that are slow by nature. Happy to hear what your take is on that!

@duburcqa
Copy link
Contributor

duburcqa commented Jan 15, 2024

In my view, there is no such thing as a "next gen of simulators" in the real world. Both single cpu-based and vectorized gpu simulators are still, and will remain, relevant for their own respective applications.

First, only classic cpu mode is relevant for all but RL applications, since in the vast majority of use cases you are only willing to run a single simulation at a time, not to mention critical embedded software. Yet, it is critical to use the same simulator over the whole pipeline, from RL training to classical offline planning algorithms and online model-based predictive control: not only because it is the only way to make a fair comparison between methods without doing real experiments, but also because fine-tuning several simulators to make them as realistic as possible for a given use case is too much effort. Since I don't think it is realistic to expect a simulator to support both vectorized gpu mode and classic cpu mode, classic cpu-based simulation may be the only viable option in practice.

Apart from that, in various real-world training scenarios, the actual simulated system may change internally between episodes, for instance to challenge the same policy on different models of the same physical platform (e.g. broken parts) as an advanced form of domain randomization. In such a case, batched gpu simulation is not applicable.

Next, running cpu-based simulations in parallel is already fast enough for real-world R&D on complex systems such as humanoid robots. For instance, it takes only 1h to collect 100M timesteps on a MacBook Pro M3 when training locomotion on Boston Dynamics' Atlas robot. There is no need to go faster if you cannot iterate faster, because analyzing the results is time-consuming anyway.

Finally, many complex algorithms are not yet ready to be integrated in batched gpu simulators, e.g. complex mesh-mesh collision detection algorithms, so if you want to perform a very realistic simulation you need to fall back to a cpu-based simulator. Mujoco and Isaac certainly didn't rise to fame on the basis of how realistic they are.

To wrap up, CPU-based simulation is definitely nowhere near dead to me.

@ShaneFlandermeyer
Author

I second @duburcqa's comment. Simulators like Isaac and the new MuJoCo are awesome if you're doing pure RL algorithm development, but they seem less useful for applied RL research, where you spend most of your time making custom environments. In my use case, CPU environments provide a reasonable trade-off between simulation speed and development time early in the design process, which is an important first step that should not be overlooked IMO.

Just my perspective from the applications side of things.

@matteobettini
Contributor

I also agree. Especially in the field of robotics, it is always important to remember that the real world is not vectorized, and that online learning in the real world will gain increasing attention.

@vmoens
Contributor

vmoens commented Jan 15, 2024

Thanks all for the valuable feedback!

Those are really valid points.

So what's your take on this topic for torchrl then? What's the best way forward?

The overhead observed by @duburcqa is hard to solve because gym is not very explicit about what it returns: unlike torchrl, I can't tell in advance whether I will get an info dict or not, whether my obs dictionary is complete or not; I can't even tell if my reward is a float or a numpy.ndarray... For these reasons we have to do multiple checks.

It was once suggested to me that we could do these checks for the first iteration and then stick by it with some sort of compiled code but I don't really see how to make that happen in a simple way.

One option is to document how to write a custom gym wrapper with no checks to improve the runtime.
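One way to realise the "check once, then specialize" idea is a thin wrapper that inspects the first `step()` return and binds a fast reader for every subsequent call. This is only a plain-Python sketch with hypothetical names, not torchrl code:

```python
# Sketch: specialize the step-output reader after the first call.
# All names here are illustrative, not the torchrl API.

class SpecializingWrapper:
    def __init__(self, env):
        self.env = env
        self._read = None  # fast path, bound after the first step

    def step(self, action):
        out = self.env.step(action)
        if self._read is None:
            # Pay the inspection cost exactly once.
            self._read = self._build_reader(out)
        return self._read(out)

    @staticmethod
    def _build_reader(out):
        # Inspect the first result: 5-tuple (gymnasium) vs 4-tuple (old gym),
        # then return a reader with no per-step type dispatch.
        if len(out) == 5:   # obs, reward, terminated, truncated, info
            def read(o):
                obs, rew, term, trunc, info = o
                return obs, float(rew), bool(term or trunc), info
        else:               # obs, reward, done, info
            def read(o):
                obs, rew, done, info = o
                return obs, float(rew), bool(done), info
        return read
```

The same pattern could cover the reward-dtype and info-dict checks mentioned above; the point is simply that the branching happens at bind time, not in the hot loop.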

I also agree. Especially in the field of robotics it is always important to remember that the real world is not vectorized and online learning in the real world is something that will gain increasing attention.

For sure but I don't think that applies to this case, where you'd wrap a gym env in torchrl and do checks over the types and devices etc. If you're working with a robot you will most likely have your own environment tailored for that use case. In other words, I don't think that this impacts whether or not we should dedicate a lot of effort to bridge a potential 20% runtime gap compared to gym async envs.

@vmoens
Contributor

vmoens commented Jan 15, 2024

@duburcqa I forgot to ask: are you using tensordict nightlies or the latest stable version?

@duburcqa
Contributor

duburcqa commented Jan 15, 2024

It was once suggested to me that we could do these checks for the first iteration and then stick by it with some sort of compiled code but I don't really see how to make that happen in a simple way.

All of this could be done only once, at init, since it is reasonable to expect that types do not change across steps. This way, it would not add any runtime cost. Ideally, the whole computation path should be defined statically once and for all, then called whenever necessary; here is an example where I do this. I agree it is quite tricky to implement, but the performance benefit can be very significant on the hot path. Still, maybe it is not necessary to go that far, and there is a trade-off between a fully static computation path and a fully runtime one.
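The static-path idea could be sketched as follows (illustrative names only, and not necessarily how the linked example implements it): resolve all type checks once at construction, and keep only pre-bound callables in the hot loop.

```python
# Sketch: freeze the per-step processing pipeline at init time so the
# hot loop only calls pre-bound functions, with no type dispatch.
# Names are hypothetical, not an existing API.

def build_pipeline(sample_obs):
    """Inspect a sample observation once; return a list of callables."""
    stages = []
    if isinstance(sample_obs, dict):
        keys = tuple(sample_obs)                   # key set fixed at init
        stages.append(lambda o: {k: o[k] for k in keys})
    else:
        stages.append(lambda o: o)                 # passthrough for flat obs
    stages.append(lambda o: o)                     # placeholder, e.g. device transfer
    return stages

def process(obs, stages):
    for fn in stages:                              # hot path: no isinstance checks
        obs = fn(obs)
    return obs
```

A hybrid is also possible, as suggested above: keep a cheap guard in the hot path and only rebuild the pipeline if the guard trips.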

One option is to document how to write a custom gym wrapper with no checks to improve the runtime.

I'm definitely fine with it!

If you're working with a robot you will most likely have your own environment tailored for that use case. In other words, I don't think that this impacts whether or not we should dedicate a lot of effort to bridge a potential 20% runtime gap compared to gym async envs.

I agree this is not the most convincing argument.

I forgot to ask: are you using tensordict nightlies or the latest stable version?

The latest stable, but I could use something else if you want.

@duburcqa
Contributor

duburcqa commented Feb 4, 2024

For the record, here are the results of my benchmark for the latest release (0.3.x):

    %Own   %Total  OwnTime  TotalTime  Function (filename)                      
   71.00%  86.00%   176.6s    225.9s   step (jiminy_py/simulator.py)
    4.00%   5.00%   12.19s    13.27s   compute_command (common/blocks/proportional_derivative_controller.py)
    3.00%   9.00%    7.59s    30.21s   compute_command (common/bases/pipeline_bases.py)
    0.00%   0.00%    7.18s     7.60s   compute_command (common/blocks/motor_safety_limit.py)
    0.00%   4.00%    6.89s    17.33s   refresh_observation (common/bases/pipeline_bases.py)
    1.00%   1.00%    6.10s     6.12s   refresh_observation (common/blocks/mahony_filter.py)
->  0.00%   0.00%    5.10s     6.76s   _apply_nest (tensordict/_td.py)
    1.00%   5.00%    4.27s    21.60s   _observer_handle (common/bases/generic_bases.py)
    0.00%  87.00%    3.82s    237.2s   step (common/envs/env_generic.py)
->  0.00%   0.00%    3.75s     5.17s   inner_terminated_or_truncated (torchrl/envs/utils.py)
    2.00%   2.00%    3.41s     3.41s   refresh_observation (common/envs/env_generic.py)
->  0.00%   0.00%    2.82s     2.87s   encode (torchrl/data/tensor_specs.py)
    0.00%   0.00%    2.20s     2.97s   _reduce (common/utils/spaces.py)
->  0.00%   0.00%    2.00s     4.32s   _set (torchrl/envs/utils.py)
->  0.00%   1.00%    1.95s     3.09s   get (tensordict/base.py)
->  1.00%   1.00%    1.83s     1.83s   read_done (torchrl/envs/gym_like.py)
    0.00%  87.00%    1.68s    239.7s   step (common/bases/pipeline_bases.py)
    2.00%  15.00%    1.58s    49.60s   _controller_handle (common/bases/generic_bases.py)
    1.00%   1.00%    1.53s     1.56s   _forward (common/utils/spaces.py)
->  1.00%   1.00%    1.52s     1.52s   __getitem__ (torchrl/data/tensor_specs.py)
->  0.00%   1.00%    1.50s     2.48s   _validate_value (tensordict/base.py)
->  2.00%  91.00%    1.48s    258.9s   _step (torchrl/envs/gym_like.py)
->  0.00%   0.00%    1.19s     1.46s   _aggregate_end_of_traj (torchrl/envs/utils.py)
    2.00%   2.00%    1.16s     1.29s   start (jiminy_py/simulator.py)
->  1.00%   1.00%    1.03s     8.16s   to (tensordict/_td.py)
->  0.00%   1.00%   0.980s     3.46s   _set_str (tensordict/_td.py)
->  0.00%   0.00%   0.950s    0.950s   __init__ (tensordict/_td.py)
    1.00%   1.00%   0.950s    0.950s   _numba_unpickle (numba/core/serialize.py)
    1.00%   1.00%   0.940s    0.940s   __subclasscheck__ (<frozen abc>)
->  0.00%   0.00%   0.860s    0.860s   zero (torchrl/data/tensor_specs.py)
    0.00% 100.00%   0.860s    291.0s   decorate_context (torch/utils/_contextlib.py)
->  0.00%   0.00%   0.840s     1.77s   _complete_done (torchrl/envs/common.py)
->  0.00%   1.00%   0.830s     2.48s   _update_traj_ids (torchrl/collectors/collectors.py)
->  0.00%   0.00%   0.810s     5.93s   step_mdp (torchrl/envs/utils.py)
->  0.00%   0.00%   0.780s    0.780s   items (torchrl/data/tensor_specs.py)
->  1.00%   1.00%   0.780s    0.860s   _get_str (tensordict/_td.py)
->  0.00%   0.00%   0.740s     1.83s   _stack_onto_ (tensordict/_td.py)
->  0.00%   0.00%   0.670s    0.670s   _parse_to (tensordict/utils.py)
->  0.00% 100.00%   0.670s    290.2s   rollout (torchrl/collectors/collectors.py)
    0.00%   0.00%   0.610s    0.640s   __call__ (llvmlite/binding/ffi.py)
    0.00%   0.00%   0.600s     1.07s   deepcopy (copy.py)
->  0.00%   1.00%   0.590s     1.14s   _get_tuple (tensordict/_td.py)
->  0.00%   1.00%   0.570s     3.09s   _step_proc_data (torchrl/envs/common.py)
    0.00%   0.00%   0.560s     1.98s   _call_with_frames_removed (<frozen importlib._bootstrap>)
    1.00%   1.00%   0.530s    0.530s   reset (jiminy_py/simulator.py)
->  0.00%   0.00%   0.520s    0.520s   to_numpy (torchrl/data/tensor_specs.py)
->  0.00%   0.00%   0.480s     3.35s   read_obs (torchrl/envs/gym_like.py)
    1.00%   1.00%   0.470s     1.15s   _setup (common/envs/env_locomotion.py)
->  0.00%   0.00%   0.460s    0.500s   _clone_value (tensordict/utils.py)
->  0.00%  93.00%   0.450s    264.0s   step (torchrl/envs/common.py)
    0.00%   0.00%   0.450s    0.450s   __iter__ (<frozen _collections_abc>)
    0.00%   0.00%   0.410s    0.410s   unwrapped (common/bases/pipeline_bases.py)
->  0.00%   0.00%   0.410s    0.410s   _exclude (tensordict/_td.py)
->  0.00%   0.00%   0.390s    0.490s   is_tensor_collection (tensordict/base.py)
    0.00%   0.00%   0.380s    0.550s   is_training (common/bases/pipeline_bases.py)
->  0.00%  97.00%   0.380s    280.2s   step_and_maybe_reset (torchrl/envs/common.py)
    0.00%   0.00%   0.370s    0.460s   compute_reward (common/envs/env_locomotion.py)
    0.00%   0.00%   0.360s    0.370s   __instancecheck__ (<frozen abc>)
->  0.00%   1.00%   0.350s     3.28s   set (tensordict/base.py)
    0.00%   0.00%   0.340s    0.340s   compute_command (common/envs/env_generic.py)
->  0.00%   0.00%   0.340s    0.650s   <listcomp> (tensordict/_td.py)
    0.00%   0.00%   0.330s    0.430s   compute_transform_contact (jiminy_py/dynamics.py)
->  0.00%   0.00%   0.320s     5.98s   _terminated_or_truncated (torchrl/envs/utils.py)
->  0.00%   0.00%   0.320s    0.650s   _check_keys (tensordict/utils.py)
    0.00%   0.00%   0.310s    0.880s   _deepcopy_dict (copy.py)
    0.00%   1.00%   0.300s     2.11s   _call_impl (torch/nn/modules/module.py)

It shows that 38.66s is spent in torchrl/tensordict methods out of a total running time of 291.0s, i.e. 13.3%. It was 43.26s (14.8%) on the 0.2.x release. So it is slightly better than before, but not a game changer. Overall, though, it was and still is fairly acceptable.
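The percentage can be reproduced mechanically from the dump above: sum the OwnTime column over the lines flagged with `->` and divide by the total running time. A rough helper, assuming the column layout shown (the three-line sample below is excerpted from the dump):

```python
# Total the OwnTime of profiler lines flagged with "->", i.e. the
# torchrl/tensordict frames, following the py-spy-style columns above.

def torchrl_share(profile_text, total_time_s):
    own = 0.0
    for line in profile_text.splitlines():
        if not line.lstrip().startswith("->"):
            continue
        field = line.split()[3]          # OwnTime column, e.g. "5.10s"
        own += float(field.rstrip("s"))
    return own, 100.0 * own / total_time_s

sample = """\
->  0.00%   0.00%    5.10s     6.76s   _apply_nest (tensordict/_td.py)
    0.00%  87.00%    3.82s    237.2s   step (common/envs/env_generic.py)
->  0.00%   0.00%    3.75s     5.17s   inner_terminated_or_truncated (torchrl/envs/utils.py)
"""
own, pct = torchrl_share(sample, 291.0)
```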

@vmoens
Contributor

vmoens commented Feb 4, 2024

What do you include in those 38s?
I see `to` is taking 8 secs, but for instance that's not something TensorDict could really help with (except if we revert to non_blocking=True, which will be part of 0.3.1).

@duburcqa
Contributor

duburcqa commented Feb 4, 2024

What do you include in those 38s?

The OwnTime of all lines marked with `->`, i.e. those related to either torchrl or tensordict.

@vmoens
Contributor

vmoens commented Feb 4, 2024

Got it.
Unfortunately, I don't see a single bottleneck here that would bring you a significant speed-up. Even if we reduced `_apply_nest` to 0 you would only gain 5 secs. I don't mean the work is done, but I wouldn't expect a couple of PRs to improve things by an order of magnitude. I'll keep an eye on that trace and do my best to improve things!

Besides, correct me if I'm wrong, but I think that torchrl accounting for around 10% of runtime is a hit most people are ready to take, for two reasons:

  • runtime isn't everything: when it comes to dev time, you also want to be fast at coming up with solutions to your problems
  • more importantly, measuring the runtime of torchrl will always give the impression that we incur some extra cost you would not have had with anything else. What it does not show is all the bits and pieces of code where torchrl and TensorDict speed things up compared to a naive implementation. For instance, there are plenty of tricks we use (like the non_blocking transfers above, using torch.where, vectorizing value functions, etc.) where another codebase would be slower, but a profile would just not show any torchrl or tensordict overhead.
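As an illustration of the branchless-update trick mentioned above (shown here with NumPy for brevity; `torch.where` behaves analogously), a batch of environments can be partially reset without a Python loop over envs:

```python
import numpy as np

# Branchless partial reset across a batch of envs: instead of looping
# over envs and branching on `done`, select with `where` in one
# vectorized call. torch.where works the same way on tensors.

def masked_reset(obs, reset_obs, done):
    # done: (N,) bool; obs, reset_obs: (N, D)
    return np.where(done[:, None], reset_obs, obs)

obs = np.arange(6, dtype=float).reshape(3, 2)   # 3 envs, 2-dim obs
reset_obs = np.zeros((3, 2))
done = np.array([False, True, False])
out = masked_reset(obs, reset_obs, done)
# only the row where done is True is replaced by its reset observation
```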

@duburcqa
Contributor

duburcqa commented Feb 4, 2024

Unfortunately here I don't see a single bottleneck that would bring you a significant speed up.

Yes, indeed. I don't think it is worth the effort at this point. Still, the issue with profiling is that it alters the original timing. The slowdown I observe from torchrl data collection, compared with running episodes without torch and throwing away all the samples, is twice as large with profiling disabled as with it enabled. I need to check again, but on a real use case I expect the actual slowdown to be closer to 25% than 13%.

runtime isn't everything, when it comes to dev time you also want to be faster in coming up with solutions to your problems

I completely agree. As I said, it was acceptable before and it still is. IMHO, the rationale is whether or not torchrl is competitive against other RL libraries targeting similar problems. In practice, I was stuck with ray[rllib] because until recently it was at least twice as fast as all other RL libraries. torchrl is still not faster on my use cases, but it has become very competitive. Personally, I can accept a 10-15% slowdown if the library is superior on other aspects (modularity, maintainability, ease of use...), but hardly more.
