[BUG] Throughput vs Gym AsyncVectorEnv #1325
Comments
Thanks for reporting this! We appreciate the feedback. The results of the paper were obtained using collectors, not parallel envs. I will be posting the code shortly for reproducibility. Also, I see that you're using the latest stable version of the library (which is good!). You'll be happy to know that we've sped up a bunch of operations in tensordict and vectorized envs, and the nightly releases of tensordict and torchrl should give you better results. There are other optimizations we can do, so I'm confident we can accelerate things even more.
Executing a slightly modified version of the code above on my MacBook, I get a speed for TorchRL that is 2x slower than the Gym one. The overhead is mainly caused by tensordict operations. However, using the config we had in the paper (mainly 32 procs instead of 8 and more CUDA devices used for passing data from one proc to another), I get a speed of 8k fps as reported in the paper. When using 4 parallel envs / collector and 8 collectors, I get a speed of 16k fps. I will keep updating this post and related ones as we optimize things further.
Hey! I adapted the benchmark for a more straightforward comparison and ran it on all the relevant parallel collectors (specs, code, and output below). Summary:
# Output is for
# Running 8 envs with 2000 frames per batch (i.e. 250.0 frames per env).
# Running 8 envs with 80 frames per batch (i.e. 10.0 frames per env).
FPS Gym AsyncVectorEnv mean: 11587.749470459721
FPS Gym AsyncVectorEnv mean: 11317.07864610709
FPS TorchRL with MultiSyncDataCollector on cpu mean: 2866.4653702739765
FPS TorchRL with MultiSyncDataCollector on cpu mean: 979.1281444006343
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 5108.929550860561
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 3225.940279205892
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 4349.725160643495
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 1700.2908940649854
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 4745.486839045239
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 5223.255705987036
FPS TorchRL with SyncDataCollector on cpu mean: 587.1912214549519
FPS TorchRL with SyncDataCollector on cpu mean: 547.1234236961403
FPS TorchRL with SyncDataCollector on cuda:0 mean: 3171.8938244285127
FPS TorchRL with SyncDataCollector on cuda:0 mean: 2424.713380369402
As @ShaneFlandermeyer reported, Gym's AsyncVectorEnv comes out well ahead of the TorchRL collectors in both settings. And in general, there's a big difference between running large batches and small batches with the TorchRL collectors, while Gym's throughput barely changes.
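One note on reading the numbers above: the script reports mean FPS as total_frames / sum(times), i.e. a time-weighted mean of the per-batch rates, so a few very slow batches drag it down far more than occasional huge spikes pull it up (visible in the MultiaSyncDataCollector logs, where per-batch FPS spikes into the hundreds of thousands yet the mean stays low). A minimal sketch with hypothetical timings, not taken from the logs:

```python
# Why the reported "mean" FPS differs from averaging the per-batch FPS values:
# the benchmark divides total frames by total time, which is the time-weighted
# (harmonic-style) mean, not the arithmetic mean of per-batch FPS.

frames_per_batch = 10_000
batch_times = [0.01, 10.0]  # hypothetical: one near-instant batch, one very slow batch

per_batch_fps = [frames_per_batch / t for t in batch_times]
arithmetic_mean = sum(per_batch_fps) / len(per_batch_fps)
overall_fps = len(batch_times) * frames_per_batch / sum(batch_times)

print(round(arithmetic_mean))  # 500500: dominated by the fast batch's spike
print(round(overall_fps))      # 1998: dominated by the slow batch
```

This is why the async collectors' spiky per-batch logs are consistent with their modest means: the collector mostly waits, then drains buffered batches almost instantly.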
Specs:
print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
2023.7.18 1.25.0 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] linux
# torchrl at commit 886635e
# Hardware
GPU 0: Tesla V100-SXM2-32GB
2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
❯ mamba list | grep torch
pytorch 2.0.1 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h778d358_5 pytorch
pytorch-mutex 1.0 cuda pytorch
torchrl-nightly 2023.7.18 pypi_0 pypi
torchtriton 2.0.0 py310 pytorch
torchvision 0.15.2 py310_cu117 pytorch
Code
import time
import warnings
from argparse import ArgumentParser

from torchrl.collectors.collectors import (
    MultiaSyncDataCollector,
    MultiSyncDataCollector,
    RandomPolicy,
    SyncDataCollector,
)
from torchrl.envs import EnvCreator, ParallelEnv
from torchrl.envs.libs.gym import GymEnv

import gymnasium as gym

warnings.filterwarnings("ignore", category=UserWarning)

parser = ArgumentParser()
parser.add_argument("--num_workers", default=8, type=int, help="Number of workers.")
parser.add_argument(
    "--frames_per_batch",
    default=10_000,
    type=int,
    help="Number of frames collected in a batch. Must be "
    "divisible by the number of workers.",
)
parser.add_argument(
    "--total_frames",
    default=100_000,
    type=int,
    help="Total number of frames collected by the collector. Must be "
    "divisible by the number of frames per batch.",
)
parser.add_argument(
    "--log_every",
    default=10_000,
    type=int,
    help="Number of frames between each log.",
)
parser.add_argument(
    "--env",
    default="PongNoFrameskip-v4",
    help="Gym environment to be run.",
)

if __name__ == "__main__":
    args = parser.parse_args()
    num_workers = args.num_workers
    frames_per_batch = args.frames_per_batch
    print(
        f"Running {num_workers} envs with {frames_per_batch} frames per batch"
        f" (i.e. {frames_per_batch / num_workers} frames per env)."
    )

    # Test asynchronous gym collector
    def test_gym():
        env = gym.vector.AsyncVectorEnv(
            [lambda: gym.make(args.env) for _ in range(num_workers)]
        )
        env.reset()
        global_step = 0
        times = []
        start = time.time()
        print("Timer started.")
        for _ in range(args.total_frames // num_workers):
            env.step(env.action_space.sample())
            global_step += num_workers
            if global_step % int(frames_per_batch) == 0:
                times.append(time.time() - start)
                fps = frames_per_batch / times[-1]
                if global_step % args.log_every == 0:
                    print(f"FPS Gym AsyncVectorEnv at step {global_step}:", fps)
                start = time.time()
        env.close()
        print("FPS Gym AsyncVectorEnv mean:", args.total_frames / sum(times))

    # Test multiprocess TorchRL collector
    def test_torch_rl(collector_class, device):
        make_env = EnvCreator(lambda: GymEnv(args.env, device=device))
        if collector_class in [MultiSyncDataCollector, MultiaSyncDataCollector]:
            mock_env = make_env()
            collector = collector_class(
                [make_env] * num_workers,
                policy=RandomPolicy(mock_env.action_spec),
                total_frames=args.total_frames,
                frames_per_batch=frames_per_batch,
                device=device,
                storing_device=device,
            )
        elif collector_class in [SyncDataCollector]:
            parallel_env = ParallelEnv(args.num_workers, make_env)
            collector = SyncDataCollector(
                parallel_env,
                policy=RandomPolicy(parallel_env.action_spec),
                total_frames=args.total_frames,
                frames_per_batch=frames_per_batch,
                device=device,
                storing_device=device,
            )
        global_step = 0
        times = []
        start = time.time()
        print("Timer started.")
        for i, data in enumerate(collector):
            global_step += data.numel()
            times.append(time.time() - start)
            fps = frames_per_batch / times[-1]
            if global_step % args.log_every == 0:
                print(
                    f"FPS TorchRL with {collector_class.__name__} on {device} at step {global_step}:",
                    fps,
                )
            start = time.time()
        collector.shutdown()
        print(
            "FPS TorchRL with",
            collector_class.__name__,
            "on",
            device,
            "mean:",
            args.total_frames / sum(times),
        )

    test_gym()
    for collector_class in [
        MultiSyncDataCollector,
        MultiaSyncDataCollector,
        SyncDataCollector,
    ]:
        for device in ["cpu", "cuda:0"]:
            test_torch_rl(collector_class, device)
    exit()
Output
❯ python benchmarks/test_torchrl_vs_gym.py --num_workers=8 --frames_per_batch=2_000
Running 8 envs with 2000 frames per batch (i.e. 250.0 frames per env).
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS Gym AsyncVectorEnv at step 10000: 11880.248747693306
FPS Gym AsyncVectorEnv at step 20000: 14411.756722164288
FPS Gym AsyncVectorEnv at step 30000: 11630.216962415272
FPS Gym AsyncVectorEnv at step 40000: 10516.125311837932
FPS Gym AsyncVectorEnv at step 50000: 9574.003693279737
FPS Gym AsyncVectorEnv at step 60000: 9945.684469115837
FPS Gym AsyncVectorEnv at step 70000: 11437.316106795084
FPS Gym AsyncVectorEnv at step 80000: 8848.28526795556
FPS Gym AsyncVectorEnv at step 90000: 11634.265615941727
FPS Gym AsyncVectorEnv at step 100000: 10506.484039768444
FPS Gym AsyncVectorEnv mean: 11587.749470459721
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cpu at step 10000: 2834.0370614368485
FPS TorchRL with MultiSyncDataCollector on cpu at step 20000: 3115.0810538222336
FPS TorchRL with MultiSyncDataCollector on cpu at step 30000: 2827.5250466248276
FPS TorchRL with MultiSyncDataCollector on cpu at step 40000: 3015.099926353219
FPS TorchRL with MultiSyncDataCollector on cpu at step 50000: 2788.602794517342
FPS TorchRL with MultiSyncDataCollector on cpu at step 60000: 3205.3090866295415
FPS TorchRL with MultiSyncDataCollector on cpu at step 70000: 2482.831288036687
FPS TorchRL with MultiSyncDataCollector on cpu at step 80000: 2844.3913975942423
FPS TorchRL with MultiSyncDataCollector on cpu at step 90000: 3150.4935152535104
FPS TorchRL with MultiSyncDataCollector on cpu at step 100000: 3045.399594411799
FPS TorchRL with MultiSyncDataCollector on cpu mean: 2866.4653702739765
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 10000: 5257.937418281604
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 20000: 5208.714091009286
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 30000: 5117.738499344772
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 40000: 5137.022029172594
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 50000: 5189.274791404789
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 60000: 5110.6078855326305
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 70000: 3927.3263204022173
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 80000: 5181.009037697941
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 90000: 5198.836110327817
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 100000: 5182.798409438583
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 5108.929550860561
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cpu at step 10000: 5337.133782600729
FPS TorchRL with MultiaSyncDataCollector on cpu at step 20000: 1227.4027505992447
FPS TorchRL with MultiaSyncDataCollector on cpu at step 30000: 4303.450257146888
FPS TorchRL with MultiaSyncDataCollector on cpu at step 40000: 14703.854837126471
FPS TorchRL with MultiaSyncDataCollector on cpu at step 50000: 1778.4022887743843
FPS TorchRL with MultiaSyncDataCollector on cpu at step 60000: 6623.503838558663
FPS TorchRL with MultiaSyncDataCollector on cpu at step 70000: 4115.28234834704
FPS TorchRL with MultiaSyncDataCollector on cpu at step 80000: 4658.867646209418
FPS TorchRL with MultiaSyncDataCollector on cpu at step 90000: 3634.1222290093533
FPS TorchRL with MultiaSyncDataCollector on cpu at step 100000: 6693.456638912273
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 4349.725160643495
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 10000: 689399.0795529257
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 20000: 18005.632254962566
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 30000: 264391.32627332327
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 40000: 130747.17498714132
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 50000: 776.6116486266598
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 60000: 41273.37941892789
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 70000: 49679.94646230745
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 80000: 495107.59605736885
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 90000: 113989.59111847916
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 100000: 5840.644401694138
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 4745.486839045239
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cpu at step 10000: 689.0744194898969
FPS TorchRL with SyncDataCollector on cpu at step 20000: 548.245757072972
FPS TorchRL with SyncDataCollector on cpu at step 30000: 622.7927557851568
FPS TorchRL with SyncDataCollector on cpu at step 40000: 565.5496544653071
FPS TorchRL with SyncDataCollector on cpu at step 50000: 549.4385496083208
FPS TorchRL with SyncDataCollector on cpu at step 60000: 589.2434754865648
FPS TorchRL with SyncDataCollector on cpu at step 70000: 630.9949199112759
FPS TorchRL with SyncDataCollector on cpu at step 80000: 624.1467153348757
FPS TorchRL with SyncDataCollector on cpu at step 90000: 497.86883899324306
FPS TorchRL with SyncDataCollector on cpu at step 100000: 504.7398666134446
FPS TorchRL with SyncDataCollector on cpu mean: 587.1912214549519
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cuda:0 at step 10000: 3904.9437622718906
FPS TorchRL with SyncDataCollector on cuda:0 at step 20000: 3288.8724744697624
FPS TorchRL with SyncDataCollector on cuda:0 at step 30000: 3437.67936578839
FPS TorchRL with SyncDataCollector on cuda:0 at step 40000: 3400.23242154962
FPS TorchRL with SyncDataCollector on cuda:0 at step 50000: 3167.2741743898305
FPS TorchRL with SyncDataCollector on cuda:0 at step 60000: 3348.8633532834688
FPS TorchRL with SyncDataCollector on cuda:0 at step 70000: 3145.3018116045278
FPS TorchRL with SyncDataCollector on cuda:0 at step 80000: 3288.962738376645
FPS TorchRL with SyncDataCollector on cuda:0 at step 90000: 3215.286298039423
FPS TorchRL with SyncDataCollector on cuda:0 at step 100000: 1305.8973618519733
FPS TorchRL with SyncDataCollector on cuda:0 mean: 3171.8938244285127
------------------------------------------------------------------------------------------
❯ python benchmarks/test_torchrl_vs_gym.py --num_workers=8 --frames_per_batch=80
Running 8 envs with 80 frames per batch (i.e. 10.0 frames per env).
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS Gym AsyncVectorEnv at step 10000: 14207.745268238981
FPS Gym AsyncVectorEnv at step 20000: 13789.681502486335
FPS Gym AsyncVectorEnv at step 30000: 9946.76943143416
FPS Gym AsyncVectorEnv at step 40000: 8407.314274260229
FPS Gym AsyncVectorEnv at step 50000: 12323.95489771183
FPS Gym AsyncVectorEnv at step 60000: 8352.275601135063
FPS Gym AsyncVectorEnv at step 70000: 12466.351612423838
FPS Gym AsyncVectorEnv at step 80000: 10677.283777763636
FPS Gym AsyncVectorEnv at step 90000: 13954.267653663812
FPS Gym AsyncVectorEnv at step 100000: 7710.826362717162
FPS Gym AsyncVectorEnv mean: 11317.07864610709
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cpu at step 10000: 1075.6177012710166
FPS TorchRL with MultiSyncDataCollector on cpu at step 20000: 1057.8420350759939
FPS TorchRL with MultiSyncDataCollector on cpu at step 30000: 943.7968531134151
FPS TorchRL with MultiSyncDataCollector on cpu at step 40000: 1059.3114595731113
FPS TorchRL with MultiSyncDataCollector on cpu at step 50000: 880.3610184077409
FPS TorchRL with MultiSyncDataCollector on cpu at step 60000: 1175.6983882270497
FPS TorchRL with MultiSyncDataCollector on cpu at step 70000: 1007.8040282808623
FPS TorchRL with MultiSyncDataCollector on cpu at step 80000: 951.8447747645523
FPS TorchRL with MultiSyncDataCollector on cpu at step 90000: 1462.7104738904702
FPS TorchRL with MultiSyncDataCollector on cpu at step 100000: 1253.7666695313287
FPS TorchRL with MultiSyncDataCollector on cpu mean: 979.1281444006343
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 10000: 3048.4352826811787
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 20000: 3262.14582928252
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 30000: 3217.5086060582817
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 40000: 3344.70669152022
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 50000: 3387.694047330587
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 60000: 3303.544515658997
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 70000: 3303.2518212246505
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 80000: 3550.6959714712016
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 90000: 3221.864689954487
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 100000: 2741.64395202144
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 3225.940279205892
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cpu at step 10000: 5388.365878725591
FPS TorchRL with MultiaSyncDataCollector on cpu at step 20000: 27384.66661225822
FPS TorchRL with MultiaSyncDataCollector on cpu at step 30000: 2353.42530702708
FPS TorchRL with MultiaSyncDataCollector on cpu at step 40000: 657.7791369430935
FPS TorchRL with MultiaSyncDataCollector on cpu at step 50000: 43965.45073375262
FPS TorchRL with MultiaSyncDataCollector on cpu at step 60000: 2589.4961374914146
FPS TorchRL with MultiaSyncDataCollector on cpu at step 70000: 1340.4240054009356
FPS TorchRL with MultiaSyncDataCollector on cpu at step 80000: 47153.50196739742
FPS TorchRL with MultiaSyncDataCollector on cpu at step 90000: 40790.702650133724
FPS TorchRL with MultiaSyncDataCollector on cpu at step 100000: 9932.636315197442
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 1700.2908940649854
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 10000: 27906.214238190285
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 20000: 5383.3518369966305
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 30000: 19511.793917543757
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 40000: 75863.51345240787
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 50000: 3048.1029768447443
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 60000: 73875.89608102158
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 70000: 13952.526924196432
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 80000: 3091.463160707211
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 90000: 124367.79836916234
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 100000: 2116.4113432233326
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 5223.255705987036
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cpu at step 10000: 533.9423450260014
FPS TorchRL with SyncDataCollector on cpu at step 20000: 618.3473233980839
FPS TorchRL with SyncDataCollector on cpu at step 30000: 633.7098955225008
FPS TorchRL with SyncDataCollector on cpu at step 40000: 629.6678307571342
FPS TorchRL with SyncDataCollector on cpu at step 50000: 591.178977598069
FPS TorchRL with SyncDataCollector on cpu at step 60000: 410.9202417924778
FPS TorchRL with SyncDataCollector on cpu at step 70000: 488.02613322008165
FPS TorchRL with SyncDataCollector on cpu at step 80000: 605.4527213805746
FPS TorchRL with SyncDataCollector on cpu at step 90000: 695.1189003863564
FPS TorchRL with SyncDataCollector on cpu at step 100000: 81.004056910206
FPS TorchRL with SyncDataCollector on cpu mean: 547.1234236961403
...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cuda:0 at step 10000: 3152.338058867187
FPS TorchRL with SyncDataCollector on cuda:0 at step 20000: 2331.9178272593335
FPS TorchRL with SyncDataCollector on cuda:0 at step 30000: 3397.9515741931564
FPS TorchRL with SyncDataCollector on cuda:0 at step 40000: 1760.971529035136
FPS TorchRL with SyncDataCollector on cuda:0 at step 50000: 2330.816337871631
FPS TorchRL with SyncDataCollector on cuda:0 at step 60000: 2617.146244442711
FPS TorchRL with SyncDataCollector on cuda:0 at step 70000: 2050.628368881012
FPS TorchRL with SyncDataCollector on cuda:0 at step 80000: 2597.574781693194
FPS TorchRL with SyncDataCollector on cuda:0 at step 90000: 2078.3301228251644
FPS TorchRL with SyncDataCollector on cuda:0 at step 100000: 90.8393065937377
FPS TorchRL with SyncDataCollector on cuda:0 mean: 2424.713380369402
Thanks guys for looking into this.
Work plan
Here is what I'm envisioning for this:
def rollout_autoreset(self):
    result = []
    cur_data = env.reset()
    for i in range(T):
        _cur_data, next_data = env.step(cur_data)
        # cur_data and next_data are well synced
        result.append((cur_data, next_data))
        # now step_mdp chooses between _cur_data and next_data based on the done state.
        # with envs that have a non-empty batch size, it can mix them together
        cur_data = step_mdp(_cur_data, next_data)
    result, next_result = [torch.stack(r) for r in zip(*result)]
    result.set("next", next_result)
    return result
Why is this better? How do we get there?
This is going to be hugely bc-breaking, so it will have to go through prototyping + a deprecation message (0.2.0, a bit later this year) -> deprecation with the possibility of using the old feature (0.3.0, early 2024) -> total deprecation (0.4.0, somewhere in 2024). I expect the speedup to bring the envs closer to par with Gym in terms of rollout, and to bring data collection using collectors to a superior speed compared with all other regular loops when executed on device (across sync and async). I will open a PoC soon, hoping to get some feedback!
This sounds like a lot of work. Would it be easier to just wrap the gym env in the Autoreset wrapper? |
It seems quite a complex solution.
def step_and_maybe_reset(self, tensordict):
    tensordict = self._step(tensordict)  # this is the current step in main, i.e. tensordict will have both root and next
    _reset = tensordict.get(("next", self.done_key))
    next_td = step_mdp(tensordict)
    reset_td = next_td.clone()
    reset_td.set("_reset", _reset)
    self.reset(reset_td)
    return reset_td, next_td  # these will both contain data only from the next step, but one has been reset and the other not
Alternatively, you could also return the data passed as input as a third output.
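The contract proposed above can be sketched with plain dicts (illustrative stand-ins for tensordicts, not the TorchRL API): the step returns two views of the next state, one already reset to feed the next iteration, and one untouched to store in the trajectory.

```python
# Toy sketch of the step_and_maybe_reset contract: the caller keeps stepping
# from reset_td while storing next_td, so terminal data is never overwritten.

def step_and_maybe_reset(state):
    obs = state["obs"] + 1
    done = obs >= 3  # toy env: episodes end after 3 steps
    next_td = {"obs": obs, "done": done}                    # stored as-is
    reset_td = {"obs": 0 if done else obs, "done": False}   # loop continues from here
    return reset_td, next_td

state = {"obs": 0, "done": False}
stored = []
for _ in range(5):
    state, next_td = step_and_maybe_reset(state)
    stored.append(next_td)

print([td["obs"] for td in stored])   # [1, 2, 3, 1, 2]
print([td["done"] for td in stored])  # [False, False, True, False, False]
```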
Something really strange is happening with the current version. @vmoens, any quick intuition? I tried running as last time with a Docker setup on the same cluster, but also natively on a GCP machine, and the maxed-out CPU problem is still present. Can you reproduce this?
Some logs of the new performance:
# python benchmark.py --num_workers=8 --frames_per_batch=2_000
FPS Gym AsyncVectorEnv mean: 5299.461355240903
FPS TorchRL with MultiSyncDataCollector on cpu mean: 1409.2643700762942
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 1434.0322984191869
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 1539.6185885727893
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 1473.2058280519127
FPS TorchRL with SyncDataCollector on cpu mean: 369.65728004849865
FPS TorchRL with SyncDataCollector on cuda:0 mean: 463.8032136205446
/mloraw1/moalla/open-source/torchrl/rl/benchmarks main*
implicit-pg ❯ python skander.py --num_workers=8 --frames_per_batch=2_000
Running 8 envs with 2000 frames per batch (i.e. 250.0 frames per env).
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS Gym AsyncVectorEnv at step 10000: 4331.723599469573
FPS Gym AsyncVectorEnv at step 20000: 5204.412883276483
FPS Gym AsyncVectorEnv at step 30000: 4185.867384286969
FPS Gym AsyncVectorEnv at step 40000: 5084.152751110335
FPS Gym AsyncVectorEnv at step 50000: 4690.31147382411
FPS Gym AsyncVectorEnv at step 60000: 6040.902036892344
FPS Gym AsyncVectorEnv at step 70000: 6695.075310147556
FPS Gym AsyncVectorEnv at step 80000: 6469.561408176754
FPS Gym AsyncVectorEnv at step 90000: 7322.151306509504
FPS Gym AsyncVectorEnv at step 100000: 5166.639771866397
FPS Gym AsyncVectorEnv mean: 5299.461355240903
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cpu at step 10000: 1320.6028552739604
FPS TorchRL with MultiSyncDataCollector on cpu at step 20000: 1446.4029971546481
FPS TorchRL with MultiSyncDataCollector on cpu at step 30000: 1430.5873492922874
FPS TorchRL with MultiSyncDataCollector on cpu at step 40000: 1550.6232173654255
FPS TorchRL with MultiSyncDataCollector on cpu at step 50000: 1319.6976557084743
FPS TorchRL with MultiSyncDataCollector on cpu at step 60000: 1365.2251196885925
FPS TorchRL with MultiSyncDataCollector on cpu at step 70000: 1430.506355384124
FPS TorchRL with MultiSyncDataCollector on cpu at step 80000: 1391.8008827041454
FPS TorchRL with MultiSyncDataCollector on cpu at step 90000: 1450.1957484061622
FPS TorchRL with MultiSyncDataCollector on cpu at step 100000: 1506.2325100349867
FPS TorchRL with MultiSyncDataCollector on cpu mean: 1409.2643700762942
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 10000: 1511.1733695354542
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 20000: 1394.4858962330331
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 30000: 1495.3765765560202
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 40000: 1542.618458368687
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 50000: 1522.7536971932934
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 60000: 1519.255640425426
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 70000: 1313.5330130432453
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 80000: 1442.4469610732624
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 90000: 1369.7093138217786
FPS TorchRL with MultiSyncDataCollector on cuda:0 at step 100000: 1410.1916111111952
FPS TorchRL with MultiSyncDataCollector on cuda:0 mean: 1434.0322984191869
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cpu at step 10000: 8251.768916064895
FPS TorchRL with MultiaSyncDataCollector on cpu at step 20000: 3297.1962499272845
FPS TorchRL with MultiaSyncDataCollector on cpu at step 30000: 2217.987399000709
FPS TorchRL with MultiaSyncDataCollector on cpu at step 40000: 8748.476584294975
FPS TorchRL with MultiaSyncDataCollector on cpu at step 50000: 586.1939663853168
FPS TorchRL with MultiaSyncDataCollector on cpu at step 60000: 2189.470475220034
FPS TorchRL with MultiaSyncDataCollector on cpu at step 70000: 953.2261035479872
FPS TorchRL with MultiaSyncDataCollector on cpu at step 80000: 3426.168919777176
FPS TorchRL with MultiaSyncDataCollector on cpu at step 90000: 4175.730351396986
FPS TorchRL with MultiaSyncDataCollector on cpu at step 100000: 2373.595130464702
FPS TorchRL with MultiaSyncDataCollector on cpu mean: 1539.6185885727893
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 10000: 16983.39440811451
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 20000: 8616.284707100422
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 30000: 10179.842993249122
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 40000: 30913.891079549223
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 50000: 295.2667884207772
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 60000: 3726.762953649857
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 70000: 3346.7536887226006
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 80000: 3940.057969326326
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 90000: 4978.213290415082
FPS TorchRL with MultiaSyncDataCollector on cuda:0 at step 100000: 14703.31361465317
FPS TorchRL with MultiaSyncDataCollector on cuda:0 mean: 1473.2058280519127
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Timer started.
FPS TorchRL with SyncDataCollector on cpu at step 10000: 280.5230834569255
FPS TorchRL with SyncDataCollector on cpu at step 20000: 291.6941486528101
FPS TorchRL with SyncDataCollector on cpu at step 30000: 487.1544962849032
FPS TorchRL with SyncDataCollector on cpu at step 40000: 356.98548206803537
FPS TorchRL with SyncDataCollector on cpu at step 50000: 494.37040263102915
FPS TorchRL with SyncDataCollector on cpu at step 60000: 483.80378322362986
FPS TorchRL with SyncDataCollector on cpu at step 70000: 555.6786701032164
FPS TorchRL with SyncDataCollector on cpu at step 80000: 475.9004904517329
FPS TorchRL with SyncDataCollector on cpu at step 90000: 352.64654916557294
FPS TorchRL with SyncDataCollector on cpu at step 100000: 289.18409560006296
FPS TorchRL with SyncDataCollector on cpu mean: 369.65728004849865
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
(the banner above is printed once per each of the 8 worker processes)
Timer started.
FPS TorchRL with SyncDataCollector on cuda:0 at step 10000: 356.0708200186096
FPS TorchRL with SyncDataCollector on cuda:0 at step 20000: 416.6240826727289
FPS TorchRL with SyncDataCollector on cuda:0 at step 30000: 411.62237630608746
FPS TorchRL with SyncDataCollector on cuda:0 at step 40000: 448.2072555049812
FPS TorchRL with SyncDataCollector on cuda:0 at step 50000: 451.1781603430556
FPS TorchRL with SyncDataCollector on cuda:0 at step 60000: 505.79521035558116
FPS TorchRL with SyncDataCollector on cuda:0 at step 70000: 529.77273714769
FPS TorchRL with SyncDataCollector on cuda:0 at step 80000: 569.5611329452704
FPS TorchRL with SyncDataCollector on cuda:0 at step 90000: 432.81666192415577
FPS TorchRL with SyncDataCollector on cuda:0 at step 100000: 373.8837006455133
FPS TorchRL with SyncDataCollector on cuda:0 mean: 463.8032136205446 |
Weird, let me look into it. |
Yes, the SyncDataCollector is using ParallelEnv inside. |
I'm trying to revert to the previous version to see if it's my environment that changed or something in TorchRL. |
Could be due to #1532 or something like that... |
Oh, have you tried torch.set_num_threads(1)? If that solves your problem, I can write a set_num_threads decorator to be used with torchrl (because I don't think that changing that in the main process every time you load torchrl is very wise :) ) |
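For reference, a decorator of that kind could look like the sketch below. It is deliberately framework-agnostic: the `setter`/`getter` pair stands in for whatever API controls the thread count (e.g. `torch.set_num_threads` / `torch.get_num_threads`), and the previous value is restored afterwards. This is an illustration, not TorchRL's actual implementation.

```python
import functools

def limit_num_threads(n, setter, getter):
    """Run the wrapped function with the thread count pinned to `n`,
    then restore the previous value. `setter`/`getter` would be e.g.
    torch.set_num_threads / torch.get_num_threads."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            previous = getter()
            setter(n)
            try:
                return fn(*args, **kwargs)
            finally:
                setter(previous)  # undo the change on exit
        return wrapper
    return decorator

# Demo with a stand-in thread-count registry instead of torch itself.
_state = {"threads": 8}
set_threads = lambda n: _state.update(threads=n)
get_threads = lambda: _state["threads"]

@limit_num_threads(1, set_threads, get_threads)
def collect():
    return _state["threads"]  # runs with 1 thread configured

assert collect() == 1
assert get_threads() == 8  # restored afterwards
```

Restoring the old value on exit is what makes this safer than mutating the global thread count at import time, which is the concern raised above.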
I think you saved me hours/days of debugging and research time 🥹🥹 |
|
Cool! So that corroborates what I thought:
|
Btw, I reran the benchmark with 3-month-old TorchRL/TensorDict versions and got the same issue when not specifying the number of threads. I think it's probably a change in the cluster I'm using. |
Hi guys, #1602 contains a simple script we used for the paper to benchmark against gym async collector. Feel free to run it and give us feedback. |
It's a rather long and thorough thread (thanks for it!) so I'll answer piece by piece.
That doesn't surprise me (unfortunately).
In this case I think the best could be to simply code your environment from EnvBase. It could be simpler.
There's a lot more we need to account for there, and every time someone comes up with an env-specific bug we need a patch that eats up a bit of compute time... Something odd (IMO) is that many users, even close collaborators, only see our envs as "wrappers", when wrappers are only a fraction of what envs are. I'm super biased, so take this with a massive pinch of salt, but in some way I think EnvBase could serve as a base for pytorch-based envs like
Regarding
That's communication overhead, I think: simply writing and reading tensors. We made good progress speeding this up (e.g. using mp.Event in parallel processes) but there's always more to do! If you don't need sync data (e.g. off-policy), MultiaSync is usually faster than MultiSync. Back in the day, sharing tensors with CUDA devices was way faster than CPU (shared mem), but for some reason the balance has now shifted to cpu. I have no idea why! Is there any way you could share the script and the method you're using for benchmarking, for reproducibility? Thanks! |
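For what it's worth, the FPS figures quoted throughout this thread can be produced with a timer as simple as the following sketch, where `step_batch` stands in for one iteration of whichever collector or vectorized env is being measured (`mean_fps` is a hypothetical helper, not the benchmark script used here):

```python
import time

def mean_fps(step_batch, frames_per_batch, total_frames):
    """Run step_batch() until total_frames frames are collected and
    return the mean throughput in frames per second."""
    frames = 0
    start = time.perf_counter()
    while frames < total_frames:
        step_batch()  # one collector / vector-env iteration
        frames += frames_per_batch
    return frames / (time.perf_counter() - start)

# Example: a no-op "collector" just measures the loop overhead.
fps = mean_fps(lambda: None, frames_per_batch=80, total_frames=8_000)
assert fps > 0
```

Measuring both implementations with the same timer, the same `frames_per_batch`, and the same policy is what makes numbers like the ones above comparable.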
Here is a complete reproduction script. It requires installing:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['VECLIB_MAXIMUM_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
from functools import reduce, partial
import numpy as np
from tqdm import tqdm
from gym_jiminy.envs import AtlasPDControlJiminyEnv
from gym_jiminy.common.wrappers import (FilterObservation,
NormalizeAction,
NormalizeObservation,
FlattenAction,
FlattenObservation)
from torchrl.collectors import MultiSyncDataCollector
from torchrl.envs.libs.gym import GymWrapper
from torchrl.envs import EnvCreator
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
N_ENVS = 16
N_WORKERS = 16
class ZeroPolicy:
    def __init__(self, action_spec, action_key="action"):
        self.action_spec = action_spec
        self.action_key = action_key

    def __call__(self, td):
        return td.set(self.action_key, self.action_spec.zero())

if __name__ == '__main__':
    # Fix weird issue with multiprocessing
    __spec__ = None

    # Define the learning environment
    gym_make = lambda: reduce(
        lambda env, wrapper: wrapper(env),
        (
            partial(FlattenObservation, dtype=np.float32),
            NormalizeObservation,
            partial(FlattenAction, dtype=np.float32),
            NormalizeAction,
        ),
        FilterObservation(
            AtlasPDControlJiminyEnv(),
            nested_filter_keys=(
                ('states', 'pd_controller'),
                ('measurements', 'EncoderSensor'),
                ('features', 'mahony_filter'),
            )
        )
    )
    env_creator = EnvCreator(
        lambda: GymWrapper(gym_make(), device="cpu"))

    # Instantiate a dummy environment
    dummy_env = env_creator()

    # Instantiate and configure the data collector
    collector = MultiSyncDataCollector(
        N_WORKERS * (env_creator,),
        ZeroPolicy(dummy_env.action_spec),
        frames_per_batch=80000,
        total_frames=4000000,
        # preemptive_threshold=0.8,
        device="cpu",
        storing_device="cpu",
        # num_threads=1,
        num_sub_threads=1
    )
    frames_per_batch = collector.frames_per_batch_worker * collector.num_workers

    # Collect data
    pbar = tqdm(total=collector.total_frames, unit=" frames")
    for data_splitted in collector:
        pbar.update(frames_per_batch)

    # Stop the data collector
    collector.shutdown()

I ran it on a new machine (Apple M3 Max) with the latest |
After looking more closely, I realised that most of the slowdown was coming from the generic wrapper around gymnasium environments rather than from interprocess communication: if I run the same benchmark with a single
It shows that at least 43.26s is spent in torchrl for a total running time of 290.6s, namely 15%. I guess it does not sum up to the 30% I mentioned earlier because profiling distorts the statistics. From this standpoint, it is not clear to me whether anything can be done to speed things up. |
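One way to estimate this kind of per-package share from a profile is to sum the self-time of every frame whose filename contains the package name. Below is a stdlib sketch of that approach (the 15% above was measured by the commenter with their own tooling, not with this exact code):

```python
import cProfile
import pstats

def package_share(fn, package="torchrl"):
    """Profile fn() and return (seconds spent inside `package`, total
    profiled seconds), summing each matching frame's self-time."""
    prof = cProfile.Profile()
    prof.enable()
    fn()
    prof.disable()
    stats = pstats.Stats(prof)
    in_pkg = sum(
        timing[2]  # tottime: time spent in the frame itself
        for (filename, _line, _name), timing in stats.stats.items()
        if package in filename
    )
    return in_pkg, stats.total_tt

# Example: no frame lives in a package named "no_such_pkg", so the
# package share is zero while the total is non-negative.
in_pkg, total = package_share(lambda: sum(i * i for i in range(100_000)),
                              package="no_such_pkg")
```

Note that, as pointed out above, the profiler itself skews the timings, so a share computed this way is a lower bound on the real overhead rather than an exact figure.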
Good to know thanks for investigating this! |
In my view, there is no such thing as a "next gen of simulators" in the real world. Both single cpu-based and vectorized gpu simulators are still and will remain relevant for their own respective applications.

First, only classic cpu mode is relevant for all but RL applications, since in the vast majority of use cases you are only willing to run a single simulation at a time, not to mention critical embedded software. Yet, it is critical to use the same simulator over the whole pipeline, from RL training to classical offline planning algorithms and online model-based predictive control. Not only because it is the only way to make a fair comparison between methods without doing real experiments, but also because fine-tuning several simulators to make them as realistic as possible for a given use case is too much effort. Since I don't think it is realistic to expect a simulator to support both vectorized gpu mode and classic cpu mode, classic cpu-based simulation may be the only viable option in practice.

Apart from that, in various real-world training scenarios, the actual simulated system may change internally between episodes, for instance to challenge the same policy on different models of the same physical platform (e.g. broken parts) as an advanced form of domain randomization. In such a case, batched gpu simulation is not applicable.

Next, running cpu-based simulations in parallel is already fast enough for real-world R&D on complex systems such as humanoid robotics. For instance, it takes only 1h to collect 100M timesteps on a Macbook Pro M3 when training locomotion on Boston Dynamics' Atlas robot. No need to go faster if you cannot iterate faster, because analyzing the results is time-consuming anyway.

Finally, many complex algorithms are not yet ready to be integrated in batched gpu simulators, e.g. complex mesh-mesh collision detection algorithms, so if you want to perform a very realistic simulation you need to fall back to a cpu-based simulator.
Mujoco and Isaac certainly didn't rise to fame on the basis of how realistic they are. To wrap up, cpu-based simulation is definitely nowhere near dead to me. |
I second @duburcqa's comment. Simulators like Isaac and the new MuJoCo are awesome if you're doing pure RL algorithm development, but seem less useful for applied RL research where you spend most of your time making custom environments. In my use case, CPU environments provide a reasonable trade-off between simulation speed and development time early in the design process, which is an important first step that should not be overlooked IMO. Just my perspective from the applications side of things. |
I also agree. Especially in the field of robotics it is always important to remember that the real world is not vectorized and online learning in the real world is something that will gain increasing attention. |
Thanks all for the valuable feedback! Those are really valid points. So what's your take on this topic for torchrl then? What's the best way forward? The overhead observed by @duburcqa is hard to solve because gym is not very explicit about what it returns: unlike torchrl, I can't tell in advance whether I will have an info dict or not, or whether my obs dictionary is complete or not; I can't even tell if my reward is a float or a numpy.ndarray... For these reasons we have to do multiple checks. It was once suggested to me that we could do these checks for the first iteration and then stick by them with some sort of compiled code, but I don't really see how to make that happen in a simple way. One option is to document how to write a custom gym wrapper with no checks to improve the runtime.
For sure but I don't think that applies to this case, where you'd wrap a gym env in torchrl and do checks over the types and devices etc. If you're working with a robot you will most likely have your own environment tailored for that use case. In other words, I don't think that this impacts whether or not we should dedicate a lot of effort to bridge a potential 20% runtime gap compared to gym async envs. |
@duburcqa I forgot to ask: are you using tensordict nightlies or the latest stable version? |
All of this could be done only once, at init, since it is reasonable to expect that types do not change over steps. This way, it would not add any runtime cost. Ideally, the whole computation path should be defined statically once and for all, then called whenever necessary. Here is an example where I do this. I agree it is quite tricky to implement, but the performance benefit can be very significant on the hot path. Still, maybe it is not necessary to go this far, and there is a trade-off between a fully static computation path and a fully runtime path.
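The idea of deciding the conversion path once and reusing it can be sketched as follows: a hypothetical wrapper inspects the env's `step()` return on the first call, builds a specialized converter, and skips all checks on later calls. Only the 4-tuple vs 5-tuple case is handled here, with legacy `done` mapped to `terminated` and `truncated` assumed `False`; a real implementation would also need to cover info dicts, reward types, and so on.

```python
class CheckOnceWrapper:
    """Hypothetical wrapper: type checks run on the first step only,
    then a cached converter handles every subsequent call."""

    def __init__(self, env):
        self.env = env
        self._convert = None

    def _build_converter(self, result):
        # Decide once which step() API the env follows.
        if len(result) == 5:  # (obs, reward, terminated, truncated, info)
            return lambda r: r
        # Legacy 4-tuple (obs, reward, done, info): treat done as
        # terminated and assume the episode was not truncated.
        return lambda r: (r[0], r[1], r[2], False, r[3])

    def step(self, action):
        result = self.env.step(action)
        if self._convert is None:  # first call: build the fast path
            self._convert = self._build_converter(result)
        return self._convert(result)

# Demo with two fake envs, one per API flavour.
class LegacyEnv:
    def step(self, action):
        return ("obs", 1.0, True, {})

class ModernEnv:
    def step(self, action):
        return ("obs", 1.0, False, True, {})

assert CheckOnceWrapper(LegacyEnv()).step(0) == ("obs", 1.0, True, False, {})
assert CheckOnceWrapper(ModernEnv()).step(0) == ("obs", 1.0, False, True, {})
```

After the first call, the only extra cost per step is one `None` comparison and one indirect call, which is exactly the middle ground between a fully static and a fully dynamic path.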
I'm clearly fine with it !
I agree this is not the most convincing argument.
The latest stable, but I could use something else if you want. |
For the record, here are the results of my benchmark for the latest release (
It shows that 38.66s is spent in torchrl/tensordict methods for a total running time of 291.0s, namely 13.3%. It was 43.26s (14.8%) on |
What do you include in those 38s? |
The |
Got it. Besides, correct me if I'm wrong, but I think that being around 10% of runtime for torchrl is a hit that most people are ready to take, and this for two reasons:
|
Yes, indeed. I don't think it is worth the effort at this point. Yet, the issue with profiling is that it alters the original timing: the slowdown I observe from torchrl data collection, compared with running episodes without torch and throwing away all the samples, is twice as large with profiling disabled as with it enabled. I need to check again, but I'm expecting an actual slowdown closer to 25% than 13% on a real use case.
I completely agree. As I said, it was acceptable before and it still is. IMHO, the rationale is whether or not torchrl is competitive against other RL libraries targeting similar problems. In practice, I was stuck with |
Describe the bug
Hello, I'm performing experiments that use a relatively small number of parallel environments (8-16). Using the PongNoFrameskip-v4 environment with no wrappers, it seems that TorchRL is 4-5x slower than Gym's AsyncVectorEnv (2600 vs 11000 FPS) with a random policy. Given the throughput results in Table 2 of the paper, I would expect comparable performance. Am I setting up the environments incorrectly?
To Reproduce
This is a very simple adaptation of the script in examples/distributed/single_machine/generic.py. Although it's not shown here, I observe similar performance with ParallelEnv and a synchronous collector.
System info
TorchRL installed via pip (v0.1.1)