
[Feature] Gym 'vectorized' envs compatibility #1519

Merged: vmoens merged 20 commits into main from gym_vecenvs on Sep 17, 2023
Conversation

@vmoens (Contributor) commented Sep 12, 2023

@facebook-github-bot added the CLA Signed label Sep 12, 2023
@vmoens added the enhancement label Sep 12, 2023
@vmoens (Contributor Author) left a comment:

@albertbou92 @matteobettini should we add a "_reset" key too?
IMO the transforms can take care of it but we should test it

Comment on lines 246 to 250
# return (
# MultiDiscreteTensorSpec(spec.nvec, device=device, dtype=dtype)
# if categorical_action_encoding
# else MultiOneHotDiscreteTensorSpec(spec.nvec, device=device, dtype=dtype)
# )
@vmoens (Contributor Author):

to be removed

Suggested change
# return (
# MultiDiscreteTensorSpec(spec.nvec, device=device, dtype=dtype)
# if categorical_action_encoding
# else MultiOneHotDiscreteTensorSpec(spec.nvec, device=device, dtype=dtype)
# )

@matteobettini (Contributor) commented Sep 12, 2023:

Wait, why would it be removed? This is actually very needed and important
(it is used in MPE and VMAS) (although not supported completely in our losses :( )

@vmoens (Contributor Author):

what's the difference between that and a stack of discrete specs?

@vmoens (Contributor Author):

The thing is that a MultiDiscrete(2, 2) has a batch size of 2, but it is being translated to a MultiOneHot of size (4,), which is batch-size incompatible.

@matteobettini (Contributor):

There are envs where you need to select two actions at the same time (let's say one for movement and one for comms).

Those 2 actions might have different spaces: for example, in VMAS, movement has a space of 4 and comms can be anything (for example 10). Why would we stack this?

If the problem here is that they are using MultiDiscrete to indicate batched envs, then I think they overloaded MultiDiscrete, because that should just be a batched Discrete.
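For reference, a minimal gymnasium illustration of that use case (a sketch; the sizes follow the example above):

from gymnasium.spaces import MultiDiscrete

# Two simultaneous actions with different cardinalities:
# 4 movement actions and 10 communication actions.
action_space = MultiDiscrete([4, 10])
print(action_space.nvec.shape)  # (2,)
print(action_space.sample())    # e.g. array([2, 7])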

@vmoens (Contributor Author) commented Sep 12, 2023:

This works perfectly

disc = MultiDiscrete([[2, 3], [4, 5]])
spec = _gym_to_torchrl_spec_transform(disc, categorical_action_encoding=True)
spec.rand()

With categorical_action_encoding=False it breaks.

We can fall back on MultiOneHot or similar, but the main goal should be that the shape of MultiDiscrete is unchanged after conversion.

why would we stack this?

I'm not saying you need this, it's a proposed change. We need this for async envs in gym, which is something important to support. I think that using MultiOneHot when we can just have a batched discrete makes little sense, so let's try to see how we can make that happen.

Quick question: what would you expect MultiDiscrete([[2, 3], [4, 5]]) to translate to as a torchrl spec?
A MultiOneHot will be hard to work with because you don't have a consistent last dim (one is 2 + 4 and the other 3 + 5, if I'm not wrong), so you can't stack these together -> you get a [2, *undefined*] shape.
Would it make sense to restrict this to non-one-hot (i.e. discrete) in these cases?

@vmoens (Contributor Author):

Here's a comprehensive example after my latest push

from gymnasium.spaces import MultiDiscrete

from torchrl.envs.libs.gym import _gym_to_torchrl_spec_transform

disc = MultiDiscrete([[2, 3], [2, 3]])

print("multi-discrete, gym", disc)
print("nvec shape", disc.nvec.shape)
spec = _gym_to_torchrl_spec_transform(disc, categorical_action_encoding=True)
print("spec, discrete", spec)
print("sample", spec.rand())
print("spec shape", spec.shape)
spec = _gym_to_torchrl_spec_transform(disc, categorical_action_encoding=False)
print("spec, one-hot", spec)
print("sample", spec.rand())
print("spec shape", spec.shape)

disc = MultiDiscrete([[2, 3], [4, 5]])

print("multi-discrete, gym", disc)
print("nvec shape", disc.nvec.shape)
spec = _gym_to_torchrl_spec_transform(disc, categorical_action_encoding=True)
print("spec, discrete", spec)
print("sample", spec.rand())
print("spec shape", spec.shape)
spec = _gym_to_torchrl_spec_transform(disc, categorical_action_encoding=False)  # breaks
print("spec, one-hot", spec)
print("sample", spec.rand())
print("spec shape", spec.shape)

The output looks like this:

multi-discrete, gym MultiDiscrete([[2 3]
 [2 3]])
nvec shape (2, 2)
spec, discrete MultiDiscreteTensorSpec(
    shape=torch.Size([2, 2]),
    space=BoxList(boxes=[BoxList(boxes=[DiscreteBox(n=2), DiscreteBox(n=2)]), BoxList(boxes=[DiscreteBox(n=3), DiscreteBox(n=3)])]),
    device=cpu,
    dtype=torch.int64,
    domain=discrete)
sample tensor([[0, 0],
        [1, 2]])
spec shape torch.Size([2, 2])
spec, one-hot MultiOneHotDiscreteTensorSpec(
    shape=torch.Size([2, 5]),
    space=BoxList(boxes=[DiscreteBox(n=2), DiscreteBox(n=3)]),
    device=cpu,
    dtype=torch.int64,
    domain=discrete)
sample tensor([[0, 1, 1, 0, 0],
        [0, 1, 0, 1, 0]])
spec shape torch.Size([2, 5])
multi-discrete, gym MultiDiscrete([[2 3]
 [4 5]])
nvec shape (2, 2)
spec, discrete LazyStackedMultiDiscreteTensorSpec(
    shape=torch.Size([2, 2]), device=cpu, dtype=torch.float32, domain=discrete)
sample tensor([[1, 0],
        [0, 0]])
spec shape torch.Size([2, 2])
spec, one-hot LazyStackedMultiOneHotDiscreteTensorSpec(
    shape=torch.Size([2, -1]), device=cpu, dtype=torch.float32, domain=discrete)

@matteobettini what should we do with the example that breaks?

@vmoens (Contributor Author) commented Sep 13, 2023:

To resolve this issue:

  • Can you provide an example of where you expect that things will break and/or the user experience will be negatively impacted? Here are my 2 cents:
    • MultiOneHot is a giant hack. Per se, you work with a tensor that is strongly tied to its spec. Without the spec, you cannot know what this tensor represents: the only instance where you know how to split it is in cases like [False, False, True, True, False]. So because with MultiOneHot you must use the spec wherever you use the tensor, when you use the simplified version (OneHot + batch size) you will also have the spec at hand (I presume your code base handles the most complex use case, so the simpler one falls under the same requisites). Therefore, anything you do with your tensor can probably be expressed with methods from the spec (e.g. spec.encode, spec.to_numpy etc.) and you very rarely interact directly with the tensor. I do not see anywhere that a control flow would be added like if isinstance(spec, MultiOneHot): ... ; else ....
    • Along the same line, it isn't easy to work with MultiOneHot, and I'm not sure I've ever seen such a data structure anywhere other than torchrl. So from a UX perspective, restricting its usage to edge cases isn't a great price to pay.
    • It is also important to acknowledge what percentage of users (roughly) will be impacted by which decision. MultiOneHot is to be used by the following subset of users: those who work with more than one agent/env + have a heterogeneous action space + want to use one-hot encoding (and not discrete). Counting each of these as roughly 50% of users, deciding that the behaviour changes when agents are homogeneous vs heterogeneous impacts at most 0.5 ** 3 = 0.125 of our user base (and this assumes there actually is a negative impact for these people, but I haven't really understood what the negative impact is yet).
    • I see the point that a different config will create different classes depending on homogeneity. On the other hand, this is the case in many paradigms (e.g. pytorch dataloaders are different classes if num_workers=0 or num_workers=2, and people usually don't complain about it much because the way they work with the dataloader isn't impacted).

The most important thing is what solution we can provide to this problem.
What is the way forward for you? Here are the options I see:

  • Leave as is. This means that with Gymnasium, vectorized envs will have a MultiOneHot(num_envs * num_action) spec, which we will have a hard time fitting in the env.
  • Have distinct behaviours for gymnasium, gym and other libs like VMAS. This would be a high price to pay in terms of UX: we'd need to document that with gym one should expect the conversion to be X->Y, with VMAS X->Z and with gymnasium X->W. We'd gain consistency within a domain (e.g. MARL) but lose it across libs, which is one feature we want to promote about torchrl I think.
  • Use the proposal with the following caveats (see the sketch after this list):
    • If a MultiOneHot is created, the user is informed once per execution that this is the spec they will get. We explicitly say why they should not use one-hot (performance is poorer, it degrades as the action space increases, and stacking heterogeneous specs creates a structure that is hard to work with).
    • In a not-so-distant future we switch the default to discrete and make batched discrete fall back onto multi-discrete (I think that's already the case).
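A minimal sketch of the "inform once" idea (the flag and helper names are hypothetical, not this PR's actual implementation):

import warnings

_ONE_HOT_WARNED = False  # hypothetical module-level flag

def _maybe_warn_one_hot():
    # Emit the warning at most once per process when a MultiOneHot spec is built.
    global _ONE_HOT_WARNED
    if not _ONE_HOT_WARNED:
        warnings.warn(
            "A MultiOneHotDiscreteTensorSpec was created. One-hot encodings are "
            "slower, degrade as the action space grows, and stacks of "
            "heterogeneous one-hot specs are hard to work with; consider "
            "categorical_action_encoding=True.",
            UserWarning,
        )
        _ONE_HOT_WARNED = True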

@matteobettini (Contributor) commented Sep 13, 2023:

I think the currently proposed solution is the best way to handle gymnasium vectorised environments that use MultiDiscrete spaces to represent the vectorisation dimension.
We can take this route with some precautions, as discussed, for example:

  • Document the spec translation mechanism
  • Default to categorical action encoding

Example

My example of a potentially surprising behaviour is the following (it is in the single-agent domain, so it is not related to MARL):

spec = gym.spaces.MultiDiscrete([n1, n2])  # where n1 is, for example, the number of movement actions and n2 the number of communication actions

Now, when this is translated in torchrl, with action encoding set to one-hot, it will give the following specs:

if n1 == n2:
    OneHotTensorSpec(shape=(2, n1)) 
else:
    MultiOneHotTensorSpec(shape=(n1+n2))

So simplifying the spec to a OneHot in case the action sizes are the same.

That said, I am on board with the current solution for gymnasium spec translation, as in the above example we do not have the tools to understand whether we are seeing a single environment with multiple discrete actions or a vectorized environment with a single discrete action.

So we can go ahead with the feature.

@vmoens (Contributor Author):

Cool, I'll check some edge cases and ping you with them!

Comment on lines 229 to 233
# dtype = (
# numpy_to_torch_dtype_dict[spec.dtype]
# if categorical_action_encoding
# else torch.long
# )
@vmoens (Contributor Author):

Suggested change
# dtype = (
# numpy_to_torch_dtype_dict[spec.dtype]
# if categorical_action_encoding
# else torch.long
# )

Comment on lines +234 to +243
return torch.stack(
[
_gym_to_torchrl_spec_transform(
spec[i],
device=device,
categorical_action_encoding=categorical_action_encoding,
remap_state_to_observation=remap_state_to_observation,
)
for i in range(len(spec.nvec))
],

@matteobettini (Contributor):

Yeah, it is what we decided in our very long thread.

I think this makes sense to me.

-    return super()._build_env(env, pixels_only=pixels_only, from_pixels=from_pixels)
+    env = super()._build_env(env, pixels_only=pixels_only, from_pixels=from_pixels)
+    if num_envs > 0:
+        return self._async_env([CloudpickleWrapper(lambda: env)] * num_envs)
@vmoens (Contributor Author):

A bit hacky, but we need this to make it work with wrappers like the pixels-related ones.

@@ -144,7 +144,7 @@ def read_done(self, done):
             done (np.ndarray, boolean or other format): done state obtained from the environment

         """
-        return done, done
+        return done, done.any() if not isinstance(done, bool) else done
@vmoens (Contributor Author):

I don't particularly like it.
One aspect to consider is when we're doing our skip_frame but one of the envs is done and not the others: then we stop the loop. This means that a bunch of envs silently don't complete the loop. Not sure what to do, because we can't really keep applying the same action to the other envs if they're in auto-reset.

One solution is that if a non-complete done is encountered, we raise a warning telling users to use the skip-frame transform instead.

@albertbou92 (Contributor) commented Sep 12, 2023:

I think the situation you described is too common, and silently not completing the loop is dangerous.

A bit hacky, but can't we create some kind of NO-OP step wrapper (e.g. if the action is -1, just return the last obs, reward etc., otherwise do a normal step)? And overwrite the actions of the envs that have reached the end of the episode before the skip_frame actions are taken; see the sketch below.

You do the loop skip_frame times for all envs, but for early-finishing envs a useless step will be called at the end.
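A minimal sketch of that NO-OP idea (a hypothetical wrapper, assuming a discrete action space and -1 as the sentinel value):

import gymnasium as gym

class NoOpStepWrapper(gym.Wrapper):
    """Hypothetical wrapper: a sentinel action (-1) replays the last
    transition with zero reward instead of stepping the sub-env."""

    def __init__(self, env):
        super().__init__(env)
        self._last = None

    def step(self, action):
        if action == -1 and self._last is not None:
            # No-op: return the cached observation without advancing the env.
            obs, _, terminated, truncated, info = self._last
            return obs, 0.0, terminated, truncated, info
        self._last = self.env.step(action)
        return self._last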

@vmoens (Contributor Author):

What about disabling frame_skip for num_envs > 0 and just saying that one needs to use the transform instead?

@albertbou92 (Contributor) commented Sep 13, 2023:

Oh, I always thought frame_skip as a transform was more consistent with the rest of the library. I thought there was some problem that made it less efficient or something, and that was why the option was also given in the env. But yes, if that is not the case, I think using the transform is the best way.

Contributor:

@vmoens quick heads-up here: I think the frame_skip transform did not handle truncation/termination correctly the last time I checked. Also, some envs natively handle frame_skip and run much faster that way, as they don't have to render the skipped frames, e.g. ALE-v5, which skips frames internally via the frameskip keyword in the constructor.

github-actions bot commented Sep 12, 2023:
⚠ Warning: Result of CPU Benchmark Tests

Total Benchmarks: 89. Improved: 14. Worsened: 3.

Largest changes (abridged from the detailed per-benchmark results):

| Name | Mean | Ops | Ops on Repo HEAD | Change |
| --- | --- | --- | --- | --- |
| test_ddpg_speed | 2.7881ms | 358.6670 Ops/s | 208.2320 Ops/s | +72.24% |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 2.7047ms | 369.7229 Ops/s | 291.8763 Ops/s | +26.67% |
| test_reinforce_speed | 4.0086ms | 249.4608 Ops/s | 210.2432 Ops/s | +18.65% |
| test_iql_speed | 20.8162ms | 48.0394 Ops/s | 40.7590 Ops/s | +17.86% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 2.8923ms | 345.7470 Ops/s | 298.2064 Ops/s | +15.94% |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 26.8233ms | 37.2810 Ops/s | 32.8302 Ops/s | +13.56% |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.3175ms | 301.4292 Ops/s | 342.4683 Ops/s | -11.98% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 29.2944ms | 34.1362 Ops/s | 36.9227 Ops/s | -7.55% |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 29.4381ms | 33.9695 Ops/s | 36.1432 Ops/s | -6.01% |

@vmoens linked an issue Sep 12, 2023 that may be closed by this pull request
@vmoens (Contributor Author) commented Sep 12, 2023:

On my MacBook, I get to 14-16K fps with 4 async collectors x 8 envs/collector.

With a sync collector I reach 20K fps.

Compare that with just an async env of 32 workers, which plateaus at 12K fps.

@matteobettini (Contributor) commented Sep 12, 2023:

should we add a "_reset" key too?

Without that, how would you be able to do a reset?
You need to reset a batched env as soon as it is done, no? Or does gym take care of that internally?
What does it use to know what's done?


@vmoens closed this Sep 12, 2023
@vmoens reopened this Sep 12, 2023
@vmoens (Contributor Author) commented Sep 13, 2023:

Without that, how would you be able to do a reset? You need to reset a batched env as soon as it is done, no? Or does gym take care of that internally? What does it use to know what's done?

@matteobettini
I think in most of the code base we define the reset at the point where we read the done. It's internal-facing (hence the _ prefix), so I don't think it should be part of the output here. Any idea where it would break?

@matteobettini (Contributor) commented:

Any idea where it would break?

I have now read the docs at https://gymnasium.farama.org/api/vector/.
It seems that the vector env auto-resets the done sub_envs internally.

Given this, we might need to be careful about a few things:
- any call to env.reset() in a gym vectorised env will reset all the sub_envs in it, and it does not take any flag to select which ones to reset.
- our calls to env.reset()

td_reset = self.env.reset(td_reset)

are passing the internal _reset flag that contains the sub_envs to reset. It seems to me that in the wrapper for gym vector envs we should call env.reset() only if all the _reset dimensions are set, because all the other cases (partial resets) are handled internally; see the sketch below.

Does this make sense? I might be missing some gymnasium knowledge, but this is what I gathered from the docs.
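A sketch of that guard (the function and argument names are assumptions, not the PR's actual code):

import torch

def forward_reset(vector_env, reset_mask: torch.Tensor):
    # Only delegate to the gym vector env when every sub-env must be reset;
    # partial resets are handled internally by gym's auto-reset mechanism.
    if reset_mask.all():
        return vector_env.reset()
    if reset_mask.any():
        raise RuntimeError(
            "Partial resets are not supported: gym vectorized envs "
            "auto-reset done sub-envs internally."
        )
    return None  # nothing to reset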

@vmoens (Contributor Author) commented Sep 13, 2023:

Any idea where it would break?

I have now read the docs at https://gymnasium.farama.org/api/vector/. It seems that the vector env auto-resets the done sub_envs internally.

This isn't great IMO: does it mean that we must tell collectors to ignore done states then? I was hoping to deprecate that feature in the next release haha.

EDIT: It's actually even worse than that. Say a user wants to use a data collector, with or without an async gym env. She needs to dig into both classes to understand that the auto-reset of the collector conflicts with the one of the env. So many silent bugs can arise from this! It isn't really about torchrl but about any loop like

for _ in range(N):
    obs, ... done, ... = env.step()
    if any(done):
        env.reset(done)

That seems quite generic to me. The fact that an env behaves differently when executed in parallel or not is unexpected and weird IMO. But I might be missing something. Curious about your thoughts @pseudo-rnd-thoughts
In particular, I don't understand why Gymnasium doesn't simply allow users to reset envs themselves, through a mask or a list of integers. reset and step are literally the only 2 building blocks of a gym env; taking one away seems a bit distressing to me.
Don't get me wrong: had it been the convention from the beginning that resets are performed automatically, I would have been ok with the feature, but here we have envs executed in batch that behave differently from envs executed alone, hence introducing unnecessary control flow in many users' codebases.

For the GymWrapper, regarding this line, I think we can just assert that the resets are either all true or all false, and if that isn't the case raise an exception saying that it isn't supported (i.e., one cannot partially reset a vectorized env).

What about frame-skip? Because we can't control the reset and we can't avoid doing steps in some sub-envs, that prevents any implementation of frame-skip at our level, right? I think both the internal frame-skip and the transform will be unable to handle that...

@albertbou92 want to chime in and share your thoughts?

@vmoens (Contributor Author) commented Sep 13, 2023:

EDIT: It's actually even worse than that. Say a user wants to use a data collector, with or without an async gym env. She needs to dig into both classes to understand that the auto-reset of the collector conflicts with the one of the env. So many silent bugs can arise from this! It isn't really about torchrl but about any loop like

One possible solution is to have (yet another!) attribute on the env, like env.auto_reset, that tells the collector whether or not the env resets automatically. That could also be used to determine whether frame-skip can be used.
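For instance, a collector-side check could look like this (a sketch; auto_reset is the hypothetical attribute proposed above, not an existing TorchRL API):

def needs_manual_reset(env) -> bool:
    # Collectors would skip their own reset logic (and disallow
    # frame-skip) for envs that reset themselves.
    return not getattr(env, "auto_reset", False)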

@vmoens (Contributor Author) commented Sep 13, 2023:

Another issue is that the auto-reset screws up the next observation of the last step of the env, which can instead be found in info["terminal_observation"]. I'm not convinced this is easier for users to handle than gathering that obs as they always do and calling reset on their own, but I don't have any agency over these design choices.
So we also need a custom info reader to swap in the next obs in these cases, but I'm not sure how we're supposed to write the actual obs at the root...
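A sketch of what such an info reader could do (the function name is hypothetical; the key holding the terminal observation varies across versions, "final_observation" in gymnasium vector envs and "terminal_observation" in older APIs):

import numpy as np

def patch_final_obs(obs: np.ndarray, info: dict) -> np.ndarray:
    final = info.get("final_observation", info.get("terminal_observation"))
    if final is None:
        return obs
    obs = obs.copy()
    for i, terminal_obs in enumerate(final):
        if terminal_obs is not None:
            obs[i] = terminal_obs  # restore the true last observation
    return obs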

@albertbou92 (Contributor) commented Sep 13, 2023:

The fact that gymnasium vector environments auto-reset makes frame-skipping behave differently, but IMO it will not change training behavior that much: it will apply the same action N times with or without a reset in between, as opposed to the normal behavior of ending the loop when the end of the episode is reached. However, I agree it seems to complicate the code here and there and messes with the current code.

Ideally I would think about implementing our own version and allowing manual resets of subsets of environments, but maybe it is overkill.

@vmoens (Contributor Author) commented Sep 13, 2023:

This code should now work as of c798307:

import torch

from torchrl.envs.libs.gym import GymEnv

if __name__ == "__main__":
    env = GymEnv("CartPole-v1", num_envs=4)
    print("env", env)
    r = env.rollout(100, break_when_any_done=False)
    print(r)
    done = r[:, :-1]["next", "done"].squeeze(-1)
    assert done.any()
    # data resulting from step, when it's not done
    r_not_done = r[:, :-1]["next"][~done]
    # data resulting from step, when it's not done, after step_mdp
    r_not_done_tp1 = r[:, 1:][~done]
    torch.testing.assert_close(
        r_not_done["observation"],
        r_not_done_tp1["observation"]
        )

    # data resulting from step, when it's done
    r_done = r[:, :-1]["next"][done]
    # data resulting from step, when it's done, after step_mdp and reset
    r_done_tp1 = r[:, 1:][done]
    assert ((r_done["observation"] - r_done_tp1["observation"]).norm(
        dim=-1
        ) > 1e-1).all(), (
            r_done["observation"] - r_done_tp1["observation"]).norm(dim=-1)

making the behaviour consistent with other envs

cc @matteobettini @albertbou92

@pseudo-rnd-thoughts commented Sep 14, 2023:

Sorry, I'm a bit confused by your comment.

Currently, Gym and Gymnasium's async and sync vector implementations have autoreset enabled with no way of disabling it.
It would be possible to add a parameter to disable it, and I wouldn't have an issue with that; very happy to accept a PR. An alternative solution would be for you to implement this yourself, but I understand this is a lot of work.

For the frame skipping: to me, doing this at the vector level is not a good idea; rather, it should be done at the sub-environment level. You should be able to apply wrappers to any of the sub-environments used in our async and sync vector environments.

Now for the interesting news: in v1.0.0 (soon to be released), we have separated Env and VectorEnv (previously VectorEnv inherited from Env, which made type checking difficult, and some wrappers worked with a vector env while others did not; separating them clears this up). This shouldn't cause any strong issues for you, I suspect. The annoyance is more that we are strongly considering modifying how the autoreset is achieved (and most likely will, unless there is a reason not yet considered). See Farama-Foundation/Gymnasium#694 for all of the details; in short, we are moving from autoresetting within the same step in which an episode ends to resetting on the next step. I'm not sure how much this affects the current conversation, but I thought I would note it.

I find it complex to work out what is happening in vector environments, and I'm happy to talk about this on Discord; otherwise, I go up to London fairly regularly if it is easier to talk in person.

@vmoens (Contributor Author) commented Sep 14, 2023:

Thanks for this @pseudo-rnd-thoughts, that's very useful.
My very personal view is that auto-reset isn't as desirable as it seems. It hampers flexibility, and things like knowing what the actual observation resulting from my action is, or implementing frame-skip, become harder. It also breaks the consistency one could hope for between single and multi-processed envs (i.e., it feels weird that CartPole executed on 2 processes does not behave like CartPole on one).
I also noticed that info["terminal_observation"] can return a list of None plus the final observation, or just a plain None, again introducing some inconsistency that is hard to deal with without appropriate control flow, which makes code bases harder to test and generalize. For instance, an observation space encoded as a Dict (like in gymnasium-robotics) gives you a Dict with batched numpy arrays as values, but the final observation is a list of dicts, so the nesting order is flipped. This is another example of inconsistency that is hard to foresee without running the env yourself.
I can see the convenience and speedup it can bring, but the price to pay seems a bit high.

I've made the workaround in this PR, and now torchrl sort of "ignores" the auto-reset by caching the data resulting from the reset and delivering the data from the info. When calling reset with the appropriate mask, you get the cached data.
Our goal is that if you do GymEnv('CartPole-v1') or GymEnv('CartPole-v1', num_envs=2), the only thing that differs is the batch size (your data in the second case has a leading dim of 2). All the rest (structure of the data, calls to reset etc.) is unchanged. That way you can hopefully swap one for the other without much effort. Is there anything we're not considering?
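Concretely, the intended usage looks like this (a sketch; expected shapes assuming the behaviour described above):

from torchrl.envs.libs.gym import GymEnv

single = GymEnv("CartPole-v1")
batched = GymEnv("CartPole-v1", num_envs=2)
print(single.batch_size)   # torch.Size([])
print(batched.batch_size)  # torch.Size([2])
# Same call, same data structure; only the leading dim differs.
rollout = batched.rollout(10, break_when_any_done=False)
print(rollout.shape)       # torch.Size([2, 10])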

I am probably missing some important things in the API design of gymnasium and I'd be happy to know more and understand why the vectorized envs are written that way.

Happy to meet in person when you're in London!

@pseudo-rnd-thoughts commented:

Thanks for the fast reply. I will send you an email when I'm next travelling to London, which might be further off than expected (mid November).

I understand what you intend with GymEnv("CartPole-v1", num_envs=2); this seems like a good idea.
I would just note that you might want to consider if / how you would incorporate projects like EnvPool or SampleFactory, which provide ~1 million Atari fps and ~3 million MuJoCo fps.

For the "final_observation", I'm a bit surprised by your issues: my understanding is that "final_observation" should contain a list of observations, the last observation for each sub-environment. Therefore, with 3 environments where the first two terminate, the final observation should be [env 0 obs, env 1 obs, None].

For the observation shape: to optimize observations for neural networks, we batch (concatenate) observations, i.e., to batch Dict observations we provide a Dict of batched sub-spaces. See https://gymnasium.farama.org/api/spaces/vector_utils/#gymnasium.vector.utils.batch_space for the documentation about this. I can add more examples to help.
This is also the reason why batched Box / Discrete spaces are just "normal" stacked arrays.
Similarly to autoresetting, we would be very happy to allow disabling this batching to just return a tuple of observations.
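For illustration, a small example with the batch_space utility documented above (the printed repr is indicative):

from gymnasium.spaces import Box, Dict
from gymnasium.vector.utils import batch_space

space = Dict({"pos": Box(-1.0, 1.0, shape=(3,))})
# A batched Dict space is a Dict of batched sub-spaces: the dict
# nesting stays on the outside, with the batch dim inside each leaf.
print(batch_space(space, n=2))
# Dict('pos': Box(-1.0, 1.0, (2, 3), float32))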

What is the function type of GymEnv.step? Do you use NumPy arrays for the rewards and the terminated / truncated (done) signals?

@vmoens (Contributor Author) commented Sep 14, 2023:

I understand what you intend with GymEnv("CartPole-v1", num_envs=2); this seems like a good idea. I would just note that you might want to consider if / how you would incorporate projects like EnvPool or SampleFactory, which provide ~1 million Atari fps and ~3 million MuJoCo fps.

Thanks for the suggestion. Have a look at this

For the "final_observation", I'm a bit surprised by your issues, my understanding is that "final_observation" should contain a list of observation for the last observation for a given sub-environment. Therefore, with 3 environments where the first two terminate then the final observation should be [env 0 obs, env 1 obs, None].

Why not an observation that matches the one found in the first element of the tuple returned by step, with 0 values for all obs that are not final? What puzzles me is that the data is presented in 2 different ways. But the root "problem" is the choice of auto-resetting the env IMO, so any patch to that is going to be unsatisfactory for some users. I personally would have expected final_observation to match the data structure of the env, such that I can simply mask things and update my data accordingly.

What is the function type of GymEnv.step? Do you use NumPy arrays for the rewards and the terminated / truncated (done) signals?

GymEnv.step always returns a tensordict. All data is presented as pytorch torch.Tensor, which facilitates integration in training loops, storage and device casting, as well as distributed communication.
We make sure that the data is always presented in a uniform, unsurprising way. So far, we have not really seen any use case where this choice was more restrictive than allowing arrays, tuples, dictionaries and the like. The only one is when an observation is a tuple of very heterogeneous stuff (like tuple(tuple(dict, dict), dict)), but at that point one is looking for trouble IMO, as the data is presented in a way where ordering is a proxy for semantic differences (simply put, they should be using dictionaries to make sure they know what they're working with, rather than lists of values that are semantically different). @smorad had an issue about that recently.

@vmoens (Contributor Author) commented Sep 14, 2023:

@pseudo-rnd-thoughts I guess the hope of users such as me for vec-envs is that they behave like pytorch dataloaders.
If you specify num_workers=0 or num_workers=16, this will not affect your training loop in any way (except for a possible if __name__ == "__main__" guard).

The use case is different, as dataloaders deliver the exact same data structure regardless of the number of workers (i.e., you won't get 16 batches at a time if you ask for 16 workers), but still, the fact that speeding up my code minimally disrupts how I write it is an amazing feature. It means I can just ask for more resources and test things locally without being surprised at any point.
If I want to achieve the same thing with gymnasium, I either have to always build a vectorized env (like in cleanrl), even if I just have 1 copy to run, or I need a bunch of if/else everywhere in the code. The first option is what I assume many people do, which I don't think facilitates app development (it adds a "useless" wrapper and makes debugging harder, as checkpointing multiprocessed code isn't easy).
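A minimal illustration of the dataloader analogy (plain PyTorch, nothing TorchRL-specific):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    data = TensorDataset(torch.randn(8, 3))
    # The loop body is identical whichever value num_workers takes.
    for num_workers in (0, 2):
        loader = DataLoader(data, batch_size=4, num_workers=num_workers)
        for (batch,) in loader:
            pass  # same code path, same batch structure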

I'm afraid that the choice of creating 2 different classes for single and vec envs in gymnasium 1.0 goes even further away from the interchangeability we're hoping for. Simply look at the amount of effort it takes for us to cover single and multiprocessed envs within a single class in TorchRL. Having 2 different env classes for these will be even more difficult.

In the age of jax and functorch, why do we still need to differentiate single and multiprocessed envs? Why can't we vectorize all ops? Like you were saying with EnvPool: if we relied on modern solutions, we could just tell an env what its batch size is, execute ops in a vectorized way, and call it a day. This is the direction many people want to see (take Isaac or VMAS, for instance).

@vmoens marked this pull request as ready for review September 14, 2023 16:44
@pseudo-rnd-thoughts commented:
Good questions, particularly about scaling to vmapping over stuff and the problem of scaling code.
@RedTachyon might have some helpful thoughts here.

I think the primary answer would be backward compatibility: this is the way that gym was designed, back in the day when people were just trying to get RL to work with a single environment. But I agree that in the modern age (5 years later), Gym is starting to show its limitations.

One of our aims in separating out Env from VectorEnv was to make the changes between the two more consistent, as previously there were a number of small differences that made switching unreliable.

Otherwise, if you design your algorithm to use vector environments to begin with, I don't think it is easy to scale in the way that you are discussing, but I certainly agree that this is an important area of research.

@vmoens added the Environments label Sep 15, 2023
@matteobettini (Contributor) left a comment:

LGTM, a few comments.

@vmoens (Contributor Author) commented Sep 17, 2023:

Merging with the following failing tests:

  • MacOs (unrelated)
  • RoboHive (unrelated)
  • Habitat (unrelated)

@vmoens merged commit fe91c4f into main Sep 17, 2023
16 of 47 checks passed
@vmoens deleted the gym_vecenvs branch September 17, 2023 19:33
@skandermoalla (Contributor) commented:

Awesome! 🚀🚀 Any pointers on how to use this feature?

albertbou92 pushed a commit to PyTorchRL/rl that referenced this pull request Sep 18, 2023
vmoens added a commit to hyerra/rl that referenced this pull request Oct 10, 2023
Labels
CLA Signed, enhancement, Environments

Development
Successfully merging this pull request may close these issues:
[BUG] Throughput vs Gym AsyncVectorEnv