Change autoreset order for wrapper and vector environments #694

@pseudo-rnd-thoughts pseudo-rnd-thoughts commented Aug 28, 2023

Description

Vector environments have two choices for implementing autoreset functionality:

  1. During the same step in which a sub-environment terminates / truncates, we can reset the sub-environment; however, this forces the terminated / truncated observation and info (referred to as the final observation and info) to be stored in the info dict as "final_observation" and "final_info".
  2. During the next step after a sub-environment terminates / truncates, we can reset the sub-environment; however, in cases where terminated=True there is technically a dead action that does nothing.
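The two orderings can be contrasted with a toy single-environment sketch (this is illustrative code, not Gymnasium's implementation; `ToyEnv` and the `inner_step`/`inner_reset` names are made up for the example):

```python
import numpy as np

# Toy illustration of the two autoreset orderings.
# ToyEnv terminates after its second step.
class ToyEnv:
    def __init__(self):
        self.t = 0
        self.needs_reset = False

    def inner_reset(self):
        self.t = 0
        return np.array([0.0]), {}

    def inner_step(self):
        self.t += 1
        obs = np.array([float(self.t)])
        terminated = self.t >= 2
        return obs, 1.0, terminated, False, {}

# Option 1 (same-step reset): on termination, step returns the *reset*
# observation and hides the true final observation inside `info`.
def step_option_1(env, action):
    obs, reward, terminated, truncated, info = env.inner_step()
    if terminated or truncated:
        info["final_observation"] = obs
        obs, _ = env.inner_reset()
    return obs, reward, terminated, truncated, info

# Option 2 (next-step reset): step returns the final observation
# unchanged; the reset happens on the *next* call, whose action is a
# "dead" action that is ignored.
def step_option_2(env, action):
    if env.needs_reset:
        obs, info = env.inner_reset()  # `action` is ignored here
        env.needs_reset = False
        return obs, 0.0, False, False, info
    obs, reward, terminated, truncated, info = env.inner_step()
    env.needs_reset = terminated or truncated
    return obs, reward, terminated, truncated, info
```

Note how, under option 1, the terminal observation only survives inside `info`, while under option 2 it is returned normally and one step is "lost" to the reset.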

An important note is that it is not possible to convert between reset orderings: if a vector environment uses implementation 1, you can't wrap it in code that makes it look like implementation 2, and likewise you can't convert 2 into 1 (to my knowledge).

Previously, Gym and Gymnasium have used the first option; however, after a long time thinking about this, I believe that v1.0.0 is the best time for us to change to option 2.

Why?

I have three main reasons

vector-only projects

My primary motivating factor is pure vector-only projects like EnvPool and SampleFactory. For optimisation reasons, it is (highly) inefficient to store data in a dictionary compared to a NumPy array, and for this reason, to my knowledge, all vector-only projects use option 2 to autoreset environments.

Why does this matter for Gymnasium? In v1.0.0, we have separated Env from VectorEnv such that neither inherits from the other. As a result, vector-only wrappers have been created to be used with vector environments (normal wrappers can be used inside Gymnasium's async and sync vector environments, but this is not possible for EnvPool and SampleFactory). Therefore, given the note above about interoperability, if we continue with option 1, our vector wrappers cannot be used with these important vector-only projects.

more elegant training code

As originally noted in #32 (comment), option 1 requires relatively ugly training code compared to option 2.
This is particularly true for vector environments; see https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py#L200
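The awkward pattern under option 1 is sketched below with made-up data (three sub-environments, one terminated): before storing a transition, the training loop must patch the true final observation back in from `info["final_observation"]`, because `step` already returned the post-reset observation.

```python
import numpy as np

# Hypothetical single vector step under option 1 (same-step autoreset):
# sub-env 1 terminated, so obs[1] is already the *reset* observation and
# the true final observation is hidden in info["final_observation"].
obs = np.array([[0.1], [0.0], [0.3]])  # obs[1] is post-reset
dones = np.array([False, True, False])
info = {
    "final_observation": np.array([None, np.array([0.9]), None], dtype=object)
}

# The replay buffer must receive the *final* observation, not the reset
# one, so we patch it back in before storing the transition.
real_next_obs = obs.copy()
for idx, done in enumerate(dones):
    if done:
        real_next_obs[idx] = info["final_observation"][idx]
```

Under option 2 this loop disappears entirely: the observation returned by `step` is always the true next observation, and the reset happens on the following step.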

Simplifies VectorObservationWrapper

VectorObservationWrapper currently requires two observation functions to transform an observation: one for the vectorised observation and one for a single observation, since the final observation must be transformed on its own.
With option 2, there is no separate final observation to transform, so a single vectorised function suffices.
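A minimal sketch of what the simplified wrapper could look like under option 2 (class and method names are illustrative, not Gymnasium's actual v1.0 API; `ToyVectorEnv` is a stand-in for a real vector environment):

```python
import numpy as np

class ToyVectorEnv:
    """Minimal stand-in for a vector env with 2 sub-environments."""
    def reset(self, **kwargs):
        return np.zeros((2, 1)), {}

    def step(self, actions):
        obs = np.full((2, 1), 255.0)
        return obs, np.zeros(2), np.array([False, True]), np.array([False, False]), {}

class VectorObservationWrapper:
    """With no "final_observation" in info, one vectorised transform
    covers everything reset/step return."""
    def __init__(self, env):
        self.env = env

    def observation(self, obs):
        raise NotImplementedError

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self.observation(obs), info

    def step(self, actions):
        obs, rew, term, trunc, info = self.env.step(actions)
        # No separate final observation to transform.
        return self.observation(obs), rew, term, trunc, info

class NormalizeObs(VectorObservationWrapper):
    def observation(self, obs):
        return obs / 255.0

env = NormalizeObs(ToyVectorEnv())
obs, *_ = env.step(np.zeros(2))
```

The key point is that `observation` only ever needs to handle the batched shape, since terminal observations come through `step` like any other.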

Why not?

Unlike the step API change in v0.25, there is no obvious way of telling whether a vector environment implements option 1 or 2, or whether training code is written for 1 or 2. This issue cannot be avoided, which is why a breaking release like v1.0.0 is the only time we can make this change.

Second, users of Gymnasium's async or sync vector envs will have to update their usage. However, the change will be obvious, as "final_observation" and "final_info" will no longer exist.

Updating info concatenation and decomposition

By removing "final_observation" and "final_info" from info, we can implement recursive vector info support, in particular for vectorising RecordEpisodeStatistics.

Using envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync", wrappers=(gym.wrappers.RecordEpisodeStatistics,)), when a sub-environment terminates / truncates, the info is {"episode": np.array([{"r": 1, "t": 2, "l": 3}, None, None], dtype=object)}. However, this means that to access the episode information, you need to do info["final_info"][i]["episode"]["r"] for each sub-environment.

Rather, with option 2, you can do info["episode"]["r"] to get the cumulative rewards of all sub-environments (that have terminated / truncated).
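The difference in access patterns can be shown with made-up info dicts (the values and the "_episode" mask key are illustrative, mirroring Gymnasium's "_key" masking convention, not output copied from the library):

```python
import numpy as np

# Option 1: episode stats buried per-index under "final_info".
info_v1 = {"final_info": np.array(
    [{"episode": {"r": 10.0, "l": 20}}, None, None], dtype=object)}
rewards_v1 = [fi["episode"]["r"]
              for fi in info_v1["final_info"] if fi is not None]

# Option 2: episode stats stored as arrays directly in info, with a
# boolean mask saying which sub-envs actually finished this step.
info_v2 = {
    "episode": {"r": np.array([10.0, 0.0, 0.0]),
                "l": np.array([20, 0, 0])},
    "_episode": np.array([True, False, False]),
}
rewards_v2 = info_v2["episode"]["r"][info_v2["_episode"]]
```

The option 2 layout keeps everything in NumPy arrays, so logging the finished episodes is a single masked indexing operation instead of a Python loop over object arrays.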

Furthermore, it is necessary to update DictToListInfo to support the reverse of this operation.
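For concreteness, here is a sketch of the two directions of the info conversion: dict-of-arrays to per-sub-env list of dicts (what DictToListInfo does) and the reverse (what would need to be added). This is a simplified flat-key sketch, not Gymnasium's implementation; it only assumes the "_key" boolean-mask convention described above.

```python
import numpy as np

def dict_to_list(info, num_envs):
    """Vector info dict -> one dict per sub-environment."""
    out = [{} for _ in range(num_envs)]
    for key, values in info.items():
        if key.startswith("_"):  # skip mask entries
            continue
        mask = info.get("_" + key, np.ones(num_envs, dtype=bool))
        for i in range(num_envs):
            if mask[i]:
                out[i][key] = values[i]
    return out

def list_to_dict(infos):
    """Reverse operation: list of per-sub-env dicts -> vector info dict."""
    num_envs = len(infos)
    out = {}
    keys = {k for info in infos for k in info}
    for key in keys:
        out[key] = np.array([info.get(key) for info in infos], dtype=object)
        out["_" + key] = np.array([key in info for info in infos])
    return out
```

Each direction preserves which sub-environments actually produced a value via the "_key" mask, so a round trip loses no information.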
