Change autoreset order such that reset will happen on the next step #785
Description
Vector environments have two choices for implementing autoreset functionality:

1. Reset the sub-environment within the same step in which it terminates / truncates, returning the first observation of the new episode and storing the final observation and info in the step's info under `"final_observation"` and `"final_info"`.
2. Reset the sub-environment on the step after it returns `terminated=True` (or `truncated=True`), meaning there is technically a dead action that does nothing.

An important note is that it is not possible to convert between reset orderings, i.e., if you have a vector environment using implementation 1, you can't run some code that makes it look like implementation 2, and similarly for 2 you can't convert to 1 (to my knowledge). A sketch of the two orderings is below.
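To make the difference concrete, here is a small sketch of how a rollout loop sees a terminating sub-environment under each ordering. It is purely illustrative; the comments describe the behaviour rather than the output of any particular Gymnasium version.

```python
import gymnasium as gym

envs = gym.make_vec("CartPole-v1", num_envs=2, vectorization_mode="sync")
obs, info = envs.reset(seed=0)

obs, rewards, terminations, truncations, info = envs.step(envs.action_space.sample())

# Suppose sub-environment 0 terminated on this step (terminations[0] is True).
#
# Implementation 1 (same-step reset, the pre-v1.0 behaviour):
#   obs[0] is already the first observation of the NEW episode; the last
#   observation of the finished episode is only available via
#   info["final_observation"][0] (and its info via info["final_info"][0]).
#
# Implementation 2 (next-step reset, this PR):
#   obs[0] is still the final observation of the finished episode.
#   On the NEXT envs.step(...) call, the action for sub-environment 0 is
#   ignored (the "dead" action), the sub-environment resets, and the
#   returned obs[0] is the first observation of the new episode.
```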
Previously, gym and gymnasium have used the first option; however, after a long time thinking about this, I believe that v1.0.0 is the best time for us to change to type 2.
Why?
I have three main reasons:

Vector-only projects

My primary motivating factor is thinking about pure vector-only projects like EnvPool and SampleFactory. For optimisation reasons, it is (highly) inefficient to store data in a dictionary compared to a NumPy array; for this reason, to my knowledge, all vector-only projects use functionality 2 to autoreset environments.
Why does this matter for Gymnasium? In v1.0.0, we have separated `Env` from `VectorEnv` such that neither inherits from the other. As a result, vector-only wrappers have been created to be used with vector environments (normal wrappers can be used inside the async and sync vector environments, however this is not possible for EnvPool and SampleFactory). Therefore, given the note above about interoperability, if we continue with functionality 1, our vector wrappers cannot be used with these important vector-only projects.

More elegant training code
As originally noted in #32 (comment), functionality 1 requires relatively ugly training code compared to functionality 2. This is particularly true for vector environments; see https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py#L200 and the sketch below.
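A rough sketch of the pattern that functionality 1 forces onto training code (variable names are placeholders; the loop mirrors the kind of code in the CleanRL script linked above rather than quoting it):

```python
import gymnasium as gym
import numpy as np

envs = gym.make_vec("CartPole-v1", num_envs=4, vectorization_mode="sync")
obs, info = envs.reset(seed=0)
transitions = []  # stand-in for a replay / rollout buffer

for _ in range(100):
    actions = envs.action_space.sample()
    next_obs, rewards, terminations, truncations, info = envs.step(actions)

    # Functionality 1: next_obs already contains the new episodes' first
    # observations for any finished sub-environment, so the true final
    # observations must be recovered from the info dict before the
    # transition is stored (otherwise bootstrapping uses the wrong state).
    real_next_obs = next_obs.copy()
    for idx, done in enumerate(np.logical_or(terminations, truncations)):
        if done:
            real_next_obs[idx] = info["final_observation"][idx]

    transitions.append((obs, actions, rewards, real_next_obs, terminations))
    obs = next_obs

envs.close()
```

Under functionality 2, the final observation arrives through `next_obs` itself, so the inner recovery loop disappears entirely.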
Simplifies `VectorObservationWrapper`

`VectorObservationWrapper` currently requires two observation functions in order to transform observations: one for the batched vector observation and one for a single observation, as the final observation must be transformed on its own. With functionality 2, there is no separate final observation that must be transformed, so only the vectorised transform is needed (see the sketch below).
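As a minimal illustration, the two functions below are hypothetical stand-ins for the wrapper's transform hooks, not the actual `VectorObservationWrapper` API:

```python
import numpy as np

def vector_transform(batch: np.ndarray) -> np.ndarray:
    # Hypothetical transform applied to the whole batched observation.
    return batch.astype(np.float32) / 255.0

def single_transform(observation: np.ndarray) -> np.ndarray:
    # Hypothetical per-observation transform. Under functionality 1 this
    # second hook exists only so that info["final_observation"][i] can be
    # transformed outside the batch; under functionality 2 the final
    # observation comes back through the normal batch, so this hook (and
    # the wrapper's second observation function) is no longer needed.
    return observation.astype(np.float32) / 255.0

batch = np.zeros((4, 84, 84), dtype=np.uint8)  # pretend vector observation
print(vector_transform(batch).shape)  # the only transform needed under functionality 2
```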
Why not?
Unlike the step API change in v0.25, there is no obvious way of telling whether a vector environment has implemented functionality 1 or 2, or whether the training code is written for 1 or 2. This issue cannot be avoided, which is why the change can only be made now, with v1.0.0.
Second, users who use Gymnasium's async or sync vector env will have to update their usage. However, this will be obvious, as `"final_observation"` and `"final_info"` will no longer exist (a sketch of an updated rollout loop is below).
Updating info concatenation and decomposition
By removing `"final_observation"` and `"final_info"` from the info, we can implement recursive vector info support, in particular for vectorising `RecordEpisodeStatistics`. Using `envs = gym.make_vec("CartPole-v1", num_envs=3, vectorization_mode="sync", wrappers=(gym.wrappers.RecordEpisodeStatistics,))`, when a sub-environment terminates / truncates, the info is `{"episode": np.array([{"r": 1, "t": 2, "l": 3}, None, None], dtype=object)}`. However, this means that to access the episode information, you need to do `info["final_info"][i]["episode"]["r"]` for each of the sub-environments. With functionality 2, you can instead do `info["episode"]["r"]` for all of the sub-environments' cumulative rewards (if they have terminated / truncated); a sketch of this access pattern is below.
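A minimal sketch of what the proposed access pattern could allow, assuming the recursive info support described above is in place and that the vectorised info exposes a `"_episode"` boolean mask marking which sub-environments just finished (the mask key is an assumption here):

```python
import gymnasium as gym

envs = gym.make_vec(
    "CartPole-v1",
    num_envs=3,
    vectorization_mode="sync",
    wrappers=(gym.wrappers.RecordEpisodeStatistics,),
)
obs, info = envs.reset(seed=0)

for _ in range(500):
    obs, rewards, terminations, truncations, info = envs.step(envs.action_space.sample())
    if "episode" in info:
        # Batched episode statistics for every sub-environment that just
        # terminated / truncated, accessed directly without "final_info".
        finished = info["_episode"]  # assumed boolean mask, shape (num_envs,)
        print("episodic returns:", info["episode"]["r"][finished])

envs.close()
```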
Furthermore, it is necessary to update `DictToListInfo` to support the reverse operation of this.

To-do
Completion of #694