Hello there,
I am new to TorchRL and am trying to use it to train a PPO agent in Unity ML-Agents. Currently, I am just trying to get the head_balance example scene running, but I have been having some difficulty using the env because it does not line up with the setup in the other tutorials.
The UnityMLAgentsEnv is working and returns an env with the 12 agents in the head_balance scene. As the UnityMLAgentsEnv docs show in their example, all agents sit inside one group in the TensorDict, each with its own fields such as continuous_action, and the rollout works.
The problem, however, is that the keys do not match either the Multiagent PPO Tutorial or the Multiagent DDPG Tutorial, and I cannot find an example of how to handle this format. In both tutorials, the expected keys are ('agents', 'action'), ('agents', 'observation'), etc., because all agents are homogeneous and stacked into one tensor right from the environment. The ML-Agents head_balance example is not stacked, so I am not sure how to correctly wire the individual per-agent keys into the policy or critic modules. I have sketched the mismatch right below this paragraph.
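To make the mismatch concrete, here is roughly what I mean. The key names in the first block are the ones I actually see in my rollout; the stacking snippet at the bottom is just an untested idea of mine (not something from the docs), and it reuses the env and rollout defined in the code further down:

# Layout I get from UnityMLAgentsEnv (one sub-tensordict per agent):
#   ('agents', 'agent_0', 'observation_0'),  ('agents', 'agent_0', 'continuous_action'), ...
#   ('agents', 'agent_11', 'observation_0'), ('agents', 'agent_11', 'continuous_action')
#
# Layout the multiagent tutorials expect (homogeneous agents stacked along one dim):
#   ('agents', 'observation')  -> shape [..., n_agents, obs_dim]
#   ('agents', 'action')       -> shape [..., n_agents, action_dim]
#
# Untested idea: stack the per-agent observations myself before feeding the modules,
# assuming every agent shares the same observation shape.
import torch

agent_names = list(env.group_map["agents"])
stacked_obs = torch.stack(
    [rollout["agents", name, "observation_0"] for name in agent_names], dim=-2
)  # -> [..., n_agents, obs_dim]

I am not sure whether manually stacking like this is the intended way, or whether there is a transform or group_map option I am missing.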
I have been working on getting this example up and running for a while and am stuck on how to correctly interface this style of environment with the different modules. Could I please get some advice or direction on how to go about this?
P.S. If I can get head_balance working with TorchRL and the UnityMLAgentsEnv, I would be more than happy to open a pull request and contribute it so others can avoid the same headaches.
Setup:
python3.12
torchrl==0.6.0
tensordict==0.6.1
mlagents==0.28.0
mlagents-envs==0.28.0
Code:
import multiprocessing

import torch
from tensordict.nn import TensorDictModule, TensorDictSequential
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.envs import (
    Compose,
    TransformedEnv,
    RewardSum,
)
from torchrl.envs import UnityMLAgentsEnv, MarlGroupMapType
from torchrl.envs.utils import check_env_specs
from torchrl.modules import MultiAgentMLP, ProbabilisticActor, TanhNormal, AdditiveGaussianModule
from tqdm import tqdm

# Devices
is_fork = multiprocessing.get_start_method() == "fork"
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)
# Sampling
frames_per_batch = 6_000  # Number of team frames collected per training iteration
n_iters = 10  # Number of sampling and training iterations
total_frames = frames_per_batch * n_iters

# Training
num_epochs = 30  # Number of optimization steps per training iteration
minibatch_size = 400  # Size of the mini-batches in each optimization step
lr = 3e-4  # Learning rate
max_grad_norm = 1.0  # Maximum norm for the gradients

# PPO
clip_epsilon = 0.2  # clip value for PPO loss
gamma = 0.99  # discount factor
lmbda = 0.9  # lambda for generalised advantage estimation
entropy_eps = 1e-4  # coefficient of the entropy term in the PPO loss

base_env = UnityMLAgentsEnv(
    registered_name="3DBall",
    device=device,
    group_map=MarlGroupMapType.ALL_IN_ONE_GROUP,
)
env = TransformedEnv(
    base_env,
    RewardSum(
        in_keys=[key for key in base_env.reward_keys if key[2] == "reward"],  # exclude group reward
        reset_keys=base_env.reset_keys,
    ),
)
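# For context on the filter above: in my env the reward keys are nested 3-level keys
# of the form ('agents', '<agent_name>', '<something>'), so key[2] == "reward" keeps
# only the per-agent 'reward' entries and drops the group-level reward entries.
# (This is my reading of the key layout, not an official description of the API.)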
check_env_specs(base_env)
n_rollout_steps = 5
rollout = env.rollout(n_rollout_steps)
share_parameters_policy = True
policy_net = nn.Sequential(
MultiAgentMLP(
n_agent_inputs=env.observation_spec['agents']['agent_0']['observation_0'].shape[-1],
n_agent_outputs=env.action_spec['agents']['agent_0']['continuous_action'].shape[-1],
n_agents=len(env.group_map['agents']),
centralised=False,
share_params=share_parameters_policy,
device=device,
depth=2,
num_cells=256,
activation_class=nn.Tanh
),
NormalParamExtractor(),
)
policy_module = TensorDictModule(
    policy_net,
    in_keys=[("agents", agent, "observation_0") for agent in env.group_map["agents"]],
    out_keys=[("agents", agent, "action_param") for agent in env.group_map["agents"]],
)
policy = ProbabilisticActor(
    module=policy_module,
    spec=env.full_action_spec["agents", "agent_0", "continuous_action"],
    in_keys=[("agents", agent, "action_param") for agent in env.group_map["agents"]],
    out_keys=[("agents", agent, "continuous_action") for agent in env.group_map["agents"]],
    distribution_class=TanhNormal,
    distribution_kwargs={
        "low": env.action_spec['agents']['agent_0']['continuous_action'].space.low,
        "high": env.action_spec['agents']['agent_0']['continuous_action'].space.high,
    },
    return_log_prob=False,
)
Error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 policy = ProbabilisticActor(
2 module=policy_module,
3 spec=env.full_action_spec["agents", "agent_0", "continuous_action"],
4 in_keys=[("agents", agent, "action_param") for agent in env.group_map["agents"]],
5 out_keys=[("agents", agent, "continuous_action") for agent in env.group_map["agents"]],
6 distribution_class=TanhNormal,
7 distribution_kwargs={
8 "low": env.action_spec['agents']['agent_0']['continuous_action'].space.low,
9 "high": env.action_spec['agents']['agent_0']['continuous_action'].space.high,
10 },
11 return_log_prob=False,
12 )
File c:\Users\ky097697\Development\distributed-rl-framework\venv\Lib\site-packages\torchrl\modules\tensordict_module\actors.py:390, in ProbabilisticActor.__init__(self, module, in_keys, out_keys, spec, **kwargs)
385 if len(out_keys) == 1 and spec is not None and not isinstance(spec, Composite):
386 spec = Composite({out_keys[0]: spec})
388 super().__init__(
389 module,
--> 390 SafeProbabilisticModule(
391 in_keys=in_keys, out_keys=out_keys, spec=spec, **kwargs
392 ),
393 )
File c:\Users\ky097697\Development\distributed-rl-framework\venv\Lib\site-packages\torchrl\modules\tensordict_module\probabilistic.py:132, in SafeProbabilisticModule.__init__(self, in_keys, out_keys, spec, safe, default_interaction_type, distribution_class, distribution_kwargs, return_log_prob, log_prob_key, cache_dist, n_empirical_estimate)
130 elif spec is not None and not isinstance(spec, Composite):
131 if len(self.out_keys) > 1:
--> 132 raise RuntimeError(
133 f"got more than one out_key for the SafeModule: {self.out_keys},\nbut only one spec. "
134 "Consider using a Composite object or no spec at all."
135 )
136 spec = Composite({self.out_keys[0]: spec})
137 elif spec is not None and isinstance(spec, Composite):
RuntimeError: got more than one out_key for the SafeModule: [('agents', 'agent_0', 'continuous_action'), ('agents', 'agent_1', 'continuous_action'), ('agents', 'agent_2', 'continuous_action'), ('agents', 'agent_3', 'continuous_action'), ('agents', 'agent_4', 'continuous_action'), ('agents', 'agent_5', 'continuous_action'), ('agents', 'agent_6', 'continuous_action'), ('agents', 'agent_7', 'continuous_action'), ('agents', 'agent_8', 'continuous_action'), ('agents', 'agent_9', 'continuous_action'), ('agents', 'agent_10', 'continuous_action'), ('agents', 'agent_11', 'continuous_action')],
but only one spec. Consider using a Composite object or no spec at all.
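Following the error's own suggestion ("Consider using a Composite object or no spec at all"), my best guess at a workaround is to hand ProbabilisticActor one Composite spec with one entry per out_key. The sketch below is untested, and I am assuming here that Composite from torchrl.data accepts nested tuple keys in its constructor, which may well be where I am wrong:

from torchrl.data import Composite

# Untested sketch: one spec entry per per-agent out_key, so the number of out_keys
# matches the number of spec entries.
per_agent_action_spec = Composite(
    {
        ("agents", agent, "continuous_action"): env.full_action_spec[
            "agents", agent, "continuous_action"
        ]
        for agent in env.group_map["agents"]
    }
)

policy = ProbabilisticActor(
    module=policy_module,
    spec=per_agent_action_spec,
    in_keys=[("agents", agent, "action_param") for agent in env.group_map["agents"]],
    out_keys=[("agents", agent, "continuous_action") for agent in env.group_map["agents"]],
    distribution_class=TanhNormal,
    distribution_kwargs={
        "low": env.action_spec['agents']['agent_0']['continuous_action'].space.low,
        "high": env.action_spec['agents']['agent_0']['continuous_action'].space.high,
    },
    return_log_prob=False,
)

Even if that silences the spec error, I suspect the deeper issue is still how to feed 12 separate observation entries into modules like MultiAgentMLP that operate on a single stacked [..., n_agents, obs_dim] tensor, which is really what I am asking about.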