Implement DQN for Acrobot and MountainCar with enhanced exploration #64
Conversation
- Implement enhanced exploration strategy with momentum-based decisions
- Add position- and velocity-based reward structure (see the sketch below)
- Increase success reward and add progressive rewards for near-goal states
- Track success positions and epsilon history
- Achieve 5/5 successful episodes in play mode with consistent performance
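For illustration, here is a minimal sketch of what position- and velocity-based reward shaping for MountainCar-v0 could look like. The function name, coefficients, and thresholds are assumptions for this sketch, not values taken from the PR:

```python
# Illustrative sketch only: shaping terms and constants are assumptions, not the PR's values.
GOAL_POSITION = 0.5  # MountainCar-v0 goal flag position


def shaped_reward(position: float, velocity: float, base_reward: float = -1.0) -> float:
    reward = base_reward
    reward += abs(velocity) * 10.0        # encourage building momentum
    reward += (position + 1.2) * 0.5      # encourage progress toward the right hill
    if position > GOAL_POSITION - 0.1:    # progressive bonus for near-goal states
        reward += 5.0
    if position >= GOAL_POSITION:         # enlarged success reward
        reward += 100.0
    return reward


print(shaped_reward(position=0.52, velocity=0.04))
```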
Reviewer's Guide by Sourcery

This PR implements two reinforcement learning agents for the Acrobot and MountainCar environments using Deep Q-Networks (DQN). The implementation includes several advanced features such as Double DQN, experience replay, and enhanced exploration strategies. Both agents use curriculum learning and sophisticated reward shaping to improve training efficiency.

Sequence diagram for AcrobotAI training process

```mermaid
sequenceDiagram
actor Trainer
participant AcrobotAI
participant Environment
participant ReplayMemory
participant DQN
Trainer->>AcrobotAI: train(num_episodes)
loop for each episode
AcrobotAI->>Environment: reset()
AcrobotAI->>DQN: select_action(state)
DQN-->>AcrobotAI: action
AcrobotAI->>Environment: step(action)
Environment-->>AcrobotAI: next_state, reward, done
AcrobotAI->>ReplayMemory: push(Experience)
AcrobotAI->>DQN: optimize_model(experiences)
DQN-->>AcrobotAI: update policy
end
AcrobotAI-->>Trainer: training complete
```

Sequence diagram for MountainCarAI training process

```mermaid
sequenceDiagram
actor Trainer
participant MountainCarAI
participant Environment
participant ReplayMemory
participant DQN
Trainer->>MountainCarAI: train(num_episodes)
loop for each episode
MountainCarAI->>Environment: reset()
MountainCarAI->>DQN: select_action(state)
DQN-->>MountainCarAI: action
MountainCarAI->>Environment: step(action)
Environment-->>MountainCarAI: next_state, reward, done
MountainCarAI->>ReplayMemory: push(Experience)
MountainCarAI->>DQN: optimize_model()
DQN-->>MountainCarAI: update policy
end
MountainCarAI-->>Trainer: training complete
```

Class diagram for AcrobotAI and MountainCarAI

```mermaid
classDiagram
class DQN {
+Linear fc1
+Linear fc2
+Linear fc3
+forward(Tensor x) Tensor
}
class ReplayMemory {
+deque memory
+push(Experience experience)
+sample(int batch_size) List~Experience~
+int __len__()
}
class AcrobotAI {
+DQN policy_net
+DQN target_net
+ReplayMemory memory
+select_action(Tensor state) Tensor
+optimize_model(List~Experience~ experiences)
+train(int num_episodes) bool
+play(int episodes)
}
class MountainCarAI {
+DQN policy_net
+DQN target_net
+ReplayMemory memory
+select_action(Tensor state) int
+optimize_model()
+train(int num_episodes) bool
+play(int episodes)
}
class Experience {
+state
+action
+reward
+next_state
+done
}
AcrobotAI o-- DQN
AcrobotAI o-- ReplayMemory
MountainCarAI o-- DQN
MountainCarAI o-- ReplayMemory
ReplayMemory o-- Experience
```
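For readers following the diagram, here is a minimal sketch of how the `DQN` and `ReplayMemory` pieces could be written in PyTorch. Layer widths, the buffer capacity, and default arguments are assumptions, not values taken from the PR:

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn as nn

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class DQN(nn.Module):
    """Three-layer MLP mapping an observation to one Q-value per action."""

    def __init__(self, n_observations: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(n_observations, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, n_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


class ReplayMemory:
    """Fixed-size buffer of past transitions, sampled uniformly during optimization."""

    def __init__(self, capacity: int = 10_000):
        self.memory = deque(maxlen=capacity)

    def push(self, experience: Experience) -> None:
        self.memory.append(experience)

    def sample(self, batch_size: int) -> list[Experience]:
        return random.sample(self.memory, batch_size)

    def __len__(self) -> int:
        return len(self.memory)
```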
Hey @leonvanbokhorst - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
```python
cos_theta2, sin_theta2 = state[2].item(), state[3].item()
tip_height = -cos_theta1 - cos_theta2

# More exploration when stuck
```
issue (bug_risk): `stuck_counter` is used before initialization
Initialize `self.stuck_counter` in `__init__` to avoid potential runtime errors.
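A minimal sketch of the suggested fix; everything except the attribute name is illustrative:

```python
class AcrobotAI:
    def __init__(self) -> None:
        # ... existing setup (networks, optimizer, replay memory) ...
        self.stuck_counter = 0  # define up front so later reads never hit an uninitialized attribute
```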
```python
done = terminated or truncated
steps += 1

# Calculate height
```
issue (complexity): Consider extracting reward calculation logic into a dedicated method to improve code organization.
The reward calculation logic can be simplified by extracting it into a dedicated method. This would improve readability while maintaining all functionality. Here's a suggested refactor:
```python
def calculate_reward(self, tip_height: float, target_height: float, prev_max_height: float) -> tuple[float, bool]:
    reward = 0
    episode_success = False

    # Base curriculum reward
    if tip_height > target_height:
        reward += 10.0
        # Progressive rewards
        reward += (tip_height - target_height) * 5.0

    # Previous max height bonus
    if tip_height > prev_max_height:
        reward += 5.0
        self.stuck_counter = 0
    else:
        self.stuck_counter += 1

    # Success reward
    if tip_height > SUCCESS_HEIGHT:
        episode_success = True
        reward += 100.0

    return reward - 0.1, episode_success  # Include time penalty
```
Then simplify the training loop:
```python
while not done:
    action = self.select_action(state)
    next_state, _, terminated, truncated, _ = self.env.step(action.item())
    done = terminated or truncated
    steps += 1

    tip_height = -(next_state[0] + next_state[2])  # Simplified height calc
    max_height = max(max_height, tip_height)

    reward, episode_success = self.calculate_reward(
        tip_height, target_height, prev_max_height
    )

    self.memory.push(Experience(state, action, reward, next_state, done))
    # ... rest of training loop
```
This separates reward calculation concerns from the main training flow while keeping all the existing logic intact.
```python
torch.nn.utils.clip_grad_value_(self.policy_net.parameters(), 100)
self.optimizer.step()

def train(self, num_episodes: int):
```
issue (code-quality): We've found these issues:
- Hoist statements out of for/while loops (`hoist-statement-from-loop`); see the sketch after the explanation below
- Low code quality found in AcrobotAI.train, 14% (`low-code-quality`)
Explanation
The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do; ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
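For reference, a self-contained sketch of the kind of change `hoist-statement-from-loop` asks for; the tensor and constant here are illustrative, not code from the PR:

```python
import torch

TARGET_HEIGHT = 0.5  # illustrative constant

# Before: a loop-invariant tensor is rebuilt on every episode
for episode in range(3):
    target = torch.tensor([TARGET_HEIGHT])
    print(episode, target)

# After: the invariant statement is hoisted out of the loop
target = torch.tensor([TARGET_HEIGHT])
for episode in range(3):
    print(episode, target)
```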
```python
ai = AcrobotAI()

print("🏃♂️ Training the agent...")
solved = ai.train(num_episodes=1000)
```
issue (code-quality): Use named expression to simplify assignment and conditional (`use-named-expression`)
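For context, this is the pattern `use-named-expression` suggests, reusing `ai` from the hunk above and assuming the result is only checked in a following `if solved:` line (not shown in the hunk):

```python
# Before (assumed follow-up check)
solved = ai.train(num_episodes=1000)
if solved:
    ai.play(episodes=5)

# After: the named expression combines assignment and test
if solved := ai.train(num_episodes=1000):
    ai.play(episodes=5)
```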
```python
position, velocity = state

if random.random() > epsilon:
```
issue (code-quality): Merge else clause's nested if statement into elif (`merge-else-if-into-elif`)
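A self-contained illustration of `merge-else-if-into-elif`; the branch bodies and values are placeholders, since only the `if random.random() > epsilon:` line is visible in the hunk above:

```python
import random

epsilon, velocity = 0.3, 0.01  # illustrative values

# Before: nested if inside the else branch
if random.random() > epsilon:
    action = "greedy"
else:
    if velocity > 0:
        action = "push_right"
    else:
        action = "push_left"

# After: the nested if is merged into an elif
if random.random() > epsilon:
    action = "greedy"
elif velocity > 0:
    action = "push_right"
else:
    action = "push_left"

print(action)
```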
```python
torch.nn.utils.clip_grad_value_(self.policy_net.parameters(), 100)
self.optimizer.step()

def train(self, num_episodes: int):
```
issue (code-quality): We've found these issues:
- Replace if statement with if expression (`assign-if-exp`); see the sketch after the explanation below
- Low code quality found in MountainCarAI.train, 9% (`low-code-quality`)
Explanation
The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do; ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
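A self-contained illustration of `assign-if-exp` (replacing an if statement with a conditional expression); the running-average example is hypothetical, not code from MountainCarAI.train:

```python
rewards = [-200.0, -180.0, -150.0]  # illustrative episode returns

# Before: a four-line if/else just to pick a value
if len(rewards) >= 100:
    avg_reward = sum(rewards[-100:]) / 100
else:
    avg_reward = sum(rewards) / max(len(rewards), 1)

# After: a conditional expression keeps the assignment in one statement
avg_reward = (
    sum(rewards[-100:]) / 100
    if len(rewards) >= 100
    else sum(rewards) / max(len(rewards), 1)
)
print(avg_reward)
```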
```python
ai = MountainCarAI()

print("\n🏃♂️ Training the agent...")
solved = ai.train(num_episodes=1000)  # Increased max episodes
```
issue (code-quality): Use named expression to simplify assignment and conditional (`use-named-expression`)
Summary by Sourcery
Implement Double DQN architectures for Acrobot-v1 and MountainCar-v0 environments, introducing enhanced exploration strategies and reward structures.