
Implement DQN for Acrobot and MountainCar with enhanced exploration #64

Merged 1 commit into main on Nov 26, 2024

Conversation

@leonvanbokhorst (Owner) commented on Nov 26, 2024

  • Implement enhanced exploration strategy with momentum-based decisions
  • Add position and velocity-based reward structure
  • Increase success reward and add progressive rewards for near-goal states
  • Track success positions and epsilon history
  • Achieve 5/5 successful episodes in play mode with consistent performance

Summary by Sourcery

Implement Double DQN architectures for Acrobot-v1 and MountainCar-v0 environments, introducing enhanced exploration strategies and reward structures.

New Features:

  • Implement a Double DQN architecture for the Acrobot-v1 environment with experience replay and continuous state space handling.
  • Introduce a MountainCar-v0 DQN implementation featuring Double DQN architecture, prioritized experience replay, and dueling networks.

Enhancements:

  • Enhance the exploration strategy in the MountainCar-v0 environment with momentum-based decisions and position-based strategies.
  • Add a position- and velocity-based reward structure to the MountainCar-v0 environment, including an increased success reward and progressive rewards for near-goal states (a minimal sketch follows below).
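For illustration only: a minimal sketch of what a momentum-based exploration rule and a position/velocity reward bonus could look like. It assumes the standard MountainCar-v0 observation (position, velocity) and action indices 0 = push left, 1 = no push, 2 = push right; the constants and function names are illustrative, not the values used in this PR.

import random

GOAL_POSITION = 0.5       # MountainCar-v0 goal position on the x-axis
NEAR_GOAL_POSITION = 0.4  # illustrative threshold for the "near-goal" bonus

def exploratory_action(position: float, velocity: float) -> int:
    """Momentum-based random action: usually push in the direction of motion."""
    if random.random() < 0.8:            # mostly reinforce the current momentum
        return 2 if velocity > 0 else 0
    return random.choice([0, 1, 2])      # occasionally act fully at random

def shaped_reward(position: float, velocity: float, env_reward: float) -> float:
    """Position- and velocity-based shaping on top of the env's per-step reward."""
    reward = env_reward
    reward += abs(velocity) * 10.0                        # reward building momentum
    if position >= GOAL_POSITION:
        reward += 100.0                                   # large success bonus
    elif position >= NEAR_GOAL_POSITION:
        reward += (position - NEAR_GOAL_POSITION) * 50.0  # progressive near-goal bonus
    return reward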

@sourcery-ai bot (Contributor) commented on Nov 26, 2024

Reviewer's Guide by Sourcery

This PR implements two reinforcement learning agents for the Acrobot and MountainCar environments using Deep Q-Networks (DQN). The implementation includes several advanced features, such as Double DQN, experience replay, and enhanced exploration strategies. Both agents use curriculum learning and reward shaping to improve training efficiency.
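As a rough illustration of the Double DQN idea mentioned above (not the PR's exact code): the policy network selects the next action and the target network evaluates it, which reduces the overestimation bias of vanilla DQN.

import torch

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN targets: policy_net picks the next action, target_net scores it."""
    with torch.no_grad():
        # Action selection by the online/policy network
        next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation by the slowly updated target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # No bootstrapping on terminal transitions
        return rewards + gamma * next_q * (1.0 - dones)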

Sequence diagram for AcrobotAI training process

sequenceDiagram
    actor Trainer
    participant AcrobotAI
    participant Environment
    participant ReplayMemory
    participant DQN
    Trainer->>AcrobotAI: train(num_episodes)
    loop for each episode
        AcrobotAI->>Environment: reset()
        AcrobotAI->>DQN: select_action(state)
        DQN-->>AcrobotAI: action
        AcrobotAI->>Environment: step(action)
        Environment-->>AcrobotAI: next_state, reward, done
        AcrobotAI->>ReplayMemory: push(Experience)
        AcrobotAI->>DQN: optimize_model(experiences)
        DQN-->>AcrobotAI: update policy
    end
    AcrobotAI-->>Trainer: training complete

Sequence diagram for MountainCarAI training process

sequenceDiagram
    actor Trainer
    participant MountainCarAI
    participant Environment
    participant ReplayMemory
    participant DQN
    Trainer->>MountainCarAI: train(num_episodes)
    loop for each episode
        MountainCarAI->>Environment: reset()
        MountainCarAI->>DQN: select_action(state)
        DQN-->>MountainCarAI: action
        MountainCarAI->>Environment: step(action)
        Environment-->>MountainCarAI: next_state, reward, done
        MountainCarAI->>ReplayMemory: push(Experience)
        MountainCarAI->>DQN: optimize_model()
        DQN-->>MountainCarAI: update policy
    end
    MountainCarAI-->>Trainer: training complete

Class diagram for AcrobotAI and MountainCarAI

classDiagram
    class DQN {
        +Linear fc1
        +Linear fc2
        +Linear fc3
        +forward(Tensor x) Tensor
    }
    class ReplayMemory {
        +deque memory
        +push(Experience experience)
        +sample(int batch_size) List~Experience~
        +__len__() int
    }
    class AcrobotAI {
        +DQN policy_net
        +DQN target_net
        +ReplayMemory memory
        +select_action(Tensor state) Tensor
        +optimize_model(List~Experience~ experiences)
        +train(int num_episodes) bool
        +play(int episodes)
    }
    class MountainCarAI {
        +DQN policy_net
        +DQN target_net
        +ReplayMemory memory
        +select_action(Tensor state) int
        +optimize_model()
        +train(int num_episodes) bool
        +play(int episodes)
    }
    class Experience {
        +state
        +action
        +reward
        +next_state
        +done
    }
    AcrobotAI o-- DQN
    AcrobotAI o-- ReplayMemory
    MountainCarAI o-- DQN
    MountainCarAI o-- ReplayMemory
    ReplayMemory o-- Experience

File-Level Changes

gym/acrobot.py: Implemented Acrobot DQN agent with curriculum learning and momentum-based exploration
  • Created a DQN architecture with a 3-layer neural network
  • Implemented an experience replay buffer for training stability
  • Added curriculum-based learning with progressive difficulty phases
  • Designed a momentum-based exploration strategy with smart random actions
  • Implemented a progressive reward structure based on height and stability

gym/mountaincar.py: Implemented MountainCar DQN agent with prioritized experience replay and position-based rewards
  • Created a simplified DQN architecture optimized for MountainCar
  • Implemented prioritized experience replay memory (a minimal sketch follows below)
  • Added an enhanced exploration strategy based on velocity and position
  • Designed a position- and velocity-based reward structure
  • Added success position and epsilon history tracking
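The buffer in gym/mountaincar.py is not reproduced here; the following is a minimal proportional prioritized replay sketch (no sum-tree, hyperparameters made up) to show the general idea.

import random
from collections import namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class PrioritizedReplayMemory:
    """Minimal proportional prioritized replay buffer (illustrative only)."""

    def __init__(self, capacity: int = 10_000, alpha: float = 0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling
        self.memory = []
        self.priorities = []

    def push(self, experience) -> None:
        # New experiences get the current max priority so they are sampled at least once
        max_prio = max(self.priorities, default=1.0)
        if len(self.memory) >= self.capacity:
            self.memory.pop(0)
            self.priorities.pop(0)
        self.memory.append(experience)
        self.priorities.append(max_prio)

    def sample(self, batch_size: int):
        weights = [p ** self.alpha for p in self.priorities]
        return random.choices(self.memory, weights=weights, k=batch_size)

    def __len__(self) -> int:
        return len(self.memory)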


sourcery-ai bot changed the title from "@sourcery-ai" to "Implement DQN for Acrobot and MountainCar with enhanced exploration" on Nov 26, 2024
leonvanbokhorst self-assigned this on Nov 26, 2024
leonvanbokhorst added the documentation and enhancement labels on Nov 26, 2024
leonvanbokhorst added this to the Phase 1 milestone on Nov 26, 2024
leonvanbokhorst merged commit ca62d33 into main on Nov 26, 2024
1 check failed
@sourcery-ai bot (Contributor) left a comment


Hey @leonvanbokhorst - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


cos_theta2, sin_theta2 = state[2].item(), state[3].item()
tip_height = -cos_theta1 - cos_theta2

# More exploration when stuck

issue (bug_risk): stuck_counter is used before initialization

Initialize stuck_counter in __init__ to avoid potential runtime errors.
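A minimal sketch of the suggested fix (the surrounding attributes are assumed; only stuck_counter is the point here):

class AcrobotAI:
    def __init__(self) -> None:
        # ... networks, optimizer, replay memory set up here ...
        # Initialize the counter so the reward logic can safely
        # increment or reset it from the very first episode.
        self.stuck_counter = 0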

done = terminated or truncated
steps += 1

# Calculate height

issue (complexity): Consider extracting reward calculation logic into a dedicated method to improve code organization.

The reward calculation logic can be simplified by extracting it into a dedicated method. This would improve readability while maintaining all functionality. Here's a suggested refactor:

def calculate_reward(self, tip_height: float, target_height: float, prev_max_height: float) -> tuple[float, bool]:
    reward = 0
    episode_success = False

    # Base curriculum reward
    if tip_height > target_height:
        reward += 10.0

    # Progressive rewards
    reward += (tip_height - target_height) * 5.0

    # Previous max height bonus
    if tip_height > prev_max_height:
        reward += 5.0
        self.stuck_counter = 0
    else:
        self.stuck_counter += 1

    # Success reward
    if tip_height > SUCCESS_HEIGHT:
        episode_success = True
        reward += 100.0

    return reward - 0.1, episode_success  # Include time penalty

Then simplify the training loop:

while not done:
    action = self.select_action(state)
    next_state, _, terminated, truncated, _ = self.env.step(action.item())
    done = terminated or truncated
    steps += 1

    tip_height = -(next_state[0] + next_state[2])  # Simplified height calc
    max_height = max(max_height, tip_height)

    reward, episode_success = self.calculate_reward(
        tip_height, target_height, prev_max_height
    )

    self.memory.push(Experience(state, action, reward, next_state, done))
    # ... rest of training loop

This separates reward calculation concerns from the main training flow while keeping all the existing logic intact.

torch.nn.utils.clip_grad_value_(self.policy_net.parameters(), 100)
self.optimizer.step()

def train(self, num_episodes: int):

issue (code-quality): We've found these issues:


Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

ai = AcrobotAI()

print("🏃‍♂️ Training the agent...")
solved = ai.train(num_episodes=1000)

issue (code-quality): Use named expression to simplify assignment and conditional (use-named-expression)
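For reference, the named-expression form Sourcery is suggesting would look roughly like this, assuming the assignment feeds a following if solved: check (the play call here is illustrative):

ai = AcrobotAI()

print("🏃‍♂️ Training the agent...")
if solved := ai.train(num_episodes=1000):
    ai.play(episodes=5)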


position, velocity = state

if random.random() > epsilon:

issue (code-quality): Merge else clause's nested if statement into elif (merge-else-if-into-elif)
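A generic illustration of that refactor (the branch bodies and names are illustrative, not the PR's exact logic):

import random

def choose_action(best_action: int, velocity: float, epsilon: float) -> int:
    # Before: the exploration branch was written as `else:` with a nested `if`;
    # merging the nested if into `elif` removes one level of nesting.
    if random.random() > epsilon:
        return best_action
    elif velocity > 0:
        return 2  # push right
    else:
        return 0  # push left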

torch.nn.utils.clip_grad_value_(self.policy_net.parameters(), 100)
self.optimizer.step()

def train(self, num_episodes: int):

issue (code-quality): We've found these issues:


Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

ai = MountainCarAI()

print("\n🏃‍♂️ Training the agent...")
solved = ai.train(num_episodes=1000) # Increased max episodes

issue (code-quality): Use named expression to simplify assignment and conditional (use-named-expression)
