<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Cathy Yeh" />
<meta property="og:type" content="article" />
<meta name="twitter:card" content="summary">
<meta name="keywords" content="reinforcement learning, machine learning, OpenAI, Machine learning, " />
<meta property="og:title" content="Introduction to reinforcement learning by example "/>
<meta property="og:url" content="./intro-rl-toy-example" />
<meta property="og:description" content="We take a top-down approach to introducing reinforcement learning (RL) by starting with a toy example: a student going through college. In order to frame the problem from the RL point-of-view, we’ll walk through the following steps: Setting up a model of the problem as a Markov Decision Process …" />
<meta property="og:site_name" content="EFAVDB" />
<meta property="og:article:author" content="Cathy Yeh" />
<meta property="og:article:published_time" content="2020-03-11T12:00:00-07:00" />
<meta name="twitter:title" content="Introduction to reinforcement learning by example ">
<meta name="twitter:description" content="We take a top-down approach to introducing reinforcement learning (RL) by starting with a toy example: a student going through college. In order to frame the problem from the RL point-of-view, we’ll walk through the following steps: Setting up a model of the problem as a Markov Decision Process …">
<title>Introduction to reinforcement learning by example · EFAVDB
</title>
<link href="//netdna.bootstrapcdn.com/twitter-bootstrap/2.3.2/css/bootstrap-combined.min.css" rel="stylesheet">
<link href="https://fonts.googleapis.com/css?family=Fira+Sans:300,400,700" rel="stylesheet" type='text/css' />
<link href="https://fonts.googleapis.com/css?family=Cardo:400,700" rel="stylesheet" type='text/css' />
<link rel="stylesheet" type="text/css" href="./theme/css/elegant.prod.css" media="screen">
<link rel="stylesheet" type="text/css" href="./theme/css/custom.css" media="screen">
<link rel="apple-touch-icon" sizes="180x180" href="./images/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="./images/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="./images/favicon-16x16.png">
<link rel="manifest" href="./images/site.webmanifest">
<link rel="mask-icon" href="./images/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="theme-color" content="#ffffff">
</head>
<body>
<div id="content">
<div class="navbar navbar-static-top">
<div class="navbar-inner">
<div class="container-fluid">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<a class="brand" href="./"><span class=site-name>EFAVDB</span></a>
<div class="nav-collapse collapse">
<ul class="nav pull-right top-menu">
<li >
<a href=".">Home</a>
</li>
<li ><a href="./pages/authors.html">Authors</a></li>
<li ><a href="./categories.html">Categories</a></li>
<li ><a href="./tags.html">Tags</a></li>
<li ><a href="./archives.html">Archives</a></li>
<li><form class="navbar-search" action="./search.html" onsubmit="return validateForm(this.elements['q'].value);"> <input type="text" class="search-query" placeholder="Search" name="q" id="tipue_search_input"></form></li>
</ul>
</div>
</div>
</div>
</div>
<div class="container-fluid">
<div class="row-fluid">
<div class="span1"></div>
<div class="span10">
<article itemscope>
<div class="row-fluid">
<header class="page-header span8 offset2">
<h1>
<a href="./intro-rl-toy-example">
Introduction to reinforcement learning by example
</a>
</h1>
</header>
<div class="span2"></div>
</div>
<div class="row-fluid">
<div class="span8 offset2 article-content">
<p>We take a top-down approach to introducing reinforcement learning (<span class="caps">RL</span>) by starting with a toy example: a student going through college. In order to frame the problem from the <span class="caps">RL</span> point-of-view, we’ll walk through the following steps:</p>
<ul>
<li><strong>Setting up a model of the problem</strong> as a Markov Decision Process, the framework that underpins the <span class="caps">RL</span> approach to sequential decision-making problems</li>
<li><strong>Deciding on an objective</strong>: maximize rewards</li>
<li><strong>Writing down an equation whose solution is our objective</strong>: Bellman equations</li>
</ul>
<p>David Silver walks through this example in his <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">lecture notes</a> on <span class="caps">RL</span>, but as far as we can tell, does not provide code, so we’re sharing our implementation, comprising:</p>
<ul>
<li>the student’s college <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/discrete_limit_env.py">environment</a> using the OpenAI gym package.</li>
<li>a <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP.ipynb">jupyter notebook</a> sampling from the model</li>
</ul>
<h2>Student in toy college</h2>
<p>We model the student as an agent in a college environment who can move between five states: <span class="caps">CLASS</span> 1, 2, 3, the <span class="caps">FACEBOOK</span> state, and the <span class="caps">SLEEP</span> state. The states are represented by the four circles and the square in the diagram below. The <span class="caps">SLEEP</span> state — the square with no outward-bound arrows — is a terminal state, i.e. once the student reaches it, her journey is finished.</p>
<p><img alt="student MDP" src="./images/student_mdp.png"></p>
<p>Actions that a student can take in her current state are labeled in red (facebook/quit/study/sleep/pub) and influence which state she’ll find herself in next.</p>
<p>In this model, most state transitions are deterministic functions of the action taken in the current state, e.g. if she decides to study in <span class="caps">CLASS</span> 1, then she’ll definitely advance to <span class="caps">CLASS</span> 2. The single non-deterministic transition occurs when she goes pubbing in <span class="caps">CLASS</span> 3, where the pubbing action is indicated by a solid dot; she can end up in <span class="caps">CLASS</span> 1, <span class="caps">CLASS</span> 2, or back in <span class="caps">CLASS</span> 3 with probability 0.2, 0.4, or 0.4, respectively, depending on how reckless the pubbing was.</p>
<p>The model also specifies the reward <span class="math">\(R\)</span> associated with taking an action in one state and ending up in the next. In this example, the dynamics <span class="math">\(p(s',r|s,a)\)</span> are given to us, i.e. we have a full model of the environment, and, hopefully, the rewards have been designed to capture the actual end goal of the student.</p>
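<p>To make the model concrete, here is a minimal sketch of the diagram encoded as a transition table in Python, using the <code>P[s][a] = [(prob, next_state, reward, done), ...]</code> convention of gym’s toy-text environments. The state/action names and reward numbers are our transcription of the figure above, not a copy of the linked environment code.</p>
<div class="highlight"><pre><code># Student MDP transition table: P[state][action] = [(prob, next_state, reward, done), ...]
CLASS1, CLASS2, CLASS3, FACEBOOK, SLEEP = range(5)
STATE_NAMES = ["CLASS1", "CLASS2", "CLASS3", "FACEBOOK", "SLEEP"]

P = {
    CLASS1: {
        "facebook": [(1.0, FACEBOOK, -1, False)],
        "study":    [(1.0, CLASS2,   -2, False)],
    },
    CLASS2: {
        "study": [(1.0, CLASS3, -2, False)],
        "sleep": [(1.0, SLEEP,   0, True)],
    },
    CLASS3: {
        "study": [(1.0, SLEEP, 10, True)],
        # the single stochastic transition: pubbing from CLASS 3
        "pub":   [(0.2, CLASS1, 1, False),
                  (0.4, CLASS2, 1, False),
                  (0.4, CLASS3, 1, False)],
    },
    FACEBOOK: {
        "facebook": [(1.0, FACEBOOK, -1, False)],
        "quit":     [(1.0, CLASS1,    0, False)],
    },
    SLEEP: {},  # terminal state: no available actions
}
</code></pre></div>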
<h2>Markov Decision Process</h2>
<p>Formally, we’ve modeled the student’s college experience as a finite Markov Decision Process (<span class="caps">MDP</span>). The dynamics are Markov because the probability of ending up in the next state depends only on the current state and action, not on any history leading up to the current state. The Markov property is integral to the simplification of the equations that describe the model, which we’ll see in a bit.</p>
<p>The components of an <span class="caps">MDP</span> are:</p>
<ul>
<li><span class="math">\(S\)</span> - the set of possible states</li>
<li><span class="math">\(R\)</span> - the set of (scalar) rewards</li>
<li><span class="math">\(A\)</span> - the set of possible actions in each state</li>
</ul>
<p>The dynamics of the system are described by the probability of receiving a reward and landing in the next state, given the current state and action taken, <span class="math">\(p(s',r|s,a)\)</span>. In this example, the <span class="caps">MDP</span> is finite because there are a finite number of states, rewards, and actions.</p>
<p>The student’s agency in this environment comes from how she decides to act in each state. The mapping from states to actions is the <strong>policy</strong>, <span class="math">\(\pi(a|s) := p(a|s)\)</span>, which can be a deterministic or stochastic function of her state.</p>
<p>Suppose we have an indifferent student who always chooses actions randomly. We can sample from the <span class="caps">MDP</span> to get some example trajectories the student might experience with this policy. In the sample trajectories below, the states are enclosed in parentheses <code>(STATE)</code>, and actions enclosed in square brackets <code>[action]</code>.</p>
<p><strong>Sample trajectories</strong>:</p>
<div class="highlight"><pre><span></span><code><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS2</span><span class="p">)</span><span class="o">--[</span><span class="n">sleep</span><span class="o">]--></span><span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS2</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS3</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS2</span><span class="p">)</span><span class="o">--[</span><span class="n">sleep</span><span class="o">]--></span><span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="c1">--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[study]-->(CLASS2)--[study]-->(CLASS3)--[pub]-->(CLASS2)--[study]-->(CLASS3)--[study]-->(SLEEP)</span>
</code></pre></div>
<p><strong>Rewards following a random policy</strong>:</p>
<p>Under this random policy, what total reward would the student expect when starting from any of the states? We can estimate the expected rewards by summing up the rewards per trajectory and plotting the distributions of total rewards per starting state:</p>
<p><img alt="histogram of sampled returns" src="./images/intro_rl_histogram_sampled_returns.png"></p>
<h2>Maximizing rewards: discounted return and value functions</h2>
<p>We’ve just seen how we can estimate rewards starting from each state given a random policy. Next, we’ll formalize our goal in terms of maximizing returns.</p>
<h3>Returns</h3>
<p>We simply summed the rewards from the sample trajectories above, but the quantity we often want to maximize in practice is the <strong>discounted return <span class="math">\(G_t\)</span></strong>, which is a sum of the weighted rewards:</p>
<div class="math">\begin{eqnarray}\label{return} \tag{1}
G_t := R_{t+1} + \gamma R_{t+2} + … = \sum_{k=0}^\infty \gamma^k R_{t+k+1}
\end{eqnarray}</div>
<p>where <span class="math">\(0 \leq \gamma \leq 1\)</span>. <span class="math">\(\gamma\)</span> is the <em>discount rate</em> which characterizes how much we weight rewards now vs. later. Discounting is mathematically useful for avoiding infinite returns in MDPs without a terminal state and allows us to account for uncertainty in the future when we don’t have a perfect model of the environment.</p>
<p><strong>Aside</strong></p>
<p>The discount factor introduces a time scale since it says that we don’t care about rewards that are far in the future. The half-life (actually, the <span class="math">\(1/e\)</span> life) of a reward in units of time steps is <span class="math">\(1/(1-\gamma)\)</span>, which comes from a definition of <span class="math">\(1/e\)</span>:</p>
<div class="math">\begin{align}
\frac{1}{e} = \lim_{n \rightarrow \infty} \left(1 - \frac{1}{n} \right)^n
\end{align}</div>
<p><span class="math">\(\gamma = 0.99\)</span> is often used in practice, which corresponds to a half-life of 100 timesteps since <span class="math">\(0.99^{100} = (1 - 1/100)^{100} \approx 1/e\)</span>.</p>
<h3>Value functions</h3>
<p>Earlier, we were able to estimate the expected undiscounted returns starting from each state by sampling from the <span class="caps">MDP</span> under a random policy. Value functions formalize this notion of the “goodness” of being in a state.</p>
<h4>State value function <span class="math">\(v\)</span></h4>
<p>The <strong>state value function</strong> <span class="math">\(v_{\pi}(s)\)</span> is the expected return when starting in state <span class="math">\(s\)</span>, following policy <span class="math">\(\pi\)</span>.</p>
<div class="math">\begin{eqnarray}\label{state-value} \tag{2}
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]
\end{eqnarray}</div>
<p>The state value function can be written as a recursive relationship, the Bellman expectation equation, which expresses the value of a state in terms of the values of its neighbors by making use of the Markov property.</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman} \tag{3}
v_{\pi}(s) &=& \mathbb{E}_{\pi}[G_t | S_t = s] \\
&=& \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+2} | S_t = s] \\
&=& \sum_{a} \pi(a|s) \sum_{s’, r} p(s’, r | s, a) [r + \gamma v_{\pi}(s’) ]
\end{eqnarray}</div>
<p>This equation expresses the value of a state as an average over the discounted values of its neighbor / successor states, plus the expected reward for transitioning from <span class="math">\(s\)</span> to <span class="math">\(s'\)</span>; <span class="math">\(v_{\pi}\)</span> is the unique<a href="#unique">*</a> solution. The distribution of rewards depends on the student’s policy, since her actions influence her future rewards.</p>
<p><em>Note on terminology</em>:
Policy <em>evaluation</em> uses the Bellman expectation equation to solve for the value function given a policy <span class="math">\(\pi\)</span> and environment dynamics <span class="math">\(p(s', r | s, a)\)</span>. This is different from policy iteration and value iteration, which are concerned with finding an optimal policy.</p>
<p>We can solve the Bellman equation for the value function as an alternative to the sampling we did earlier for the student toy example. Since the problem has a small number of states and actions, and we have full knowledge of the environment, an exact solution is feasible by directly solving the system of linear equations or iteratively using dynamic programming. Here is the solution to (\ref{state-value-bellman}) for <span class="math">\(v\)</span> under a random policy in the student example (compare to the sample means in the histogram of returns):</p>
<p><img alt="student MDP value function random policy" src="./images/student_mdp_values_random_policy.png"></p>
<p>We can verify that the solution is self-consistent by spot checking the value of a state in terms of the values of its neighboring states according to the Bellman equation, e.g. the <span class="caps">CLASS1</span> state with <span class="math">\(v_{\pi}(\text{CLASS1}) = -1.3\)</span>:</p>
<div class="math">$$
v_{\pi}(\text{CLASS1}) = 0.5 [-2 + 2.7] + 0.5 [-1 + -2.3] = -1.3
$$</div>
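<p>For a model this small, we can compute <span class="math">\(v_{\pi}\)</span> for the random policy directly, either by solving the linear system or by iterating the Bellman expectation equation to convergence. Here is a minimal sketch of the iterative version over the <code>P</code> dict from above (our own toy implementation, with <span class="math">\(\gamma = 1\)</span> to match the figure):</p>
<div class="highlight"><pre><code>def evaluate_random_policy(P, gamma=1.0, tol=1e-8):
    """Iterative policy evaluation of the uniform random policy via the Bellman expectation equation."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:                  # terminal state keeps value 0
                continue
            new_v = 0.0
            for a, transitions in actions.items():
                pi_a = 1.0 / len(actions)    # uniform random policy
                new_v += pi_a * sum(p * (r + gamma * v[s2])
                                    for p, s2, r, done in transitions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta &lt; tol:
            return v

v = evaluate_random_policy(P)
print(round(v[CLASS1], 1))   # reproduces the -1.3 spot check above
</code></pre></div>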
<h4>Action value function <span class="math">\(q\)</span></h4>
<p>Another value function is the action value function <span class="math">\(q_{\pi}(s, a)\)</span>, which is the expected return from a state <span class="math">\(s\)</span> if we follow a policy <span class="math">\(\pi\)</span> after taking an action <span class="math">\(a\)</span>:</p>
<div class="math">\begin{eqnarray}\label{action-value} \tag{4}
q_{\pi}(s, a) := \mathbb{E}_{\pi} [ G_t | S_t = s, A_t = a ]
\end{eqnarray}</div>
<p>We can also write <span class="math">\(v\)</span> and <span class="math">\(q\)</span> in terms of each other. For example, the state value function can be viewed as an average over the action value functions for that state, weighted by the probability of taking each action, <span class="math">\(\pi\)</span>, from that state:</p>
<div class="math">\begin{eqnarray}\label{state-value-one-step-backup} \tag{5}
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)
\end{eqnarray}</div>
<p>Rewriting <span class="math">\(v\)</span> in terms of <span class="math">\(q\)</span> in (\ref{state-value-one-step-backup}) is useful later for thinking about the “advantage”, <span class="math">\(A(s,a)\)</span>, of taking an action in a state, namely how much better is an action in that state than the average?</p>
<div class="math">\begin{align}
A(s,a) \equiv q(s,a) - v(s)
\end{align}</div>
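<p>Given the environment model and a state value function, the corresponding action values and advantages follow from a one-step lookahead. A small sketch, reusing the random-policy <span class="math">\(v\)</span> computed above (the helper name <code>q_from_v</code> is our own):</p>
<div class="highlight"><pre><code>def q_from_v(P, v, s, a, gamma=1.0):
    """One-step lookahead: q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r, done in P[s][a])

# action values and advantages in CLASS 3 under the random policy
for a in sorted(P[CLASS3]):
    q = q_from_v(P, v, CLASS3, a)
    print(a, "q =", round(q, 2), "advantage =", round(q - v[CLASS3], 2))
</code></pre></div>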
<hr>
<p><strong>Why <span class="math">\(q\)</span> in addition to <span class="math">\(v\)</span>?</strong></p>
<p>Looking ahead, we almost never have access to the environment dynamics in real-world problems, but solving for <span class="math">\(q\)</span> instead of <span class="math">\(v\)</span> lets us get around this problem; we can figure out the best action to take in a state using <span class="math">\(q\)</span> alone (we expand on this in the <a href="#optimalq">discussion</a> below on the Bellman optimality equation for <span class="math">\(q_*\)</span>).</p>
<p>A concrete example of using <span class="math">\(q\)</span> is provided in our <a href="./multiarmed-bandits">post</a> on multiarmed bandits (an example of a simple single-state <span class="caps">MDP</span>), which discusses agents/algorithms that don’t have access to the true environment dynamics. The strategy amounts to estimating the action value of each slot machine arm and using those estimates to decide which arms to pull in order to maximize rewards.</p>
<hr>
<h2>Optimal value and policy</h2>
<p>The crux of the <span class="caps">RL</span> problem is finding a policy that maximizes the expected return. A policy <span class="math">\(\pi\)</span> is defined to be at least as good as another policy <span class="math">\(\pi'\)</span> if <span class="math">\(v_{\pi}(s) \geq v_{\pi'}(s)\)</span> for all states. We are guaranteed<a href="#unique">*</a> an optimal state value function <span class="math">\(v_*\)</span>, which corresponds to one or more optimal policies <span class="math">\(\pi_*\)</span>.</p>
<p>Recall that the value function for an arbitrary policy can be written in terms of an average over the action values for that state (\ref{state-value-one-step-backup}). In contrast, the optimal value function <span class="math">\(v_*\)</span> must be consistent with following a policy that selects the action that maximizes the action value functions from a state, i.e. taking a <span class="math">\(\max\)</span> (\ref{state-value-bellman-optimality}) instead of an average (\ref{state-value-one-step-backup}) over <span class="math">\(q\)</span>, leading to the <strong>Bellman optimality equation</strong> for <span class="math">\(v_*\)</span>:</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman-optimality} \tag{6}
v_*(s) &=& \max_a q_{\pi*}(s, a) \\
&=& \max_a \mathbb{E}_{\pi*} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] \\
&=& \max_a \sum_{s’, r} p(s’, r | s, a) [r + \gamma v_*(s’) ]
\end{eqnarray}</div>
<p>The optimal policy immediately follows: take the action in a state that maximizes the right hand side of (\ref{state-value-bellman-optimality}). The <a href="https://en.wikipedia.org/wiki/Bellman_equation#Bellman's_Principle_of_Optimality">principle of optimality</a>, which applies to the Bellman optimality equation, means that this greedy policy actually corresponds to the optimal policy! Note: Unlike the Bellman expectation equations, the Bellman optimality equations are a nonlinear system of equations due to taking the max.</p>
<p>The Bellman optimality equation for the action value function <span class="math">\(q_*(s,a)\)</span><a name="optimalq"></a> is:</p>
<div class="math">\begin{eqnarray}\label{action-value-bellman-optimality} \tag{7}
q_*(s, a) &=& \mathbb{E}_{\pi*} [R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}', a') | S_t = s, A_t = a] \\
&=& \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a') ]
\end{eqnarray}</div>
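<p>We defer the full solution of the student example to a later post, but to show how (\ref{state-value-bellman-optimality}) turns into an algorithm when the dynamics are known, here is a minimal value-iteration sketch over the same <code>P</code> dict (our own quick implementation, not the one from that post):</p>
<div class="highlight"><pre><code>def value_iteration(P, gamma=1.0, tol=1e-8):
    """Iterate the Bellman optimality equation to get v_*, then read off a greedy policy."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:                  # terminal state keeps value 0
                continue
            best = max(sum(p * (r + gamma * v[s2]) for p, s2, r, done in trans)
                       for trans in actions.values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta &lt; tol:
            break
    # greedy policy: pick the action maximizing the one-step lookahead in each state
    policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * v[s2])
                                                for p, s2, r, done in actions[a]))
              for s, actions in P.items() if actions}
    return v, policy

v_star, pi_star = value_iteration(P)
</code></pre></div>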
<hr>
<p>Looking ahead: in practice, without knowledge of the environment dynamics, <span class="caps">RL</span> algorithms based on value functions can approximate the expectation in (\ref{action-value-bellman-optimality}) by sampling, i.e. by interacting with the environment, and iteratively selecting the action that maximizes <span class="math">\(q\)</span> in each state the agent lands in along its trajectory. This works because the maximum appears <strong>inside</strong> the summation in (\ref{action-value-bellman-optimality}). The same sampling approach doesn’t work for (\ref{state-value-bellman-optimality}), where the maximum sits <strong>outside</strong> the summation; that’s why action value functions are so useful when we lack a model of the environment!</p>
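<p>To make the sampling remark concrete, here is the shape of such an update: a tabular, Q-learning-style sketch that replaces the sum over <span class="math">\(p(s', r | s, a)\)</span> with sampled transitions (this is a preview, not code from this post; the step size <code>alpha</code> and the epsilon-greedy exploration are our own assumptions):</p>
<div class="highlight"><pre><code>import random
from collections import defaultdict

def sample_step(P, s, a, rng=random):
    """Sample (next_state, reward, done) instead of using p(s', r | s, a) directly."""
    transitions = P[s][a]
    probs = [t[0] for t in transitions]
    _, s2, r, done = rng.choices(transitions, weights=probs)[0]
    return s2, r, done

def q_learning_episode(P, Q, start, alpha=0.1, gamma=1.0, eps=0.1, rng=random):
    s = start
    while P[s]:                                       # until a terminal state
        actions = sorted(P[s])
        if rng.random() &lt; eps:                        # explore occasionally
            a = rng.choice(actions)
        else:                                         # otherwise act greedily w.r.t. current Q
            a = max(actions, key=lambda act: Q[(s, act)])
        s2, r, done = sample_step(P, s, a, rng)
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in P[s2])
        Q[(s, a)] += alpha * (target - Q[(s, a)])     # move the estimate toward the sampled target
        s = s2

Q = defaultdict(float)
for _ in range(10000):
    q_learning_episode(P, Q, CLASS1)
</code></pre></div>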
<hr>
<p>Here are the optimal state value function and policy for the student example, which we solve for in a later post:</p>
<p><img alt="student MDP optimal value function" src="./images/student_mdp_optimal_values.png"></p>
<p>Comparing the two solutions, the value of every state under the optimal policy is at least as high as its value under the random policy, and strictly higher in every non-terminal state.</p>
<h2>Summary</h2>
<p>We’ve discussed how the problem of sequential decision making can be framed as an <span class="caps">MDP</span> using the student toy <span class="caps">MDP</span> as an example. The goal in <span class="caps">RL</span> is to figure out a policy — what actions to take in each state — that maximizes our returns.</p>
<p>MDPs provide a framework for approaching the problem by defining the value of each state, the value functions, and using the value functions to define what a “best policy” means. The value functions are unique solutions to the Bellman equations, and the <span class="caps">MDP</span> is “solved” when we know the optimal value function.</p>
<p>Much of reinforcement learning centers around trying to solve these equations under different conditions, e.g. unknown environment dynamics and large — possibly continuous — states and/or action spaces that require approximations to the value functions.</p>
<p>We’ll discuss how we arrived at the solutions for this toy problem in a future post!</p>
<h3>Example code</h3>
<p>Code for sampling from the student environment under a random policy in order to generate the trajectories and histograms of returns is available in this <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP.ipynb">jupyter notebook</a>.</p>
<p>The <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/discrete_limit_env.py">code</a> for the student environment creates an environment with an <span class="caps">API</span> that is compatible with OpenAI gym — specifically, it is derived from the <code>gym.envs.toy_text.DiscreteEnv</code> environment.</p>
<p><a name="unique"><em></a>The uniqueness of the solution to the Bellman equations for finite MDPs is stated without proof in Ref [2], but Ref [1] motivates it briefly via the </em>contraction mapping theorem*.</p>
<h2>References</h2>
<p>[1] David Silver’s <span class="caps">RL</span> Course Lecture 2 - (<a href="https://www.youtube.com/watch?v=lfHX2hHRMVQ">video</a>,
<a href="https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf">slides</a>)</p>
<p>[2] Sutton and Barto -
<a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction</a> - Chapter 3: Finite Markov Decision Processes</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<p id="post-share-links">
Like this post? Share on:
<a href="https://twitter.com/intent/tweet?text=Introduction%20to%20reinforcement%20learning%20by%C2%A0example&url=/intro-rl-toy-example&hashtags=reinforcement-learning,machine-learning,openai" target="_blank" rel="nofollow noopener noreferrer" title="Share on Twitter">Twitter</a>
|
<a href="https://www.facebook.com/sharer/sharer.php?u=/intro-rl-toy-example" target="_blank" rel="nofollow noopener noreferrer" title="Share on Facebook">Facebook</a>
|
<a href="mailto:?subject=Introduction%20to%20reinforcement%20learning%20by%C2%A0example&body=/intro-rl-toy-example" target="_blank" rel="nofollow noopener noreferrer" title="Share via Email">Email</a>
</p>
<hr />
<div class="author_blurb">
<a href="" target="_blank" rel="nofollow noopener noreferrer">
<img src=/images/cy_efavdb_headshot.jpg alt="Cathy Yeh Avatar" title="Cathy Yeh">
<span class="author_name">Cathy Yeh</span>
</a>
Cathy Yeh got a PhD at UC Santa Barbara studying soft-matter/polymer physics. After stints in a translational modeling group at Pfizer in San Diego, Driver (a no longer extant startup trying to match cancer patients to clinical trials) in San Francisco, and Square, she is currently participating in the OpenAI Scholars program.
</div>
<hr/>
<aside>
<nav>
<ul class="articles-timeline">
<li class="previous-article">« <a href="./world-wandering-dudes" title="Previous: A Framework for Studying Population Dynamics">A Framework for Studying Population Dynamics</a></li>
<li class="next-article"><a href="./reinforcement-learning-dynamic-programming" title="Next: Dynamic programming in reinforcement learning">Dynamic programming in reinforcement learning</a> »</li>
</ul>
</nav>
</aside>
</div>
<section id="article-sidebar" class="span2">
<h4>Published</h4>
<time itemprop="dateCreated" datetime="2020-03-11T12:00:00-07:00">Mar 11, 2020</time>
<h4>Category</h4>
<a class="category-link" href="./categories.html#machine-learning-ref">Machine learning</a>
<h4>Tags</h4>
<ul class="list-of-tags tags-in-article">
<li><a href="./tags.html#machine-learning-ref">machine learning
<span>10</span>
</a></li>
<li><a href="./tags.html#openai-ref">OpenAI
<span>4</span>
</a></li>
<li><a href="./tags.html#reinforcement-learning-ref">reinforcement learning
<span>7</span>
</a></li>
</ul>
<h4>Contact</h4>
<div id="sidebar-social-link">
<a href="https://twitter.com/efavdb" title="" target="_blank" rel="nofollow noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" aria-label="Twitter" role="img" viewBox="0 0 512 512"><rect width="512" height="512" rx="15%" fill="#1da1f3"/><path fill="#fff" d="M437 152a72 72 0 0 1-40 12 72 72 0 0 0 32-40 72 72 0 0 1-45 17 72 72 0 0 0-122 65 200 200 0 0 1-145-74 72 72 0 0 0 22 94 72 72 0 0 1-32-7 72 72 0 0 0 56 69 72 72 0 0 1-32 1 72 72 0 0 0 67 50 200 200 0 0 1-105 29 200 200 0 0 0 309-179 200 200 0 0 0 35-37"/></svg>
</a>
<a href="https://github.com/efavdb" title="" target="_blank" rel="nofollow noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" aria-label="GitHub" role="img" viewBox="0 0 512 512"><rect width="512" height="512" rx="15%" fill="#1B1817"/><path fill="#fff" d="M335 499c14 0 12 17 12 17H165s-2-17 12-17c13 0 16-6 16-12l-1-50c-71 16-86-28-86-28-12-30-28-37-28-37-24-16 1-16 1-16 26 2 40 26 40 26 22 39 59 28 74 22 2-17 9-28 16-35-57-6-116-28-116-126 0-28 10-51 26-69-3-6-11-32 3-67 0 0 21-7 70 26 42-12 86-12 128 0 49-33 70-26 70-26 14 35 6 61 3 67 16 18 26 41 26 69 0 98-60 120-117 126 10 8 18 24 18 48l-1 70c0 6 3 12 16 12z"/></svg>
</a>
<a href="https://www.youtube.com/channel/UClfvjoSiu0VvWOh5OpnuusA" title="" target="_blank" rel="nofollow noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" aria-label="YouTube" role="img" viewBox="0 0 512 512" fill="#ed1d24"><rect width="512" height="512" rx="15%"/><path d="m427 169c-4-15-17-27-32-31-34-9-239-10-278 0-15 4-28 16-32 31-9 38-10 135 0 174 4 15 17 27 32 31 36 10 241 10 278 0 15-4 28-16 32-31 9-36 9-137 0-174" fill="#fff"/><path d="m220 203v106l93-53"/></svg>
</a>
</div>
</section>
</div>
</article>
<button onclick="topFunction()" id="myBtn" title="Go to top">▲</button>
</div>
<div class="span1"></div>
</div>
</div>
</div>
<footer>
<div>
<span class="site-name"><a href= .>EFAVDB</span> - Everybody's Favorite Data Blog</a>
</div>
<!-- <div id="fpowered"> -->
<!-- Powered by: <a href="http://getpelican.com/" title="Pelican Home Page" target="_blank" rel="nofollow noopener noreferrer">Pelican</a> -->
<!-- Theme: <a href="https://elegant.oncrashreboot.com/" title="Theme Elegant Home Page" target="_blank" rel="nofollow noopener noreferrer">Elegant</a> -->
<!-- -->
<!-- </div> -->
</footer> <script src="//code.jquery.com/jquery.min.js"></script>
<script src="//netdna.bootstrapcdn.com/twitter-bootstrap/2.3.2/js/bootstrap.min.js"></script>
<script>
function validateForm(query)
{
return (query.length > 0);
}
</script>
<script>
//** Scroll to top button **
//Get the button:
mybutton = document.getElementById("myBtn");
// When the user scrolls down 30px from the top of the document, show the button
window.onscroll = function() {scrollFunction()};
function scrollFunction() {
if (document.body.scrollTop > 30 || document.documentElement.scrollTop > 30) {
mybutton.style.display = "block";
} else {
mybutton.style.display = "none";
}
}
// When the user clicks on the button, scroll to the top of the document
function topFunction() {
document.body.scrollTop = 0; // For Safari
document.documentElement.scrollTop = 0; // For Chrome, Firefox, IE and Opera
}
</script>
<script>
(function () {
if (window.location.hash.match(/^#comment-\d+$/)) {
$('#comment_thread').collapse('show');
}
})();
window.onhashchange=function(){
if (window.location.hash.match(/^#comment-\d+$/))
window.location.reload(true);
}
$('#comment_thread').on('shown', function () {
var link = document.getElementById('comment-accordion-toggle');
var old_innerHTML = link.innerHTML;
$(link).fadeOut(200, function() {
$(this).text('Click here to hide comments').fadeIn(200);
});
$('#comment_thread').on('hidden', function () {
$(link).fadeOut(200, function() {
$(this).text(old_innerHTML).fadeIn(200);
});
})
})
</script>
</body>
<!-- Theme: Elegant built for Pelican
License : MIT -->
</html>