<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Cathy Yeh" />
<meta property="og:type" content="article" />
<meta name="twitter:card" content="summary">
<meta name="keywords" content="reinforcement learning, machine learning, OpenAI, Machine learning, " />
<meta property="og:title" content="Introduction to reinforcement learning by example "/>
<meta property="og:url" content="./intro-rl-toy-example" />
<meta property="og:description" content="We take a top-down approach to introducing reinforcement learning (RL) by starting with a toy example: a student going through college. In order to frame the problem from the RL point-of-view, we’ll walk through the following steps: Setting up a model of the problem as a Markov Decision Process …" />
<meta property="og:site_name" content="EFAVDB" />
<meta property="og:article:author" content="Cathy Yeh" />
<meta property="og:article:published_time" content="2020-03-11T12:00:00-07:00" />
<meta name="twitter:title" content="Introduction to reinforcement learning by example ">
<meta name="twitter:description" content="We take a top-down approach to introducing reinforcement learning (RL) by starting with a toy example: a student going through college. In order to frame the problem from the RL point-of-view, we’ll walk through the following steps: Setting up a model of the problem as a Markov Decision Process …">
<title>Introduction to reinforcement learning by example · EFAVDB
</title>
<link href="//netdna.bootstrapcdn.com/twitter-bootstrap/2.3.2/css/bootstrap-combined.min.css" rel="stylesheet">
<link href="https://fonts.googleapis.com/css?family=Fira+Sans:300,400,700" rel="stylesheet" type='text/css' />
<link href="https://fonts.googleapis.com/css?family=Cardo:400,700" rel="stylesheet" type='text/css' />
<link rel="stylesheet" type="text/css" href="./theme/css/elegant.prod.css" media="screen">
<link rel="stylesheet" type="text/css" href="./theme/css/custom.css" media="screen">
<link rel="apple-touch-icon" sizes="180x180" href="./images/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="./images/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="./images/favicon-16x16.png">
<link rel="manifest" href="./images/site.webmanifest">
<link rel="mask-icon" href="./images/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="theme-color" content="#ffffff">
</head>
<body>
<div id="content">
<div class="navbar navbar-static-top">
<div class="navbar-inner">
<div class="container-fluid">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<a class="brand" href="./"><span class=site-name>EFAVDB</span></a>
<div class="nav-collapse collapse">
<ul class="nav pull-right top-menu">
<li >
<a href=".">Home</a>
</li>
<li ><a href="./pages/authors.html">Authors</a></li>
<li ><a href="./categories.html">Categories</a></li>
<li ><a href="./tags.html">Tags</a></li>
<li ><a href="./archives.html">Archives</a></li>
<li><form class="navbar-search" action="./search.html" onsubmit="return validateForm(this.elements['q'].value);"> <input type="text" class="search-query" placeholder="Search" name="q" id="tipue_search_input"></form></li>
</ul>
</div>
</div>
</div>
</div>
<div class="container-fluid">
<div class="row-fluid">
<div class="span1"></div>
<div class="span10">
<article itemscope>
<div class="row-fluid">
<header class="page-header span8 offset2">
<h1>
<a href="./intro-rl-toy-example">
Introduction to reinforcement learning by example
</a>
</h1>
</header>
<div class="span2"></div>
</div>
<div class="row-fluid">
<div class="span8 offset2 article-content">
<p>We take a top-down approach to introducing reinforcement learning (<span class="caps">RL</span>) by starting with a toy example: a student going through college. In order to frame the problem from the <span class="caps">RL</span> point-of-view, we’ll walk through the following steps:</p>
<ul>
<li><strong>Setting up a model of the problem</strong> as a Markov Decision Process, the framework that underpins the <span class="caps">RL</span> approach to sequential decision-making problems</li>
<li><strong>Deciding on an objective</strong>: maximize rewards</li>
<li><strong>Writing down an equation whose solution is our objective</strong>: Bellman equations</li>
</ul>
<p>David Silver walks through this example in his <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">lecture notes</a> on <span class="caps">RL</span>, but as far as we can tell, does not provide code, so we’re sharing our implementation, comprising:</p>
<ul>
<li>the student’s college <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/discrete_limit_env.py">environment</a> using the OpenAI gym package.</li>
<li>a <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP.ipynb">jupyter notebook</a> sampling from the model</li>
</ul>
<h2>Student in toy college</h2>
<p>We model the student as an agent in a college environment who can move between five states: <span class="caps">CLASS</span> 1, 2, 3, the <span class="caps">FACEBOOK</span> state, and the <span class="caps">SLEEP</span> state. The states are represented by the four circles and the square in the diagram below. The <span class="caps">SLEEP</span> state — the square with no outward-bound arrows — is a terminal state, i.e. once the student reaches it, her journey is finished.</p>
<p><img alt="student MDP" src="./images/student_mdp.png"></p>
<p>Actions that a student can take in her current state are labeled in red (facebook/quit/study/sleep/pub) and influence which state she’ll find herself in next.</p>
<p>In this model, most state transitions are deterministic functions of the action taken in the current state, e.g. if she decides to study in <span class="caps">CLASS</span> 1, then she’ll definitely advance to <span class="caps">CLASS</span> 2. The single non-deterministic transition occurs when she goes pubbing in <span class="caps">CLASS</span> 3, where the pubbing action is indicated by a solid dot; she can end up in <span class="caps">CLASS</span> 1, <span class="caps">CLASS</span> 2, or back in <span class="caps">CLASS</span> 3 with probability 0.2, 0.4, or 0.4, respectively, depending on how reckless the pubbing was.</p>
<p>The model also specifies the reward <span class="math">\(R\)</span> associated with taking an action in one state and ending up in the next. In this example, the dynamics <span class="math">\(p(s',r|s,a)\)</span> are given to us, i.e. we have a full model of the environment, and, hopefully, the rewards have been designed to capture the actual end goal of the student.</p>
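<p>To make the model concrete, here is a minimal sketch of the diagram encoded as a transition table in Python, using the <code>P[s][a] = [(prob, next_state, reward, done), ...]</code> convention of gym’s toy-text environments. The state/action names and reward numbers are our transcription of the figure above, not a copy of the linked environment code.</p>
<div class="highlight"><pre><code># Student MDP transition table: P[state][action] = [(prob, next_state, reward, done), ...]
CLASS1, CLASS2, CLASS3, FACEBOOK, SLEEP = range(5)
STATE_NAMES = ["CLASS1", "CLASS2", "CLASS3", "FACEBOOK", "SLEEP"]

P = {
    CLASS1: {
        "facebook": [(1.0, FACEBOOK, -1, False)],
        "study":    [(1.0, CLASS2,   -2, False)],
    },
    CLASS2: {
        "study": [(1.0, CLASS3, -2, False)],
        "sleep": [(1.0, SLEEP,   0, True)],
    },
    CLASS3: {
        "study": [(1.0, SLEEP, 10, True)],
        # the single stochastic transition: pubbing from CLASS 3
        "pub":   [(0.2, CLASS1, 1, False),
                  (0.4, CLASS2, 1, False),
                  (0.4, CLASS3, 1, False)],
    },
    FACEBOOK: {
        "facebook": [(1.0, FACEBOOK, -1, False)],
        "quit":     [(1.0, CLASS1,    0, False)],
    },
    SLEEP: {},  # terminal state: no available actions
}
</code></pre></div>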
<h2>Markov Decision Process</h2>
<p>Formally, we’ve modeled the student’s college experience as a finite Markov Decision Process (<span class="caps">MDP</span>). The dynamics are Markov because the probability of ending up in the next state depends only on the current state and action, not on any history leading up to the current state. The Markov property is integral to the simplification of the equations that describe the model, which we’ll see in a bit.</p>
<p>The components of an <span class="caps">MDP</span> are:</p>
<ul>
<li><span class="math">\(S\)</span> - the set of possible states</li>
<li><span class="math">\(R\)</span> - the set of (scalar) rewards</li>
<li><span class="math">\(A\)</span> - the set of possible actions in each state</li>
</ul>
<p>The dynamics of the system are described by the probability of receiving a reward and landing in the next state, given the current state and action taken, <span class="math">\(p(s',r|s,a)\)</span>. In this example, the <span class="caps">MDP</span> is finite because there are a finite number of states, rewards, and actions.</p>
<p>The student’s agency in this environment comes from how she decides to act in each state. The mapping from states to actions is the <strong>policy</strong>, <span class="math">\(\pi(a|s) := p(a|s)\)</span>, which can be a deterministic or stochastic function of her state.</p>
<p>Suppose we have an indifferent student who always chooses actions randomly. We can sample from the <span class="caps">MDP</span> to get some example trajectories the student might experience with this policy. In the sample trajectories below, the states are enclosed in parentheses <code>(STATE)</code>, and actions enclosed in square brackets <code>[action]</code>.</p>
<p><strong>Sample trajectories</strong>:</p>
<div class="highlight"><pre><span></span><code><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS2</span><span class="p">)</span><span class="o">--[</span><span class="n">sleep</span><span class="o">]--></span><span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS2</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS3</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">facebook</span><span class="o">]--></span><span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="o">--[</span><span class="n">quit</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="o">--[</span><span class="n">study</span><span class="o">]--></span><span class="p">(</span><span class="n">CLASS2</span><span class="p">)</span><span class="o">--[</span><span class="n">sleep</span><span class="o">]--></span><span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span>
<span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="c1">--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[study]-->(CLASS2)--[study]-->(CLASS3)--[pub]-->(CLASS2)--[study]-->(CLASS3)--[study]-->(SLEEP)</span>
</code></pre></div>
<p><strong>Rewards following a random policy</strong>:</p>
<p>Under this random policy, what total reward would the student expect when starting from any of the states? We can estimate the expected rewards by summing up the rewards per trajectory and plotting the distributions of total rewards per starting state:</p>
<p><img alt="histogram of sampled returns" src="./images/intro_rl_histogram_sampled_returns.png"></p>
<h2>Maximizing rewards: discounted return and value functions</h2>
<p>We’ve just seen how we can estimate rewards starting from each state given a random policy. Next, we’ll formalize our goal in terms of maximizing returns.</p>
<h3>Returns</h3>
<p>We simply summed the rewards from the sample trajectories above, but the quantity we often want to maximize in practice is the <strong>discounted return <span class="math">\(G_t\)</span></strong>, which is a sum of the weighted rewards:</p>
<div class="math">\begin{eqnarray}\label{return} \tag{1}
G_t := R_{t+1} + \gamma R_{t+2} + … = \sum_{k=0}^\infty \gamma^k R_{t+k+1}
\end{eqnarray}</div>
<p>where <span class="math">\(0 \leq \gamma \leq 1\)</span>. <span class="math">\(\gamma\)</span> is the <em>discount rate</em> which characterizes how much we weight rewards now vs. later. Discounting is mathematically useful for avoiding infinite returns in MDPs without a terminal state and allows us to account for uncertainty in the future when we don’t have a perfect model of the environment.</p>
<p><strong>Aside</strong></p>
<p>The discount factor introduces a time scale since it says that we don’t care about rewards that are far in the future. The half-life (actually, the <span class="math">\(1/e\)</span> life) of a reward in units of time steps is <span class="math">\(1/(1-\gamma)\)</span>, which comes from a definition of <span class="math">\(1/e\)</span>:</p>
<div class="math">\begin{align}
\frac{1}{e} = \lim_{n \rightarrow \infty} \left(1 - \frac{1}{n} \right)^n
\end{align}</div>
<p><span class="math">\(\gamma = 0.99\)</span> is often used in practice, which corresponds to a half-life of 100 timesteps since <span class="math">\(0.99^{100} = (1 - 1/100)^{100} \approx 1/e\)</span>.</p>
<h3>Value functions</h3>
<p>Earlier, we were able to estimate the expected undiscounted returns starting from each state by sampling from the <span class="caps">MDP</span> under a random policy. Value functions formalize this notion of the “goodness” of being in a state.</p>
<h4>State value function <span class="math">\(v\)</span></h4>
<p>The <strong>state value function</strong> <span class="math">\(v_{\pi}(s)\)</span> is the expected return when starting in state <span class="math">\(s\)</span>, following policy <span class="math">\(\pi\)</span>.</p>
<div class="math">\begin{eqnarray}\label{state-value} \tag{2}
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]
\end{eqnarray}</div>
<p>The state value function can be written as a recursive relationship, the Bellman expectation equation, which expresses the value of a state in terms of the values of its neighbors by making use of the Markov property.</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman} \tag{3}
v_{\pi}(s) &=& \mathbb{E}_{\pi}[G_t | S_t = s] \\
&=& \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+2} | S_t = s] \\
&=& \sum_{a} \pi(a|s) \sum_{s’, r} p(s’, r | s, a) [r + \gamma v_{\pi}(s’) ]
\end{eqnarray}</div>
<p>This equation expresses the value of a state as an average over the discounted values of its neighbor / successor states, plus the expected reward for transitioning from <span class="math">\(s\)</span> to <span class="math">\(s'\)</span>; <span class="math">\(v_{\pi}\)</span> is the unique<a href="#unique">*</a> solution. The distribution of rewards depends on the student’s policy, since her actions influence her future rewards.</p>
<p><em>Note on terminology</em>:
Policy <em>evaluation</em> uses the Bellman expectation equation to solve for the value function given a policy <span class="math">\(\pi\)</span> and environment dynamics <span class="math">\(p(s', r | s, a)\)</span>. This is different from policy iteration and value iteration, which are concerned with finding an optimal policy.</p>
<p>We can solve the Bellman equation for the value function as an alternative to the sampling we did earlier for the student toy example. Since the problem has a small number of states and actions, and we have full knowledge of the environment, an exact solution is feasible by directly solving the system of linear equations or iteratively using dynamic programming. Here is the solution to (\ref{state-value-bellman}) for <span class="math">\(v\)</span> under a random policy in the student example (compare to the sample means in the histogram of returns):</p>
<p><img alt="student MDP value function random policy" src="./images/student_mdp_values_random_policy.png"></p>
<p>We can verify that the solution is self-consistent by spot checking the value of a state in terms of the values of its neighboring states according to the Bellman equation, e.g. the <span class="caps">CLASS1</span> state with <span class="math">\(v_{\pi}(\text{CLASS1}) = -1.3\)</span>:</p>
<div class="math">$$
v_{\pi}(\text{CLASS1}) = 0.5 [-2 + 2.7] + 0.5 [-1 + -2.3] = -1.3
$$</div>
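<p>For a model this small, we can compute <span class="math">\(v_{\pi}\)</span> for the random policy directly, either by solving the linear system or by iterating the Bellman expectation equation to convergence. Here is a minimal sketch of the iterative version over the <code>P</code> dict from above (our own toy implementation, with <span class="math">\(\gamma = 1\)</span> to match the figure):</p>
<div class="highlight"><pre><code>def evaluate_random_policy(P, gamma=1.0, tol=1e-8):
    """Iterative policy evaluation of the uniform random policy via the Bellman expectation equation."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:                  # terminal state keeps value 0
                continue
            new_v = 0.0
            for a, transitions in actions.items():
                pi_a = 1.0 / len(actions)    # uniform random policy
                new_v += pi_a * sum(p * (r + gamma * v[s2])
                                    for p, s2, r, done in transitions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta &lt; tol:
            return v

v = evaluate_random_policy(P)
print(round(v[CLASS1], 1))   # reproduces the -1.3 spot check above
</code></pre></div>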
<h4>Action value function <span class="math">\(q\)</span></h4>
<p>Another value function is the action value function <span class="math">\(q_{\pi}(s, a)\)</span>, which is the expected return from a state <span class="math">\(s\)</span> if we follow a policy <span class="math">\(\pi\)</span> after taking an action <span class="math">\(a\)</span>:</p>
<div class="math">\begin{eqnarray}\label{action-value} \tag{4}
q_{\pi}(s, a) := \mathbb{E}_{\pi} [ G_t | S_t = s, A_t = a ]
\end{eqnarray}</div>
<p>We can also write <span class="math">\(v\)</span> and <span class="math">\(q\)</span> in terms of each other. For example, the state value function can be viewed as an average over the action value functions for that state, weighted by the probability of taking each action, <span class="math">\(\pi\)</span>, from that state:</p>
<div class="math">\begin{eqnarray}\label{state-value-one-step-backup} \tag{5}
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)
\end{eqnarray}</div>
<p>Rewriting <span class="math">\(v\)</span> in terms of <span class="math">\(q\)</span> in (\ref{state-value-one-step-backup}) is useful later for thinking about the “advantage”, <span class="math">\(A(s,a)\)</span>, of taking an action in a state, namely how much better is an action in that state than the average?</p>
<div class="math">\begin{align}
A(s,a) \equiv q(s,a) - v(s)
\end{align}</div>
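<p>Given the environment model and a state value function, the corresponding action values and advantages follow from a one-step lookahead. A small sketch, reusing the random-policy <span class="math">\(v\)</span> computed above (the helper name <code>q_from_v</code> is our own):</p>
<div class="highlight"><pre><code>def q_from_v(P, v, s, a, gamma=1.0):
    """One-step lookahead: q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r, done in P[s][a])

# action values and advantages in CLASS 3 under the random policy
for a in sorted(P[CLASS3]):
    q = q_from_v(P, v, CLASS3, a)
    print(a, "q =", round(q, 2), "advantage =", round(q - v[CLASS3], 2))
</code></pre></div>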
<hr>
<p><strong>Why <span class="math">\(q\)</span> in addition to <span class="math">\(v\)</span>?</strong></p>
<p>Looking ahead, we almost never have access to the environment dynamics in real-world problems, but solving for <span class="math">\(q\)</span> instead of <span class="math">\(v\)</span> lets us get around this problem; we can figure out the best action to take in a state using <span class="math">\(q\)</span> alone (we expand on this in the <a href="#optimalq">discussion</a> below on the Bellman optimality equation for <span class="math">\(q_*\)</span>).</p>
<p>A concrete example of using <span class="math">\(q\)</span> is provided in our <a href="./multiarmed-bandits">post</a> on multiarmed bandits (an example of a simple single-state <span class="caps">MDP</span>), which discusses agents/algorithms that don’t have access to the true environment dynamics. The strategy amounts to estimating the action value of each slot machine arm and using those estimates to decide which arms to pull in order to maximize rewards.</p>
<hr>
<h2>Optimal value and policy</h2>
<p>The crux of the <span class="caps">RL</span> problem is finding a policy that maximizes the expected return. A policy <span class="math">\(\pi\)</span> is defined to be at least as good as another policy <span class="math">\(\pi'\)</span> if <span class="math">\(v_{\pi}(s) \geq v_{\pi'}(s)\)</span> for all states. We are guaranteed<a href="#unique">*</a> an optimal state value function <span class="math">\(v_*\)</span>, which corresponds to one or more optimal policies <span class="math">\(\pi_*\)</span>.</p>
<p>Recall that the value function for an arbitrary policy can be written in terms of an average over the action values for that state (\ref{state-value-one-step-backup}). In contrast, the optimal value function <span class="math">\(v_*\)</span> must be consistent with following a policy that selects the action that maximizes the action value functions from a state, i.e. taking a <span class="math">\(\max\)</span> (\ref{state-value-bellman-optimality}) instead of an average (\ref{state-value-one-step-backup}) over <span class="math">\(q\)</span>, leading to the <strong>Bellman optimality equation</strong> for <span class="math">\(v_*\)</span>:</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman-optimality} \tag{6}
v_*(s) &=& \max_a q_{\pi*}(s, a) \\
&=& \max_a \mathbb{E}_{\pi*} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] \\
&=& \max_a \sum_{s’, r} p(s’, r | s, a) [r + \gamma v_*(s’) ]
\end{eqnarray}</div>
<p>The optimal policy immediately follows: take the action in a state that maximizes the right hand side of (\ref{state-value-bellman-optimality}). The <a href="https://en.wikipedia.org/wiki/Bellman_equation#Bellman's_Principle_of_Optimality">principle of optimality</a>, which applies to the Bellman optimality equation, means that this greedy policy actually corresponds to the optimal policy! Note: Unlike the Bellman expectation equations, the Bellman optimality equations are a nonlinear system of equations due to taking the max.</p>
<p>The Bellman optimality equation for the action value function <span class="math">\(q_*(s,a)\)</span><a name="optimalq"></a> is:</p>
<div class="math">\begin{eqnarray}\label{action-value-bellman-optimality} \tag{7}
q_*(s, a) &=& \mathbb{E}_{\pi*} [R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}', a') | S_t = s, A_t = a] \\
&=& \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a') ]
\end{eqnarray}</div>
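<p>We defer the full solution of the student example to a later post, but to show how (\ref{state-value-bellman-optimality}) turns into an algorithm when the dynamics are known, here is a minimal value-iteration sketch over the same <code>P</code> dict (our own quick implementation, not the one from that post):</p>
<div class="highlight"><pre><code>def value_iteration(P, gamma=1.0, tol=1e-8):
    """Iterate the Bellman optimality equation to get v_*, then read off a greedy policy."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:                  # terminal state keeps value 0
                continue
            best = max(sum(p * (r + gamma * v[s2]) for p, s2, r, done in trans)
                       for trans in actions.values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta &lt; tol:
            break
    # greedy policy: pick the action maximizing the one-step lookahead in each state
    policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * v[s2])
                                                for p, s2, r, done in actions[a]))
              for s, actions in P.items() if actions}
    return v, policy

v_star, pi_star = value_iteration(P)
</code></pre></div>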
<hr>
<p>Looking ahead: in practice, without knowledge of the environment dynamics, <span class="caps">RL</span> algorithms based on value functions can approximate the expectation in (\ref{action-value-bellman-optimality}) by sampling, i.e. by interacting with the environment, and iteratively selecting the action that maximizes <span class="math">\(q\)</span> in each state the agent lands in along its trajectory. This works because the maximum appears <strong>inside</strong> the summation in (\ref{action-value-bellman-optimality}). The same sampling approach doesn’t work for (\ref{state-value-bellman-optimality}), where the maximum sits <strong>outside</strong> the summation; that’s why action value functions are so useful when we lack a model of the environment!</p>
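<p>To make the sampling remark concrete, here is the shape of such an update: a tabular, Q-learning-style sketch that replaces the sum over <span class="math">\(p(s', r | s, a)\)</span> with sampled transitions (this is a preview, not code from this post; the step size <code>alpha</code> and the epsilon-greedy exploration are our own assumptions):</p>
<div class="highlight"><pre><code>import random
from collections import defaultdict

def sample_step(P, s, a, rng=random):
    """Sample (next_state, reward, done) instead of using p(s', r | s, a) directly."""
    transitions = P[s][a]
    probs = [t[0] for t in transitions]
    _, s2, r, done = rng.choices(transitions, weights=probs)[0]
    return s2, r, done

def q_learning_episode(P, Q, start, alpha=0.1, gamma=1.0, eps=0.1, rng=random):
    s = start
    while P[s]:                                       # until a terminal state
        actions = sorted(P[s])
        if rng.random() &lt; eps:                        # explore occasionally
            a = rng.choice(actions)
        else:                                         # otherwise act greedily w.r.t. current Q
            a = max(actions, key=lambda act: Q[(s, act)])
        s2, r, done = sample_step(P, s, a, rng)
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in P[s2])
        Q[(s, a)] += alpha * (target - Q[(s, a)])     # move the estimate toward the sampled target
        s = s2

Q = defaultdict(float)
for _ in range(10000):
    q_learning_episode(P, Q, CLASS1)
</code></pre></div>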
<hr>
<p>Here are the optimal state value function and policy for the student example, which we solve for in a later post:</p>
<p><img alt="student MDP optimal value function" src="./images/student_mdp_optimal_values.png"></p>
<p>Comparing the two solutions, the value of every state under the optimal policy is at least as high as its value under the random policy, and strictly higher in every non-terminal state.</p>
<h2>Summary</h2>
<p>We’ve discussed how the problem of sequential decision making can be framed as an <span class="caps">MDP</span> using the student toy <span class="caps">MDP</span> as an example. The goal in <span class="caps">RL</span> is to figure out a policy — what actions to take in each state — that maximizes our returns.</p>
<p>MDPs provide a framework for approaching the problem by defining the value of each state, the value functions, and using the value functions to define what a “best policy” means. The value functions are unique solutions to the Bellman equations, and the <span class="caps">MDP</span> is “solved” when we know the optimal value function.</p>
<p>Much of reinforcement learning centers around trying to solve these equations under different conditions, e.g. unknown environment dynamics and large — possibly continuous — states and/or action spaces that require approximations to the value functions.</p>
<p>We’ll discuss how we arrived at the solutions for this toy problem in a future post!</p>
<h3>Example code</h3>
<p>Code for sampling from the student environment under a random policy in order to generate the trajectories and histograms of returns is available in this <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP.ipynb">jupyter notebook</a>.</p>
<p>The <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/discrete_limit_env.py">code</a> for the student environment creates an environment with an <span class="caps">API</span> that is compatible with OpenAI gym — specifically, it is derived from the <code>gym.envs.toy_text.DiscreteEnv</code> environment.</p>
<p><a name="unique"><em></a>The uniqueness of the solution to the Bellman equations for finite MDPs is stated without proof in Ref [2], but Ref [1] motivates it briefly via the </em>contraction mapping theorem*.</p>
<h2>References</h2>
<p>[1] David Silver’s <span class="caps">RL</span> Course Lecture 2 - (<a href="https://www.youtube.com/watch?v=lfHX2hHRMVQ">video</a>,
<a href="https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf">slides</a>)</p>
<p>[2] Sutton and Barto -
<a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction</a> - Chapter 3: Finite Markov Decision Processes</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<p id="post-share-links">
Like this post? Share on:
<a href="https://twitter.com/intent/tweet?text=Introduction%20to%20reinforcement%20learning%20by%C2%A0example&url=/intro-rl-toy-example&hashtags=reinforcement-learning,machine-learning,openai" target="_blank" rel="nofollow noopener noreferrer" title="Share on Twitter">Twitter</a>
|
<a href="https://www.facebook.com/sharer/sharer.php?u=/intro-rl-toy-example" target="_blank" rel="nofollow noopener noreferrer" title="Share on Facebook">Facebook</a>
|
<a href="mailto:?subject=Introduction%20to%20reinforcement%20learning%20by%C2%A0example&body=/intro-rl-toy-example" target="_blank" rel="nofollow noopener noreferrer" title="Share via Email">Email</a>
</p>
<hr />
<div class="author_blurb">
<a href="" target="_blank" rel="nofollow noopener noreferrer">
<img src=/images/cy_efavdb_headshot.jpg alt="Cathy Yeh Avatar" title="Cathy Yeh">
<span class="author_name">Cathy Yeh</span>
</a>
Cathy Yeh got a PhD at UC Santa Barbara studying soft-matter/polymer physics. After stints in a translational modeling group at Pfizer in San Diego, Driver (a no longer extant startup trying to match cancer patients to clinical trials) in San Francisco, and Square, she is currently participating in the OpenAI Scholars program.
</div>
<hr/>
<aside>
<nav>
<ul class="articles-timeline">
<li class="previous-article">« <a href="./world-wandering-dudes" title="Previous: A Framework for Studying Population Dynamics">A Framework for Studying Population Dynamics</a></li>
<li class="next-article"><a href="./reinforcement-learning-dynamic-programming" title="Next: Dynamic programming in reinforcement learning">Dynamic programming in reinforcement learning</a> »</li>
</ul>
</nav>
</aside>
</div>
<section id="article-sidebar" class="span2">
<h4>Published</h4>
<time itemprop="dateCreated" datetime="2020-03-11T12:00:00-07:00">Mar 11, 2020</time>
<h4>Category</h4>
<a class="category-link" href="./categories.html#machine-learning-ref">Machine learning</a>
<h4>Tags</h4>
<ul class="list-of-tags tags-in-article">
<li><a href="./tags.html#machine-learning-ref">machine learning
<span>10</span>
</a></li>
<li><a href="./tags.html#openai-ref">OpenAI
<span>4</span>
</a></li>
<li><a href="./tags.html#reinforcement-learning-ref">reinforcement learning
<span>7</span>
</a></li>
</ul>
<h4>Contact</h4>
<div id="sidebar-social-link">
<a href="https://twitter.com/efavdb" title="" target="_blank" rel="nofollow noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" aria-label="Twitter" role="img" viewBox="0 0 512 512"><rect width="512" height="512" rx="15%" fill="#1da1f3"/><path fill="#fff" d="M437 152a72 72 0 0 1-40 12 72 72 0 0 0 32-40 72 72 0 0 1-45 17 72 72 0 0 0-122 65 200 200 0 0 1-145-74 72 72 0 0 0 22 94 72 72 0 0 1-32-7 72 72 0 0 0 56 69 72 72 0 0 1-32 1 72 72 0 0 0 67 50 200 200 0 0 1-105 29 200 200 0 0 0 309-179 200 200 0 0 0 35-37"/></svg>
</a>
<a href="https://github.com/efavdb" title="" target="_blank" rel="nofollow noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" aria-label="GitHub" role="img" viewBox="0 0 512 512"><rect width="512" height="512" rx="15%" fill="#1B1817"/><path fill="#fff" d="M335 499c14 0 12 17 12 17H165s-2-17 12-17c13 0 16-6 16-12l-1-50c-71 16-86-28-86-28-12-30-28-37-28-37-24-16 1-16 1-16 26 2 40 26 40 26 22 39 59 28 74 22 2-17 9-28 16-35-57-6-116-28-116-126 0-28 10-51 26-69-3-6-11-32 3-67 0 0 21-7 70 26 42-12 86-12 128 0 49-33 70-26 70-26 14 35 6 61 3 67 16 18 26 41 26 69 0 98-60 120-117 126 10 8 18 24 18 48l-1 70c0 6 3 12 16 12z"/></svg>
</a>
<a href="https://www.youtube.com/channel/UClfvjoSiu0VvWOh5OpnuusA" title="" target="_blank" rel="nofollow noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" aria-label="YouTube" role="img" viewBox="0 0 512 512" fill="#ed1d24"><rect width="512" height="512" rx="15%"/><path d="m427 169c-4-15-17-27-32-31-34-9-239-10-278 0-15 4-28 16-32 31-9 38-10 135 0 174 4 15 17 27 32 31 36 10 241 10 278 0 15-4 28-16 32-31 9-36 9-137 0-174" fill="#fff"/><path d="m220 203v106l93-53"/></svg>
</a>
</div>
</section>
</div>
</article>
<button onclick="topFunction()" id="myBtn" title="Go to top">▲</button>
</div>
<div class="span1"></div>
</div>
</div>
</div>
<footer>
<div>
<span class="site-name"><a href= .>EFAVDB</span> - Everybody's Favorite Data Blog</a>
</div>
<!-- <div id="fpowered"> -->
<!-- Powered by: <a href="http://getpelican.com/" title="Pelican Home Page" target="_blank" rel="nofollow noopener noreferrer">Pelican</a> -->
<!-- Theme: <a href="https://elegant.oncrashreboot.com/" title="Theme Elegant Home Page" target="_blank" rel="nofollow noopener noreferrer">Elegant</a> -->
<!-- -->
<!-- </div> -->
</footer> <script src="//code.jquery.com/jquery.min.js"></script>
<script src="//netdna.bootstrapcdn.com/twitter-bootstrap/2.3.2/js/bootstrap.min.js"></script>
<script>
function validateForm(query)
{
return (query.length > 0);
}
</script>
<script>
//** Scroll to top button **
//Get the button:
mybutton = document.getElementById("myBtn");
// When the user scrolls down 30px from the top of the document, show the button
window.onscroll = function() {scrollFunction()};
function scrollFunction() {
if (document.body.scrollTop > 30 || document.documentElement.scrollTop > 30) {
mybutton.style.display = "block";
} else {
mybutton.style.display = "none";
}
}
// When the user clicks on the button, scroll to the top of the document
function topFunction() {
document.body.scrollTop = 0; // For Safari
document.documentElement.scrollTop = 0; // For Chrome, Firefox, IE and Opera
}
</script>
<script>
(function () {
if (window.location.hash.match(/^#comment-\d+$/)) {
$('#comment_thread').collapse('show');
}
})();
window.onhashchange=function(){
if (window.location.hash.match(/^#comment-\d+$/))
window.location.reload(true);
}
$('#comment_thread').on('shown', function () {
var link = document.getElementById('comment-accordion-toggle');
var old_innerHTML = link.innerHTML;
$(link).fadeOut(200, function() {
$(this).text('Click here to hide comments').fadeIn(200);
});
$('#comment_thread').on('hidden', function () {
$(link).fadeOut(200, function() {
$(this).text(old_innerHTML).fadeIn(200);
});
})
})
</script>
</body>
<!-- Theme: Elegant built for Pelican
License : MIT -->
</html>