<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Lil'Log</title>
<link>https://lilianweng.github.io/</link>
<description>Recent content on Lil'Log</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Thu, 28 Nov 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://lilianweng.github.io/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Reward Hacking in Reinforcement Learning</title>
<link>https://lilianweng.github.io/posts/2024-11-28-reward-hacking/</link>
<pubDate>Thu, 28 Nov 2024 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2024-11-28-reward-hacking/</guid>
<description><p>Reward hacking occurs when a <a href="https://lilianweng.github.io/posts/2018-02-19-rl-overview/">reinforcement learning (RL)</a> agent <a href="https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/#exploitation-vs-exploration">exploits</a> flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect, and it is fundamentally challenging to accurately specify a reward function.</p>
<p>With the rise of <a href="https://lilianweng.github.io/posts/2019-01-31-lm/">language models</a> generalizing to a broad spectrum of tasks and RLHF becoming the de facto method for alignment training, reward hacking in RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a user&rsquo;s preference, are pretty concerning and are likely one of the major blockers for real-world deployment of more autonomous AI use cases.</p></description>
</item>
<item>
<title>Extrinsic Hallucinations in LLMs</title>
<link>https://lilianweng.github.io/posts/2024-07-07-hallucination/</link>
<pubDate>Sun, 07 Jul 2024 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2024-07-07-hallucination/</guid>
<description><p>Hallucination in large language models usually refers to the model generating unfaithful, fabricated, inconsistent, or nonsensical content. As a term, hallucination has been somewhat generalized to cases where the model makes mistakes. Here, I would like to narrow down the problem of hallucination to cases where the model output is fabricated and <strong>not grounded</strong> in either the provided context or world knowledge.</p>
<p>There are two types of hallucination:</p>
<ol>
<li>In-context hallucination: The model output should be consistent with the source content in context.</li>
<li>Extrinsic hallucination: The model output should be grounded in the pre-training dataset. However, given the size of the pre-training dataset, it is too expensive to retrieve and identify conflicts per generation. If we consider the pre-training data corpus as a proxy for world knowledge, we essentially try to ensure the model output is factual and verifiable by external world knowledge. Equally importantly, when the model does not know about a fact, it should say so.</li>
</ol>
<p>This post focuses on extrinsic hallucination. To avoid hallucination, LLMs need to (1) be factual and (2) acknowledge not knowing the answer when applicable.</p></description>
</item>
<item>
<title>Diffusion Models for Video Generation</title>
<link>https://lilianweng.github.io/posts/2024-04-12-diffusion-video/</link>
<pubDate>Fri, 12 Apr 2024 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2024-04-12-diffusion-video/</guid>
<description><p><a href="https://lilianweng.github.io/posts/2021-07-11-diffusion-models/">Diffusion models</a> have demonstrated strong results on image synthesis in recent years. Now the research community has started working on a harder task&mdash;using them for video generation. The task itself is a superset of the image case, since an image is a video of a single frame, and it is much more challenging because:</p>
<ol>
<li>It has extra requirements on temporal consistency across frames, which naturally demands more world knowledge to be encoded into the model.</li>
<li>In comparison to text or images, it is more difficult to collect large amounts of high-quality, high-dimensional video data, let alone text-video pairs.</li>
</ol>
<blockquote>
<p><br/><b>
🥑 Required Pre-read: Please make sure you have read the previous blog on <a href="https://lilianweng.github.io/posts/2021-07-11-diffusion-models/">&ldquo;What are Diffusion Models?&rdquo;</a> for image generation before continuing here.
</b><br/><br/></p></description>
</item>
<item>
<title>Thinking about High-Quality Human Data</title>
<link>https://lilianweng.github.io/posts/2024-02-05-human-data-quality/</link>
<pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2024-02-05-human-data-quality/</guid>
<description><p><span class="update">[Special thank you to <a href="https://scholar.google.com/citations?user=FRBObOwAAAAJ&amp;hl=en">Ian Kivlichan</a> for many useful pointers (e.g. the 100+ year-old Nature paper &ldquo;Vox populi&rdquo;) and nice feedback. 🙏 ]</span><br/></p>
<p>High-quality data is the fuel for modern deep learning model training. Most task-specific labeled data comes from human annotation, such as classification tasks or <a href="https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/#rl-fine-tuning-with-human-preferences">RLHF</a> labeling (which can be framed as classification) for LLM alignment training. Lots of ML techniques in this post can help with data quality, but fundamentally human data collection involves attention to detail and careful execution. The community knows the value of high-quality data, but somehow we have this subtle impression that “Everyone wants to do the model work, not the data work” (<a href="https://dl.acm.org/doi/abs/10.1145/3411764.3445518">Sambasivan et al. 2021</a>).</p></description>
</item>
<item>
<title>Adversarial Attacks on LLMs</title>
<link>https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/</link>
<pubDate>Wed, 25 Oct 2023 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/</guid>
<description><p>The use of large language models in the real world has been strongly accelerated by the launch of ChatGPT. We (including my team at OpenAI, shoutout to them) have invested a lot of effort to build default safe behavior into the model during the alignment process (e.g. via <a href="https://openai.com/research/learning-to-summarize-with-human-feedback">RLHF</a>). However, adversarial attacks or jailbreak prompts could potentially trigger the model to output something undesired.</p>
<p>A large body of groundwork on adversarial attacks targets images, which, unlike text, live in a continuous, high-dimensional space. Attacks on discrete data like text have been considered a lot more challenging, due to the lack of direct gradient signals. My past post on <a href="https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/">Controllable Text Generation</a> is quite relevant to this topic, as attacking LLMs essentially means controlling the model to output a certain type of (unsafe) content.</p></description>
</item>
<item>
<title>LLM Powered Autonomous Agents</title>
<link>https://lilianweng.github.io/posts/2023-06-23-agent/</link>
<pubDate>Fri, 23 Jun 2023 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2023-06-23-agent/</guid>
<description><p>Building agents with an LLM (large language model) as the core controller is a cool concept. Several proof-of-concept demos, such as <a href="https://github.com/Significant-Gravitas/Auto-GPT">AutoGPT</a>, <a href="https://github.com/AntonOsika/gpt-engineer">GPT-Engineer</a> and <a href="https://github.com/yoheinakajima/babyagi">BabyAGI</a>, serve as inspiring examples. The potential of LLMs extends beyond generating well-written copy, stories, essays and programs; an LLM can be framed as a powerful general problem solver.</p>
<h1 id="agent-system-overview">Agent System Overview</h1>
<p>In an LLM-powered autonomous agent system, the LLM functions as the agent&rsquo;s brain, complemented by several key components (a minimal sketch follows the list):</p>
<ul>
<li><strong>Planning</strong>
<ul>
<li>Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.</li>
<li>Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.</li>
</ul>
</li>
<li><strong>Memory</strong>
<ul>
<li>Short-term memory: I would consider all the in-context learning (See <a href="https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/">Prompt Engineering</a>) as utilizing short-term memory of the model to learn.</li>
<li>Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.</li>
</ul>
</li>
<li><strong>Tool use</strong>
<ul>
<li>The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.</li>
</ul>
</li>
</ul>
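<p>To make the division of labor concrete, here is a minimal, hypothetical sketch of wiring the three components into one loop. All names (the prompts, <code>memory</code>, <code>tools</code>, <code>llm</code>) are illustrative placeholders, not an API from the post:</p>
<pre><code># A minimal, hypothetical agent loop (placeholder functions, not a real API).
def agent_step(task, memory, tools, llm):
    subgoals = llm("Break this task into subgoals: " + task)      # Planning
    context = memory.retrieve(task)                               # Long-term memory
    for goal in subgoals.splitlines():
        action = llm("Context: " + context + "\nSubgoal: " + goal
                     + "\nName a tool or answer directly.")
        if action in tools:
            context += tools[action](goal)                        # Tool use
    memory.store(task, context)                                   # Write back for future recall
    return llm("Given " + context + ", produce the final answer to: " + task)
</code></pre>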
<img src="agent-overview.png" style="width: 100%;" class="center" />
<figcaption>Fig. 1. Overview of an LLM-powered autonomous agent system.</figcaption>
<h1 id="component-one-planning">Component One: Planning</h1>
<p>A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.</p></description>
</item>
<item>
<title>Prompt Engineering</title>
<link>https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/</link>
<pubDate>Wed, 15 Mar 2023 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/</guid>
<description><p><strong>Prompt Engineering</strong>, also known as <strong>In-Context Prompting</strong>, refers to methods for communicating with an LLM to steer its behavior toward desired outcomes <em>without</em> updating the model weights. It is an empirical science, and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.</p>
<p>This post only focuses on prompt engineering for autoregressive language models, so nothing on Cloze tests, image generation or multimodal models. At its core, the goal of prompt engineering is alignment and model steerability. Check my <a href="https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/">previous post</a> on controllable text generation.</p></description>
</item>
<item>
<title>The Transformer Family Version 2.0</title>
<link>https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/</link>
<pubDate>Fri, 27 Jan 2023 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/</guid>
<description><p>Many new Transformer architecture improvements have been proposed since my last post on <a href="https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/"><ins>&ldquo;The Transformer Family&rdquo;</ins></a> about three years ago. Here I did a big refactoring and enrichment of that 2020 post &mdash; restructuring the hierarchy of sections and improving many sections with more recent papers. Version 2.0 is a superset of the old version, about twice the length.</p>
<h1 id="notations">Notations</h1>
<table>
<thead>
<tr>
<th>Symbol</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>$d$</td>
<td>The model size / hidden state dimension / positional encoding size.</td>
</tr>
<tr>
<td>$h$</td>
<td>The number of heads in multi-head attention layer.</td>
</tr>
<tr>
<td>$L$</td>
<td>The segment length of the input sequence.</td>
</tr>
<tr>
<td>$N$</td>
<td>The total number of attention layers in the model; not considering MoE.</td>
</tr>
<tr>
<td>$\mathbf{X} \in \mathbb{R}^{L \times d}$</td>
<td>The input sequence where each element has been mapped into an embedding vector of shape $d$, same as the model size.</td>
</tr>
<tr>
<td>$\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$</td>
<td>The key weight matrix.</td>
</tr>
<tr>
<td>$\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$</td>
<td>The query weight matrix.</td>
</tr>
<tr>
<td>$\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$</td>
<td>The value weight matrix. Often we have $d_k = d_v = d$.</td>
</tr>
<tr>
<td>$\mathbf{W}^k_i, \mathbf{W}^q_i \in \mathbb{R}^{d \times d_k/h}; \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$</td>
<td>The weight matrices per head.</td>
</tr>
<tr>
<td>$\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$</td>
<td>The output weight matrix.</td>
</tr>
<tr>
<td>$\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$</td>
<td>The query embedding inputs.</td>
</tr>
<tr>
<td>$\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$</td>
<td>The key embedding inputs.</td>
</tr>
<tr>
<td>$\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$</td>
<td>The value embedding inputs.</td>
</tr>
<tr>
<td>$\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}, \mathbf{v}_i \in \mathbb{R}^{d_v}$</td>
<td>Row vectors in query, key, value matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$.</td>
</tr>
<tr>
<td>$S_i$</td>
<td>A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to.</td>
</tr>
<tr>
<td>$\mathbf{A} \in \mathbb{R}^{L \times L}$</td>
<td>The self-attention matrix between an input sequence of length $L$ and itself. $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$.</td>
</tr>
<tr>
<td>$a_{ij} \in \mathbf{A}$</td>
<td>The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$.</td>
</tr>
<tr>
<td>$\mathbf{P} \in \mathbb{R}^{L \times d}$</td>
<td>The position encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$.</td>
</tr>
</tbody>
</table>
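<p>To make the notation concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the sizes are illustrative and random matrices stand in for learned weights:</p>
<pre><code>import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

L, d, d_k, d_v = 8, 16, 16, 16           # illustrative sizes; often d_k = d_v = d
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))              # input sequence of L embedding vectors
W_q = rng.normal(size=(d, d_k))          # query weight matrix
W_k = rng.normal(size=(d, d_k))          # key weight matrix
W_v = rng.normal(size=(d, d_v))          # value weight matrix

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # query / key / value embeddings
A = softmax(Q @ K.T / np.sqrt(d_k))      # L x L self-attention matrix
out = A @ V                              # attended output, shape (L, d_v)
</code></pre>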
<h1 id="transformer-basics">Transformer Basics</h1>
<p>The <strong>Transformer</strong> (which will be referred to as &ldquo;vanilla Transformer&rdquo; to distinguish it from other enhanced versions; <a href="https://arxiv.org/abs/1706.03762">Vaswani, et al., 2017</a>) model has an encoder-decoder architecture, as commonly used in many <a href="https://lilianweng.github.io/posts/2018-06-24-attention/#born-for-translation">NMT</a> models. Later, simplified Transformers were shown to achieve great performance in language modeling tasks, as in encoder-only <a href="https://lilianweng.github.io/posts/2019-01-31-lm/#bert">BERT</a> or decoder-only <a href="https://lilianweng.github.io/posts/2019-01-31-lm/#openai-gpt">GPT</a>.</p></description>
</item>
<item>
<title>Large Transformer Model Inference Optimization</title>
<link>https://lilianweng.github.io/posts/2023-01-10-inference-optimization/</link>
<pubDate>Tue, 10 Jan 2023 10:00:00 -0700</pubDate>
<guid>https://lilianweng.github.io/posts/2023-01-10-inference-optimization/</guid>
<description><p><span class="update">[Updated on 2023-01-24: add a small section on <a href="#distillation">Distillation</a>.]</span><br/></p>
<p>Large transformer models are mainstream nowadays, creating SoTA results for a variety of tasks. They are powerful but very expensive to train and use. The extremely high inference cost, in both time and memory, is a big bottleneck for adopting a powerful transformer for solving real-world tasks at scale.</p>
<p><strong>Why is it hard to run inference for large transformer models?</strong> Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge (<a href="https://arxiv.org/abs/2211.05102">Pope et al. 2022</a>):</p></description>
</item>
<item>
<title>Some Math behind Neural Tangent Kernel</title>
<link>https://lilianweng.github.io/posts/2022-09-08-ntk/</link>
<pubDate>Thu, 08 Sep 2022 10:00:00 -0700</pubDate>
<guid>https://lilianweng.github.io/posts/2022-09-08-ntk/</guid>
<description><p>Neural networks are <a href="https://lilianweng.github.io/posts/2019-03-14-overfit/">well known</a> to be over-parameterized and can often easily fit data with near-zero training loss and decent generalization performance on the test dataset. Although all these parameters are initialized at random, the optimization process can consistently lead to similarly good outcomes. And this is true even when the number of model parameters exceeds the number of training data points.</p>
<p><strong>Neural tangent kernel (NTK)</strong> (<a href="https://arxiv.org/abs/1806.07572">Jacot et al. 2018</a>) is a kernel that explains the evolution of neural networks during training via gradient descent. It leads to great insights into why neural networks with enough width can consistently converge to a global minimum when trained to minimize an empirical loss. In this post, we will do a deep dive into the motivation and definition of NTK, as well as the proof of deterministic convergence at different initializations of neural networks with infinite width by characterizing NTK in such a setting.</p></description>
</item>
<item>
<title>Generalized Visual Language Models</title>
<link>https://lilianweng.github.io/posts/2022-06-09-vlm/</link>
<pubDate>Thu, 09 Jun 2022 15:10:30 -0700</pubDate>
<guid>https://lilianweng.github.io/posts/2022-06-09-vlm/</guid>
<description><p>Processing images to generate text, such as image captioning and visual question-answering, has been studied for years. Traditionally such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder. Given a large amount of existing literature, in this post, I would like to only focus on one approach for solving vision language tasks, which is to <em>extend pre-trained <a href="https://lilianweng.github.io/posts/2019-01-31-lm/">generalized language models</a> to be capable of consuming visual signals</em>.</p></description>
</item>
<item>
<title>Learning with not Enough Data Part 3: Data Generation</title>
<link>https://lilianweng.github.io/posts/2022-04-15-data-gen/</link>
<pubDate>Fri, 15 Apr 2022 15:10:30 -0700</pubDate>
<guid>https://lilianweng.github.io/posts/2022-04-15-data-gen/</guid>
<description><p>Here comes Part 3 on learning with not enough data (Previous: <a href="https://lilianweng.github.io/posts/2021-12-05-semi-supervised/">Part 1</a> and <a href="https://lilianweng.github.io/posts/2022-02-20-active-learning/">Part 2</a>). Let’s consider two approaches for generating synthetic data for training.</p>
<ul>
<li><strong>Augmented data</strong>. Given a set of existing training samples, we can apply a variety of augmentations, distortions and transformations to derive new data points without losing the key attributes. We have covered a bunch of augmentation methods for text and images in a <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">previous post</a> on contrastive learning. For the sake of completeness, I <em>duplicate</em> the section on data augmentation here with some edits.</li>
<li><strong>New data</strong>. Given few or even no data points, we can rely on powerful pretrained models to generate a number of <em>new</em> data points. This is especially true in recent years given the fast progress in large pretrained <a href="https://lilianweng.github.io/posts/2019-01-31-lm/">language models (LM)</a>. Few-shot prompting has been shown to be effective for LMs to learn within context without extra training.</li>
</ul>
<h1 id="data-augmentation">Data Augmentation</h1>
<p>The goal of data augmentation is to modify the input format (e.g. text wording, visual appearance) while the semantic meaning stays unchanged.</p></description>
</item>
<item>
<title>Learning with not Enough Data Part 2: Active Learning</title>
<link>https://lilianweng.github.io/posts/2022-02-20-active-learning/</link>
<pubDate>Sun, 20 Feb 2022 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2022-02-20-active-learning/</guid>
<description><!-- The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. Active learning is one paradigm to deal with not enough labeled data, when there are resources for labeling more data samples but under a limited budget. -->
<p>This is part 2 of what to do when facing a limited amount of labeled data for supervised learning tasks. This time we will get some amount of human labeling work involved, but within a budget limit, and therefore we need to be smart when selecting which samples to label.</p></description>
</item>
<item>
<title>Learning with not Enough Data Part 1: Semi-Supervised Learning</title>
<link>https://lilianweng.github.io/posts/2021-12-05-semi-supervised/</link>
<pubDate>Sun, 05 Dec 2021 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2021-12-05-semi-supervised/</guid>
<description><!-- The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. There are several paradigms in machine learning to deal with the scenario when the labels are scarce. Semi-supervised learning is one candidate, utilizing a large amount of unlabeled data in conjunction with a small amount of labeled data. -->
<p>When facing a limited amount of labeled data for supervised learning tasks, four approaches are commonly discussed.</p></description>
</item>
<item>
<title>How to Train Really Large Models on Many GPUs?</title>
<link>https://lilianweng.github.io/posts/2021-09-25-train-large/</link>
<pubDate>Fri, 24 Sep 2021 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2021-09-25-train-large/</guid>
<description><!-- Training large and deep neural networks is challenging, as it demands a large amount of GPU memory and a long training time. This post reviews several popular training parallelism paradigms, as well as a variety of model architecture and memory saving designs to make it possible to train very large neural networks across a large number of GPUs. -->
<p><span class="update">[Updated on 2022-03-13: add <a href="#ec">expert choice routing</a>.]</span><br/>
<span class="update">[Updated on 2022-06-10]: <a href="https://gregbrockman.com/">Greg</a> and I wrote a shorted and upgraded version of this post, published on OpenAI Blog: <a href="https://openai.com/blog/techniques-for-training-large-neural-networks/">&ldquo;Techniques for Training Large Neural Networks&rdquo;</a></p></description>
</item>
<item>
<title>What are Diffusion Models?</title>
<link>https://lilianweng.github.io/posts/2021-07-11-diffusion-models/</link>
<pubDate>Sun, 11 Jul 2021 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2021-07-11-diffusion-models/</guid>
<description><!-- Diffusion models are a new type of generative model, flexible enough to learn any arbitrarily complex data distribution while remaining tractable to evaluate analytically. It has been shown recently that diffusion models can generate high-quality images with performance competitive with SOTA GANs. -->
<p><span class="update">[Updated on 2021-09-19: Highly recommend this blog post on <a href="https://yang-song.github.io/blog/2021/score/">score-based generative modeling</a> by Yang Song (author of several key papers in the references)].</span><br/>
<span class="update">[Updated on 2022-08-27: Added <a href="#classifier-free-guidance">classifier-free guidance</a>, <a href="#glide">GLIDE</a>, <a href="#unclip">unCLIP</a> and <a href="#imagen">Imagen</a>.</span><br/>
<span class="update">[Updated on 2022-08-31: Added <a href="#ldm">latent diffusion model</a>.</span><br/>
<span class="update">[Updated on 2024-04-13: Added <a href="#prog-distll">progressive distillation</a>, <a href="#consistency">consistency models</a>, and the <a href="#model-architecture">Model Architecture section</a>.</span></p></description>
</item>
<item>
<title>Contrastive Representation Learning</title>
<link>https://lilianweng.github.io/posts/2021-05-31-contrastive/</link>
<pubDate>Mon, 31 May 2021 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2021-05-31-contrastive/</guid>
<description><!-- The main idea of contrastive learning is to learn representations such that similar samples stay close to each other, while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised data and has been shown to achieve good performance on a variety of vision and language tasks. -->
<p>The goal of contrastive representation learning is to learn an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings. When working with unsupervised data, contrastive learning is one of the most powerful approaches in <a href="https://lilianweng.github.io/posts/2019-11-10-self-supervised/">self-supervised learning</a>.</p>
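<p>One popular instantiation of this goal is an InfoNCE-style loss. Below is a toy NumPy sketch (an illustration, not the exact losses surveyed in the post): given an anchor, one positive and several negatives, it scores similarity and penalizes the anchor for not ranking the positive first:</p>
<pre><code>import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    # Pull the positive pair together, push negatives apart (toy InfoNCE-style loss).
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                    # for numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=16)                       # anchor embedding
loss = info_nce(a, a + 0.1 * rng.normal(size=16), rng.normal(size=(8, 16)))
</code></pre></description>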
</item>
<item>
<title>Reducing Toxicity in Language Models</title>
<link>https://lilianweng.github.io/posts/2021-03-21-lm-toxicity/</link>
<pubDate>Sun, 21 Mar 2021 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2021-03-21-lm-toxicity/</guid>
<description><!-- Toxicity prevents us from safely deploying powerful pretrained language models for real-world applications. To reduce toxicity in language models, in this post, we will delve into three aspects of the problem: training dataset collection, toxic content detection and model detoxification. -->
<p>Large pretrained <a href="https://lilianweng.github.io/posts/2019-01-31-lm/">language models</a> are trained over a sizable collection of online data. They unavoidably acquire certain toxic behaviors and biases from the Internet. Pretrained language models are very powerful and have shown great success in many NLP tasks. However, safely deploying them for practical real-world applications demands strong safety control over the model generation process.</p></description>
</item>
<item>
<title>Controllable Neural Text Generation</title>
<link>https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/</link>
<pubDate>Sat, 02 Jan 2021 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2021-01-02-controllable-text-generation/</guid>
<description><!-- The modern language model with SOTA results on many NLP tasks is trained on large scale free text on the Internet. It is challenging to steer such a model to generate content with desired attributes. Although still not perfect, there are several approaches for controllable text generation, such as guided or learned decoding strategy, smart prompt design, or fine-tuning the model with various methods. -->
<p><span class="update">[Updated on 2021-02-01: Updated to version 2.0 with several work added and many typos fixed.]</span>
<br />
<span class="update">[Updated on 2021-05-26: Add P-tuning and Prompt Tuning in the <a href="#gradient-based-search">&ldquo;prompt design&rdquo;</a> section.]</span>
<br />
<span class="update">[Updated on 2021-09-19: Add <a href="##unlikelihood-training">&ldquo;unlikelihood training&rdquo;</a>.]</span></p></description>
</item>
<item>
<title>How to Build an Open-Domain Question Answering System?</title>
<link>https://lilianweng.github.io/posts/2020-10-29-odqa/</link>
<pubDate>Thu, 29 Oct 2020 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2020-10-29-odqa/</guid>
<description><!-- A model that is capable of answering any question with regard to factual knowledge can enable many useful applications. This post delves into how we can build an Open-Domain Question Answering (ODQA) system, assuming we have access to a powerful pretrained language model. Both closed-book and open-book approaches are discussed. -->
<p><span class="update">[Updated on 2020-11-12: add <a href="#openai-api-example">an example</a> on closed-book factual QA using the OpenAI API (beta).]</span></p>
<p>A model that can answer any question with regard to factual knowledge can lead to many useful and practical applications, such as working as a chatbot or an AI assistant🤖. In this post, we will review several common approaches for building such an open-domain question answering system.</p></description>
</item>
<item>
<title>Neural Architecture Search</title>
<link>https://lilianweng.github.io/posts/2020-08-06-nas/</link>
<pubDate>Thu, 06 Aug 2020 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2020-08-06-nas/</guid>
<description><!-- Neural Architecture Search (NAS) automates network architecture engineering. It aims to learn a network topology that can achieve the best performance on a certain task. By dissecting the methods for NAS into three components: search space, search algorithm and child model evaluation strategy, this post reviews many interesting ideas for better, faster and more cost-efficient automatic neural architecture search. -->
<p>Although most popular and successful model architectures are designed by human experts, it doesn&rsquo;t mean we have explored the entire network architecture space and settled on the best option. We would have a better chance of finding the optimal solution if we adopted a systematic and automatic way of learning high-performance model architectures.</p></description>
</item>
<item>
<title>Exploration Strategies in Deep Reinforcement Learning</title>
<link>https://lilianweng.github.io/posts/2020-06-07-exploration-drl/</link>
<pubDate>Sun, 07 Jun 2020 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2020-06-07-exploration-drl/</guid>
<description><!-- Exploitation versus exploration is a critical topic in reinforcement learning. This post introduces several common approaches for better exploration in Deep RL. -->
<p><span class="update">[Updated on 2020-06-17: Add <a href="#exploration-via-disagreement">&ldquo;exploration via disagreement&rdquo;</a> in the &ldquo;Forward Dynamics&rdquo; <a href="#forward-dynamics">section</a>.</span></p>
<p><a href="https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/">Exploitation versus exploration</a> is a critical topic in Reinforcement Learning. We&rsquo;d like the RL agent to find the best solution as fast as possible. However, in the meantime, committing to solutions too quickly without enough exploration sounds pretty bad, as it could lead to local minima or total failure. Modern <a href="https://lilianweng.github.io/posts/2018-02-19-rl-overview/">RL</a> <a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">algorithms</a> that optimize for the best returns can achieve good exploitation quite efficiently, while exploration remains more like an open topic.</p></description>
</item>
<item>
<title>The Transformer Family</title>
<link>https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/</link>
<pubDate>Tue, 07 Apr 2020 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/</guid>
<description><!-- Inspired by recent progress on various enhanced versions of Transformer models, this post presents how the vanilla Transformer can be improved for longer-term attention span, less memory and computation consumption, RL task solving, etc. -->
<p><span class="update">[Updated on <mark><strong>2023-01-27</strong></mark>: After almost three years, I did a big refactoring update of this post to incorporate a bunch of new Transformer models since 2020. The enhanced version of this post is here: <a href="https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/"><mark><b>The Transformer Family Version 2.0</b></mark></a>. Please refer to that post on this topic.]</span>
<br/></p></description>
</item>
<item>
<title>Curriculum for Reinforcement Learning</title>
<link>https://lilianweng.github.io/posts/2020-01-29-curriculum-rl/</link>
<pubDate>Wed, 29 Jan 2020 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2020-01-29-curriculum-rl/</guid>
<description><!-- A curriculum is an efficient tool for humans to progressively learn from simple concepts to hard problems. It breaks down complex knowledge by providing a sequence of learning steps of increasing difficulty. In this post, we will examine how the idea of curriculum can help reinforcement learning models learn to solve complicated tasks. -->
<p><span class="update">[Updated on 2020-02-03: mentioning <a href="#pcg">PCG</a> in the &ldquo;Task-Specific Curriculum&rdquo; section.</span><br/>
<span class="update">[Updated on 2020-02-04: Add a new <a href="#curriculum-through-distillation">&ldquo;curriculum through distillation&rdquo;</a> section.</span></p></description>
</item>
<item>
<title>Self-Supervised Representation Learning</title>
<link>https://lilianweng.github.io/posts/2019-11-10-self-supervised/</link>
<pubDate>Sun, 10 Nov 2019 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2019-11-10-self-supervised/</guid>
<description><!-- Self-supervised learning opens up a huge opportunity for better utilizing unlabelled data, while learning in a supervised learning manner. This post covers many interesting ideas of self-supervised learning tasks on images, videos, and control problems. -->
<p><span class="update">[Updated on 2020-01-09: add a new section on <a href="#contrastive-predictive-coding">Contrastive Predictive Coding</a>].</span>
<br/>
<del><span class="update">[Updated on 2020-04-13: add a &ldquo;Momentum Contrast&rdquo; section on MoCo, SimCLR and CURL.]</span></del>
<br/>
<span class="update">[Updated on 2020-07-08: add a <a href="#bisimulation">&ldquo;Bisimulation&rdquo;</a> section on DeepMDP and DBC.]</span>
<br/>
<del><span class="update">[Updated on 2020-09-12: add <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/#moco--moco-v2">MoCo V2</a> and <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/#byol">BYOL</a> in the &ldquo;Momentum Contrast&rdquo; section.]</span></del>
<br/>
<span class="update">[Updated on 2021-05-31: remove section on &ldquo;Momentum Contrast&rdquo; and add a pointer to a full post on <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">&ldquo;Contrastive Representation Learning&rdquo;</a>]</span></p></description>
</item>
<item>
<title>Evolution Strategies</title>
<link>https://lilianweng.github.io/posts/2019-09-05-evolution-strategies/</link>
<pubDate>Thu, 05 Sep 2019 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2019-09-05-evolution-strategies/</guid>
<description><!-- Gradient descent is not the only option when learning optimal model parameters. Evolution Strategies (ES) works out well in the cases where we don't know the precise analytic form of an objective function or cannot compute the gradients directly. This post dives into several classic ES methods, as well as how ES can be used in deep reinforcement learning. -->
<p>Stochastic gradient descent is a universal choice for optimizing deep learning models. However, it is not the only option. With black-box optimization algorithms, you can evaluate a target function $f(x): \mathbb{R}^n \to \mathbb{R}$, even when you don&rsquo;t know the precise analytic form of $f(x)$ and thus cannot compute gradients or the Hessian matrix. Examples of black-box optimization methods include <a href="https://en.wikipedia.org/wiki/Simulated_annealing">Simulated Annealing</a>, <a href="https://en.wikipedia.org/wiki/Hill_climbing">Hill Climbing</a> and the <a href="https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method">Nelder-Mead method</a>.</p>
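<p>As a hedged illustration of the idea (a toy Gaussian-perturbation sketch in the spirit of natural evolution strategies, not the exact algorithms covered in the post), the update below ascends an estimated search gradient of a black-box $f$ using only function evaluations:</p>
<pre><code>import numpy as np

def evolve(f, x0, sigma=0.1, alpha=0.03, pop=50, iters=300, seed=0):
    # Estimate a search gradient of the black-box f from Gaussian perturbations
    # and ascend it; no analytic gradient of f is ever computed.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        eps = rng.normal(size=(pop, x.size))               # population of perturbations
        rewards = np.array([f(x + sigma * e) for e in eps])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        x += alpha / (pop * sigma) * (eps.T @ rewards)     # gradient-free ascent step
    return x

# Maximize f(x) = -||x - 3||^2 without ever computing its gradient.
x_best = evolve(lambda x: -np.sum((x - 3.0) ** 2), x0=np.zeros(5))
</code></pre></description>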
</item>
<item>
<title>Meta Reinforcement Learning</title>
<link>https://lilianweng.github.io/posts/2019-06-23-meta-rl/</link>
<pubDate>Sun, 23 Jun 2019 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2019-06-23-meta-rl/</guid>
<description><!-- Meta-RL is meta-learning on reinforcement learning tasks. After being trained over a distribution of tasks, the agent is able to solve a new task by developing a new RL algorithm with its internal activity dynamics. This post starts with the origin of meta-RL and then dives into three key components of meta-RL. -->
<p>In my earlier post on <a href="https://lilianweng.github.io/posts/2018-11-30-meta-learning/">meta-learning</a>, the problem is mainly defined in the context of few-shot classification. Here I would like to explore cases where we try to &ldquo;meta-learn&rdquo; <a href="https://lilianweng.github.io/posts/2018-02-19-rl-overview/">Reinforcement Learning (RL)</a> tasks by developing an agent that can solve unseen tasks fast and efficiently.</p></description>
</item>
<item>
<title>Domain Randomization for Sim2Real Transfer</title>
<link>https://lilianweng.github.io/posts/2019-05-05-domain-randomization/</link>
<pubDate>Sun, 05 May 2019 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2019-05-05-domain-randomization/</guid>
<description><!-- If a model or policy is mainly trained in a simulator but expected to work on a real robot, it would surely face the sim2real gap. *Domain Randomization* (DR) is a simple but powerful idea of closing this gap by randomizing properties of the training environment. -->
<p>In Robotics, one of the hardest problems is how to make your model transfer to the real world. Due to the sample inefficiency of deep RL algorithms and the cost of data collection on real robots, we often need to train models in a simulator, which theoretically provides an infinite amount of data. However, the reality gap between the simulator and the physical world often leads to failure when working with physical robots. The gap is triggered by an inconsistency between physical parameters (e.g. friction, kp, damping, mass, density) and, more fatally, incorrect physical modeling (e.g. collision between soft surfaces).</p></description>
</item>
<item>
<title>Are Deep Neural Networks Dramatically Overfitted?</title>
<link>https://lilianweng.github.io/posts/2019-03-14-overfit/</link>
<pubDate>Thu, 14 Mar 2019 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2019-03-14-overfit/</guid>
<description><!-- If you are, like me, confused by why deep neural networks can generalize to out-of-sample data points without drastic overfitting, keep on reading. -->
<p><span class="update">[Updated on 2019-05-27: add the <a href="#the-lottery-ticket-hypothesis">section</a> on Lottery Ticket Hypothesis.]</span></p>
<p>If you are like me, entering the field of deep learning with experience in traditional machine learning, you may often ponder over this question: since a typical deep neural network has so many parameters and its training error can easily be perfect, it should surely suffer from substantial overfitting. How could it ever generalize to out-of-sample data points?</p></description>
</item>
<item>
<title>Generalized Language Models</title>
<link>https://lilianweng.github.io/posts/2019-01-31-lm/</link>
<pubDate>Thu, 31 Jan 2019 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2019-01-31-lm/</guid>
<description><!-- As a follow-up to the word embedding post, we will discuss models for learning contextualized word vectors, as well as the new trend of large unsupervised pre-trained language models which have achieved amazing SOTA results on a variety of language tasks. -->
<p><span class="update">[Updated on 2019-02-14: add <a href="#ulmfit">ULMFiT</a> and <a href="#gpt-2">GPT-2</a>.]</span><br/>
<span class="update">[Updated on 2020-02-29: add <a href="#albert">ALBERT</a>.]</span><br/>
<span class="update">[Updated on 2020-10-25: add <a href="#roberta">RoBERTa</a>.]</span><br/>
<span class="update">[Updated on 2020-12-13: add <a href="#t5">T5</a>.]</span><br/>
<span class="update">[Updated on 2020-12-30: add <a href="#gpt-3">GPT-3</a>.]</span><br/>
<span class="update">[Updated on 2021-11-13: add <a href="#xlnet">XLNet</a>, <a href="#bart">BART</a> and <a href="#electra">ELECTRA</a>; Also updated the <a href="#summary">Summary</a> section.]</span></p>
<br />
<img src="elmo-and-bert.png" style="width: 60%;" class="center" />
<figcaption>Fig. 0. I guess they are Elmo & Bert? (Image source: <a href="https://www.youtube.com/watch?v=l5einDQ-Ttc" target="_blank">here</a>)</figcaption>
<p>We have seen amazing progress in NLP in 2018. Large-scale pre-trained language models like <a href="https://blog.openai.com/language-unsupervised/">OpenAI GPT</a> and <a href="https://arxiv.org/abs/1810.04805">BERT</a> have achieved great performance on a variety of language tasks using generic model architectures. The idea is similar to how ImageNet classification pre-training helps many vision tasks (*). Even better than vision classification pre-training, this simple and powerful approach in NLP does not require labeled data for pre-training, allowing us to experiment with increased training scale, up to our very limit.</p></description>
</item>
<item>
<title>Object Detection Part 4: Fast Detection Models</title>
<link>https://lilianweng.github.io/posts/2018-12-27-object-recognition-part-4/</link>
<pubDate>Thu, 27 Dec 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-12-27-object-recognition-part-4/</guid>
<description><!-- Part 4 of the "Object Detection for Dummies" series focuses on one-stage models for fast detection, including SSD, RetinaNet, and models in the YOLO family. These models skip the explicit region proposal stage but apply detection directly on densely sampled areas. -->
<p>In <a href="https://lilianweng.github.io/posts/2017-12-31-object-recognition-part-3/">Part 3</a>, we have reviewed models in the R-CNN family. All of them are region-based object detection algorithms. They can achieve high accuracy but could be too slow for certain applications such as autonomous driving. In Part 4, we only focus on fast object detection models, including SSD, RetinaNet, and models in the YOLO family.</p></description>
</item>
<item>
<title>Meta-Learning: Learning to Learn Fast</title>
<link>https://lilianweng.github.io/posts/2018-11-30-meta-learning/</link>
<pubDate>Fri, 30 Nov 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-11-30-meta-learning/</guid>
<description><!-- Meta-learning, also known as "learning to learn", intends to design models that can learn new skills or adapt to new environments rapidly with a few training examples. There are three common approaches: 1) learn an efficient distance metric (metric-based); 2) use a (recurrent) network with external or internal memory (model-based); 3) optimize the model parameters explicitly for fast learning (optimization-based). -->
<p><span class="update">[Updated on 2019-10-01: thanks to Tianhao, we have this post translated in <a href="https://wei-tianhao.github.io/blog/2019/09/17/meta-learning.html">Chinese</a>!]</span></p></description>
</item>
<item>
<title>Flow-based Deep Generative Models</title>
<link>https://lilianweng.github.io/posts/2018-10-13-flow-models/</link>
<pubDate>Sat, 13 Oct 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-10-13-flow-models/</guid>
<description><!-- In this post, we are looking into the third type of generative models: flow-based generative models. Different from GAN and VAE, they explicitly learn the probability density function of the input data. -->
<p>So far, I&rsquo;ve written about two types of generative models, <a href="https://lilianweng.github.io/posts/2017-08-20-gan/">GAN</a> and <a href="https://lilianweng.github.io/posts/2018-08-12-vae/">VAE</a>. Neither of them explicitly learns the probability density function of real data, $p(\mathbf{x})$ (where $\mathbf{x} \in \mathcal{D}$) &mdash; because it is really hard! Taking the generative model with latent variables as an example, $p(\mathbf{x}) = \int p(\mathbf{x}\vert\mathbf{z})p(\mathbf{z})d\mathbf{z}$ can hardly be calculated as it is intractable to go through all possible values of the latent code $\mathbf{z}$.</p>
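<p>To see why the marginal is expensive, here is a toy 1-D Monte Carlo estimate of $p(\mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[p(\mathbf{x}\vert\mathbf{z})]$ (an illustration with an assumed Gaussian model, not from the post); a decent estimate already needs many samples even in one dimension:</p>
<pre><code>import numpy as np

# Toy latent-variable model: z ~ N(0, 1), x|z ~ N(z, 0.5^2).
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)                 # samples from the prior p(z)
p_x = gaussian_pdf(1.3, z, 0.5).mean()       # Monte Carlo estimate of p(x = 1.3)
</code></pre></description>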
</item>
<item>
<title>From Autoencoder to Beta-VAE</title>
<link>https://lilianweng.github.io/posts/2018-08-12-vae/</link>
<pubDate>Sun, 12 Aug 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-08-12-vae/</guid>
<description><!-- Autoencoders are a family of neural network models aiming to learn compressed latent variables of high-dimensional data. Starting from the basic autoencoder model, this post reviews several variations, including denoising, sparse, and contractive autoencoders, and then Variational Autoencoder (VAE) and its modification beta-VAE. -->
<p><span class="update">[Updated on 2019-07-18: add a section on <a href="#vq-vae-and-vq-vae-2">VQ-VAE &amp; VQ-VAE-2</a>.]</span>
<br/>
<span class="update">[Updated on 2019-07-26: add a section on <a href="#td-vae">TD-VAE</a>.]</span>
<br/></p>
<p>The autoencoder was invented to reconstruct high-dimensional data using a neural network model with a narrow bottleneck layer in the middle (oops, this is probably not true for <a href="#vae-variational-autoencoder">Variational Autoencoder</a>, and we will investigate it in detail in later sections). A nice byproduct is dimension reduction: the bottleneck layer captures a compressed latent encoding. Such a low-dimensional representation can be used as an embedding vector in various applications (e.g. search), help data compression, or reveal the underlying data generative factors.</p>
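<p>For intuition, a minimal bottleneck autoencoder in PyTorch (sizes and layers are illustrative, not from the post): the encoder squeezes a 784-dim input through a 32-dim latent code and the decoder tries to reconstruct the input from it:</p>
<pre><code>import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_in))

    def forward(self, x):
        z = self.encoder(x)      # compressed latent encoding (usable as an embedding)
        return self.decoder(z)   # reconstruction of the input

model = AutoEncoder()
x = torch.rand(64, 784)                         # a dummy batch
loss = nn.functional.mse_loss(model(x), x)      # reconstruction loss to minimize
</code></pre></description>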
</item>
<item>
<title>Attention? Attention!</title>
<link>https://lilianweng.github.io/posts/2018-06-24-attention/</link>
<pubDate>Sun, 24 Jun 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-06-24-attention/</guid>
<description><!-- Attention has been a fairly popular concept and a useful tool in the deep learning community in recent years. In this post, we are gonna look into how attention was invented, and various attention mechanisms and models, such as transformer and SNAIL. -->
<p><span class="update">[Updated on 2018-10-28: Add <a href="#pointer-network">Pointer Network</a> and the <a href="https://github.com/lilianweng/transformer-tensorflow">link</a> to my implementation of Transformer.]</span><br/>
<span class="update">[Updated on 2018-11-06: Add a <a href="https://github.com/lilianweng/transformer-tensorflow">link</a> to the implementation of Transformer model.]</span><br/>
<span class="update">[Updated on 2018-11-18: Add <a href="#neural-turing-machines">Neural Turing Machines</a>.]</span><br/>
<span class="update">[Updated on 2019-07-18: Correct the mistake on using the term &ldquo;self-attention&rdquo; when introducing the <a href="https://arxiv.org/abs/1502.03044">show-attention-tell</a> paper; moved it to <a href="#self-attention">Self-Attention</a> section.]</span><br/>
<span class="update">[Updated on 2020-04-07: A follow-up post on improved Transformer models is <a href="https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/">here</a>.]</span></p></description>
</item>
<item>
<title>Implementing Deep Reinforcement Learning Models with Tensorflow + OpenAI Gym</title>
<link>https://lilianweng.github.io/posts/2018-05-05-drl-implementation/</link>
<pubDate>Sat, 05 May 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-05-05-drl-implementation/</guid>
<description><!-- Let's see how to implement a number of classic deep reinforcement learning models in code. -->
<p>The full implementation is available in <a href="https://github.com/lilianweng/deep-reinforcement-learning-gym">lilianweng/deep-reinforcement-learning-gym</a></p>
<p>In the previous two posts, I have introduced the algorithms of many deep reinforcement learning models. Now it is time to get our hands dirty and practice how to implement the models in the wild. The implementation is gonna be built in Tensorflow and the OpenAI <a href="https://github.com/openai/gym">gym</a> environment. The full version of the code in this tutorial is available in <a href="https://github.com/lilianweng/deep-reinforcement-learning-gym">[lilianweng/deep-reinforcement-learning-gym]</a>.</p></description>
</item>
<item>
<title>Policy Gradient Algorithms</title>
<link>https://lilianweng.github.io/posts/2018-04-08-policy-gradient/</link>
<pubDate>Sun, 08 Apr 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-04-08-policy-gradient/</guid>
<description><!-- Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACKTR, SAC, TD3 & SVPG. -->
<p><span class="update">[Updated on 2018-06-30: add two new policy gradient methods, <a href="#sac">SAC</a> and <a href="#d4pg">D4PG</a>.]</span>
<br/>
<span class="update">[Updated on 2018-09-30: add a new policy gradient method, <a href="#td3">TD3</a>.]</span>
<br/>
<span class="update">[Updated on 2019-02-09: add <a href="#sac-with-automatically-adjusted-temperature">SAC with automatically adjusted temperature</a>].</span>
<br/>
<span class="update">[Updated on 2019-06-26: Thanks to Chanseok, we have a version of this post in <a href="https://talkingaboutme.tistory.com/entry/RL-Policy-Gradient-Algorithms">Korean</a>].</span>
<br/>
<span class="update">[Updated on 2019-09-12: add a new policy gradient method <a href="#svpg">SVPG</a>.]</span>
<br/>
<span class="update">[Updated on 2019-12-22: add a new policy gradient method <a href="#impala">IMPALA</a>.]</span>
<br/>
<span class="update">[Updated on 2020-10-15: add a new policy gradient method <a href="#ppg">PPG</a> &amp; some new discussion in <a href="#ppo">PPO</a>.]</span>
<br/>
<span class="update">[Updated on 2021-09-19: Thanks to Wenhao &amp; 爱吃猫的鱼, we have this post in <a href="https://tomaxent.com/2019/04/14/%E7%AD%96%E7%95%A5%E6%A2%AF%E5%BA%A6%E6%96%B9%E6%B3%95/">Chinese1</a> &amp; <a href="https://paperexplained.cn/articles/article/detail/31/">Chinese2</a>].</span></p></description>
</item>
<item>
<title>A (Long) Peek into Reinforcement Learning</title>
<link>https://lilianweng.github.io/posts/2018-02-19-rl-overview/</link>
<pubDate>Mon, 19 Feb 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-02-19-rl-overview/</guid>
<description><!-- In this post, we are gonna briefly go over the field of Reinforcement Learning (RL), from fundamental concepts to classic algorithms. Hopefully, this review is helpful enough so that newbies would not get lost in specialized terms and jargon while starting. [WARNING] This is a long read. -->
<p><span class="update">[Updated on 2020-09-03: Updated the algorithm of <a href="#sarsa-on-policy-td-control">SARSA</a> and <a href="#q-learning-off-policy-td-control">Q-learning</a> so that the difference is more pronounced.]</span>
<br />
<span class="update">[Updated on 2021-09-19: Thanks to 爱吃猫的鱼, we have this post in <a href="https://paperexplained.cn/articles/article/detail/33/">Chinese</a>].</span></p></description>
</item>
<item>
<title>The Multi-Armed Bandit Problem and Its Solutions</title>
<link>https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/</link>
<pubDate>Tue, 23 Jan 2018 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2018-01-23-multi-armed-bandit/</guid>
<description><!-- The multi-armed bandit problem is a classic example demonstrating the exploration versus exploitation dilemma. This post introduces the bandit problem and how to solve it using different exploration strategies. -->
<p>The algorithms are implemented for the Bernoulli bandit in <a href="http://github.com/lilianweng/multi-armed-bandit">lilianweng/multi-armed-bandit</a>.</p>
<h1 id="exploitation-vs-exploration">Exploitation vs Exploration</h1>
<p>The exploration vs exploitation dilemma exists in many aspects of our life. Say, your favorite restaurant is right around the corner. If you go there every day, you would be confident of what you will get, but miss the chances of discovering an even better option. If you try new places all the time, very likely you are gonna have to eat unpleasant food from time to time. Similarly, online advertisers try to balance between the known most attractive ads and the new ads that might be even more successful.</p>
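<p>As a small, hedged illustration of balancing the two (a toy sketch, not the exact code in the linked repo), &epsilon;-greedy on a Bernoulli bandit exploits the best-looking arm most of the time and explores a random arm with probability &epsilon;:</p>
<pre><code>import random

class BernoulliBandit:
    # K arms; pulling arm i pays out 1 with an unknown probability probas[i].
    def __init__(self, probas):
        self.probas = probas

    def pull(self, i):
        return int(self.probas[i] > random.random())

def epsilon_greedy(bandit, k, steps=1000, eps=0.1):
    counts, values = [0] * k, [0.0] * k                 # per-arm pull counts and reward estimates
    for _ in range(steps):
        if random.random() > eps:
            i = max(range(k), key=lambda j: values[j])  # exploit the best-looking arm
        else:
            i = random.randrange(k)                     # explore a random arm
        r = bandit.pull(i)
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]        # incremental mean update
    return values

values = epsilon_greedy(BernoulliBandit([0.3, 0.5, 0.8]), k=3)
</code></pre></description>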
</item>
<item>
<title>Object Detection for Dummies Part 3: R-CNN Family</title>
<link>https://lilianweng.github.io/posts/2017-12-31-object-recognition-part-3/</link>
<pubDate>Sun, 31 Dec 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-12-31-object-recognition-part-3/</guid>
<description><!-- In Part 3, we will examine four object detection models: R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN. These models are highly related and the new versions show great speed improvement compared to the older ones. -->
<p><span class="update">[Updated on 2018-12-20: Remove YOLO here. Part 4 will cover multiple fast object detection algorithms, including YOLO.]</span>
<br/>
<span class="update">[Updated on 2018-12-27: Add <a href="#bounding-box-regression">bbox regression</a> and <a href="#common-tricks">tricks</a> sections for R-CNN.]</span></p>
<p>In the series of &ldquo;Object Detection for Dummies&rdquo;, we started with basic concepts in image processing, such as gradient vectors and HOG, in <a href="https://lilianweng.github.io/posts/2017-10-29-object-recognition-part-1/">Part 1</a>. Then we introduced classic convolutional neural network architecture designs for classification and pioneer models for object recognition, Overfeat and DPM, in <a href="https://lilianweng.github.io/posts/2017-12-15-object-recognition-part-2/">Part 2</a>. In the third post of this series, we are about to review a set of models in the R-CNN (&ldquo;Region-based CNN&rdquo;) family.</p></description>
</item>
<item>
<title>Object Detection for Dummies Part 2: CNN, DPM and Overfeat</title>
<link>https://lilianweng.github.io/posts/2017-12-15-object-recognition-part-2/</link>
<pubDate>Fri, 15 Dec 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-12-15-object-recognition-part-2/</guid>
<description><!-- Part 2 introduces several classic convolutional neural network architecture designs for image classification (AlexNet, VGG, ResNet), as well as the DPM (Deformable Parts Model) and Overfeat models for object recognition. -->
<p><a href="https://lilianweng.github.io/posts/2017-10-29-object-recognition-part-1/">Part 1</a> of the &ldquo;Object Detection for Dummies&rdquo; series introduced: (1) the concept of image gradient vector and how HOG algorithm summarizes the information across all the gradient vectors in one image; (2) how the image segmentation algorithm works to detect regions that potentially contain objects; (3) how the Selective Search algorithm refines the outcomes of image segmentation for better region proposal.</p></description>
</item>
<item>
<title>Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS</title>
<link>https://lilianweng.github.io/posts/2017-10-29-object-recognition-part-1/</link>
<pubDate>Sun, 29 Oct 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-10-29-object-recognition-part-1/</guid>
<description><!-- In this series of posts on "Object Detection for Dummies", we will go through several basic concepts, algorithms, and popular deep learning models for image processing and object detection. Hopefully, it will be a good read for people with no experience in this field who want to learn more. Part 1 introduces the concept of Gradient Vectors, the HOG (Histogram of Oriented Gradients) algorithm, and Selective Search for image segmentation. -->
<p>I&rsquo;ve never worked in the field of computer vision and have no idea how the magic works when an autonomous car is configured to tell apart a stop sign from a pedestrian in a red hat. To motivate myself to look into the maths behind object recognition and detection algorithms, I&rsquo;m writing a few posts on this topic, &ldquo;Object Detection for Dummies&rdquo;. This post, part 1, starts with super rudimentary concepts in image processing and a few methods for image segmentation. Nothing related to deep neural networks yet. Deep learning models for object detection and recognition will be discussed in <a href="https://lilianweng.github.io/posts/2017-12-15-object-recognition-part-2/">Part 2</a> and <a href="https://lilianweng.github.io/posts/2017-12-31-object-recognition-part-3/">Part 3</a>.</p></description>
</item>
<item>
<title>Learning Word Embedding</title>
<link>https://lilianweng.github.io/posts/2017-10-15-word-embedding/</link>
<pubDate>Sun, 15 Oct 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-10-15-word-embedding/</guid>
<description><!-- Word embedding is a dense representation of words in the form of numeric vectors. It can be learned using a variety of language models. The word embedding representation is able to reveal many hidden relationships between words. For example, vector("cat") - vector("kitten") is similar to vector("dog") - vector("puppy"). This post introduces several models for learning word embedding and how their loss functions are designed for the purpose. -->
<p>Human vocabulary comes in free text. In order to make a machine learning model understand and process natural language, we need to transform the free-text words into numeric values. One of the simplest transformation approaches is one-hot encoding, in which each distinct word stands for one dimension of the resulting vector and a binary value indicates whether the word is present (1) or not (0).</p>
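<p>To make the one-hot idea concrete, here is a tiny illustrative sketch (the vocabulary is made up for demonstration):</p>
<pre><code>vocab = ["cat", "kitten", "dog", "puppy"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Binary vector with a 1 in the word's own dimension, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[word_to_idx[word]] = 1
    return vec

print(one_hot("kitten"))  # [0, 1, 0, 0]
</code></pre>
<p>Note that all one-hot vectors are mutually orthogonal, so they encode no similarity between words; that limitation is exactly what dense embeddings address.</p></description>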
</item>
<item>
<title>Anatomize Deep Learning with Information Theory</title>
<link>https://lilianweng.github.io/posts/2017-09-28-information-bottleneck/</link>
<pubDate>Thu, 28 Sep 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-09-28-information-bottleneck/</guid>
<description><!-- This post is a summary of Prof Naftali Tishby's recent talk on "Information Theory in Deep Learning". It presented how to apply information theory to study the growth and transformation of deep neural networks during training. -->
<p><span class="update">Professor Naftali Tishby passed away in 2021. Hope the post can introduce his cool idea of information bottleneck to more people.</span></p>
<p>Recently I watched the talk <a href="https://youtu.be/bLqJHjXihK8">&ldquo;Information Theory in Deep Learning&rdquo;</a> by Prof Naftali Tishby and found it very interesting. He presented how to apply information theory to study the growth and transformation of deep neural networks during training. Using the <a href="https://arxiv.org/pdf/physics/0004057.pdf">Information Bottleneck (IB)</a> method, he proposed a new learning bound for deep neural networks (DNN), as traditional learning theory fails due to the exponentially large number of parameters. Another keen observation is that DNN training involves two distinct phases: first, the network is trained to fully represent the input data and minimize the generalization error; then, it learns to forget the irrelevant details by compressing the representation of the input.</p>
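<p>For reference, the IB method learns a compressed representation $T$ of the input $X$ that stays informative about the label $Y$ by optimizing the standard IB Lagrangian from the linked paper:</p>
<p>$$\min_{p(t \mid x)} \; I(X; T) - \beta I(T; Y)$$</p>
<p>where $\beta$ controls the trade-off between compressing $X$ and preserving information about $Y$.</p></description>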
</item>
<item>
<title>From GAN to WGAN</title>
<link>https://lilianweng.github.io/posts/2017-08-20-gan/</link>
<pubDate>Sun, 20 Aug 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-08-20-gan/</guid>
<description><!-- This post explains the maths behind a generative adversarial network (GAN) model and why it is hard to train. Wasserstein GAN is intended to improve GANs' training by adopting a smooth metric for measuring the distance between two probability distributions. -->
<p><span class="update">[Updated on 2018-09-30: thanks to Yoonju, we have this post translated in <a href="https://github.com/yjucho1/articles/blob/master/fromGANtoWGAN/readme.md">Korean</a>!]</span>
<br/>
<span class="update">[Updated on 2019-04-18: this post is also available on <a href="https://arxiv.org/abs/1904.08994">arXiv</a>.]</span></p>
<p><a href="https://arxiv.org/pdf/1406.2661.pdf">Generative adversarial network</a> (GAN) has shown great results in many generative tasks to replicate the real-world rich content such as images, human language, and music. It is inspired by game theory: two models, a generator and a critic, are competing with each other while making each other stronger at the same time. However, it is rather challenging to train a GAN model, as people are facing issues like training instability or failure to converge.</p></description>
</item>
<item>
<title>How to Explain the Prediction of a Machine Learning Model?</title>
<link>https://lilianweng.github.io/posts/2017-08-01-interpretation/</link>
<pubDate>Tue, 01 Aug 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-08-01-interpretation/</guid>
<description><!-- This post reviews some research in model interpretability, covering two aspects: (i) interpretable models with model-specific interpretation methods and (ii) approaches to explaining black-box models. I include an open discussion on explainable artificial intelligence at the end. -->
<p>Machine learning models have started penetrating critical areas like health care, justice systems, and the financial industry. Thus, figuring out how the models make decisions and making sure the decision process is aligned with ethical requirements or legal regulations has become a necessity.</p></description>
</item>
<item>
<title>Predict Stock Prices Using RNN: Part 2</title>
<link>https://lilianweng.github.io/posts/2017-07-22-stock-rnn-part-2/</link>
<pubDate>Sat, 22 Jul 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-07-22-stock-rnn-part-2/</guid>
<description><!-- This post continues the tutorial on how to build a recurrent neural network using Tensorflow to predict stock market prices. Part 2 attempts to predict the prices of multiple stocks using embeddings. The full working code is available in [lilianweng/stock-rnn](https://github.com/lilianweng/stock-rnn). -->
<p>In the Part 2 tutorial, I continue the topic of stock price prediction and extend the recurrent neural network built in <a href="https://lilianweng.github.io/posts/2017-07-08-stock-rnn-part-1/">Part 1</a> with the capability of handling multiple stocks. In order to distinguish the patterns associated with different price sequences, I use stock symbol embedding vectors as part of the input.</p>
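<p>A minimal sketch of that idea (illustrative numpy code, not the post&rsquo;s actual Tensorflow implementation; all names and sizes are made up):</p>
<pre><code>import numpy as np

n_symbols, embed_dim, window = 100, 8, 30
embedding = np.random.randn(n_symbols, embed_dim) * 0.01  # learned during training in practice

def build_input(price_window, symbol_id):
    """Concatenate the stock symbol's embedding onto a window of prices,
    so one model can tell different stocks' sequences apart."""
    return np.concatenate([price_window, embedding[symbol_id]])

x = build_input(np.random.randn(window), symbol_id=7)
print(x.shape)  # (38,) = 30 price steps + 8 embedding dims
</code></pre></description>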
</item>
<item>
<title>Predict Stock Prices Using RNN: Part 1</title>
<link>https://lilianweng.github.io/posts/2017-07-08-stock-rnn-part-1/</link>
<pubDate>Sat, 08 Jul 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-07-08-stock-rnn-part-1/</guid>
<description><!-- This post is a tutorial on how to build a recurrent neural network using Tensorflow to predict stock market prices. Part 1 focuses on predicting the S&P 500 index. The full working code is available in [lilianweng/stock-rnn](https://github.com/lilianweng/stock-rnn). -->
<p>This is a tutorial on how to build a recurrent neural network using Tensorflow to predict stock market prices. The full working code is available in <a href="https://github.com/lilianweng/stock-rnn">github.com/lilianweng/stock-rnn</a>. If you don&rsquo;t know what a recurrent neural network or an LSTM cell is, feel free to check <a href="https://lilianweng.github.io/posts/2017-06-21-overview/#recurrent-neural-network">my previous post</a>.</p></description>
</item>
<item>
<title>An Overview of Deep Learning for Curious People</title>
<link>https://lilianweng.github.io/posts/2017-06-21-overview/</link>
<pubDate>Wed, 21 Jun 2017 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/posts/2017-06-21-overview/</guid>
<description><!-- Starting earlier this year, I grew a strong curiosity about deep learning and spent some time reading about this field. To document what I’ve learned and to provide some interesting pointers to people with similar interests, I wrote this overview of deep learning models and their applications. -->
<p><span style="color: #aaaaaa;">(The post was originated from my talk for <a href="http://wimlds.org/chapters/about-bay-area/">WiMLDS x Fintech meetup</a> hosted by <a href="www.affirm.com">Affirm</a>.)</span></p>
<p>I believe many of you have watched or heard of the <a href="https://youtu.be/vFr3K2DORc8">games</a> between AlphaGo and professional Go player <a href="https://en.wikipedia.org/wiki/Lee_Sedol">Lee Sedol</a> in 2016. Lee holds the highest rank of nine dan and has won many world championships. No doubt, he is one of the best Go players in the world, but he <a href="https://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/">lost 1-4</a> in this series versus AlphaGo. Before this, Go was considered an intractable game for computers to master, as its simple rules lay out an exponential number of variations in board positions, many more than in Chess. This event surely highlighted 2016 as a big year for AI; because of AlphaGo, much attention has been drawn to the progress of AI.</p></description>
</item>
<item>
<title>FAQ</title>
<link>https://lilianweng.github.io/faq/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>https://lilianweng.github.io/faq/</guid>
<description></description>
</item>
</channel>
</rss>