
tldr-transformers

The "tl;dr" on a few notable papers on Transformers and modern NLP.

This is a living repo to keep tabs on different research threads.

Last Updated: September 20th, 2021.

Models: GPT-*, *BERT*, Adapter-*, *T5, Megatron, DALL-E, Codex, etc.

Topics: Transformer architectures + training; adversarial attacks; scaling laws; alignment; memorization; few labels; causality.

(Figure: BERT, T5, and Scaling Laws papers; art from the original papers.)

Each set of notes includes links to the paper, the original code implementation (if available) and the Huggingface 🤗 implementation.

Here are some examples → t5, byt5, deduping transformer training sets.
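
As a quick illustration of the Huggingface 🤗 links, below is a minimal sketch (not taken from the notes themselves) that loads one of the covered models, T5, through the `transformers` library; the `t5-small` checkpoint and the translation prompt are only assumptions chosen for the demo.

```python
# Minimal sketch: pull up one of the covered models (T5) via its Hugging Face
# implementation. Assumes `transformers`, `torch`, and `sentencepiece` are
# installed; "t5-small" is just a small checkpoint picked for the demo.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 frames every task as text-to-text, so the task is selected with a prefix.
inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```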

This repo also includes a big table quantifying the differences across transformer papers, all in one place.

The transformer papers are presented somewhat chronologically below. Go to the "👉 Notes 👈" column to find the notes for each paper.

Contents

  - Quick_Note
  - Motivation
  - Models
  - BigTable
  - Attacks
  - FineTune
  - Alignment
  - Scaling
  - Memorization
  - FewLabels
  - Contribute
  - Errata
  - Citation
  - License

Quick_Note

This is not an intro to deep learning in NLP. If you are looking for that, I recommend one of the following: Fast AI's course, one of the Coursera courses, or maybe this old thing. Come here after that.

Motivation

With the explosion in papers on all things Transformers the past few years, it seems useful to catalog the salient features/results/insights of each paper in a digestible format. Hence this repo.

Models

| Model | Year | Institute | Paper | 👉 Notes 👈 | Original Code | Huggingface 🤗 | Other Repo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer | 2017 | Google | Attention is All You Need | Skipped, too many good write-ups already exist | | | |
| GPT-2 | 2019 | OpenAI | Language Models are Unsupervised Multitask Learners | To-Do | X | X | |
| GPT-J-6B | 2021 | EleutherAI | GPT-J-6B: 6B Jax-Based Transformer (public GPT-3) | X | here | X | X |
| BERT | 2018 | Google | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | BERT notes | here | here | |
| DistilBERT | 2019 | Huggingface | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | DistilBERT notes | here | | |
| ALBERT | 2019 | Google/Toyota | ALBERT: A Lite BERT for Self-supervised Learning of Language Representations | ALBERT notes | here | here | |
| RoBERTa | 2019 | Facebook | RoBERTa: A Robustly Optimized BERT Pretraining Approach | RoBERTa notes | here | here | |
| BART | 2019 | Facebook | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | BART notes | here | here | |
| T5 | 2019 | Google | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | T5 notes | here | here | |
| Adapter-BERT | 2019 | Google | Parameter-Efficient Transfer Learning for NLP | Adapter-BERT notes | here | - | here |
| Megatron-LM | 2019 | NVIDIA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Megatron notes | here | - | here |
| Reformer | 2020 | Google | Reformer: The Efficient Transformer | Reformer notes | here | | |
| byT5 | 2021 | Google | ByT5: Towards a token-free future with pre-trained byte-to-byte models | ByT5 notes | here | here | |
| CLIP | 2021 | OpenAI | Learning Transferable Visual Models From Natural Language Supervision | CLIP notes | here | here | |
| DALL-E | 2021 | OpenAI | Zero-Shot Text-to-Image Generation | DALL-E notes | here | - | |
| Codex | 2021 | OpenAI | Evaluating Large Language Models Trained on Code | Codex notes | X | - | |

BigTable

All of the table summaries found above, collapsed into one really big table here.

Attacks

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
| --- | --- | --- | --- | --- |
| Gradient-based Adversarial Attacks against Text Transformers | 2021 | Facebook | Gradient-based attack notes | None |

FineTune

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
| --- | --- | --- | --- | --- |
| Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning | 2021 | Facebook | SCL notes | None |

Alignment

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
| --- | --- | --- | --- | --- |
| Fine-Tuning Language Models from Human Preferences | 2019 | OpenAI | Human pref notes | None |

Scaling

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
| --- | --- | --- | --- | --- |
| Scaling Laws for Neural Language Models | 2020 | OpenAI | Scaling laws notes | None |

Memorization

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
| --- | --- | --- | --- | --- |
| Extracting Training Data from Large Language Models | 2021 | Google et al. | To-Do | None |
| Deduplicating Training Data Makes Language Models Better | 2021 | Google et al. | Dedup notes | None |

FewLabels

| Paper | Year | Institute | 👉 Notes 👈 | Codes |
| --- | --- | --- | --- | --- |
| An Empirical Survey of Data Augmentation for Limited Data Learning in NLP | 2021 | GIT/UNC | To-Do | None |
| Learning with fewer labeled examples | 2021 | Kevin Murphy & Colin Raffel (Preprint: "Probabilistic Machine Learning", Chapter 19) | Worth a read, won't summarize here. | None |

Contribute

If you are interested in contributing to this repo, feel free to do the following:

  1. Fork the repo.
  2. Create a Draft PR with the paper of interest (to flag "in-flight" work and avoid duplication).
  3. Use the suggested template to write your "tl;dr". If it's an architecture paper, you may also want to add to the larger table here.
  4. Submit your PR.

Errata

Undoubtedly some of the information here is incorrect. If you spot an error, please open an Issue and point it out.

Citation

```bibtex
@misc{cliff-notes-transformers,
  author = {Thompson, Will},
  url = {https://github.com/will-thompson-k/cliff-notes-transformers},
  year = {2021}
}
```

For the notes above, I've linked the original papers.

License

MIT