Welcome to the SCIR (赛尔) Language Analysis (LA) group!
This guide aims to help you quickly pick up the fundamentals of NLP and form a general picture of LA-related research directions.
The guide is organized into the following four parts:
- Part 1: NLP basics. Before starting this part, you should already be comfortable with calculus, linear algebra, and probability theory. Following the historical development of NLP, it introduces general-purpose methods such as word embeddings, neural networks, and pre-trained models, providing a solid foundation for reading and reproducing the papers that follow.
- Part 2: Classic LA papers. This part collects classic NLP and LA papers recommended by the lab's faculty and senior students. Reading the original papers will give you a deeper, more detailed understanding of the classic models and methods, along with a general sense of possible future research directions.
- Part 3: Model building. This part provides an annotated code framework for learning how a deep-learning-based NLP model is organized, and how it handles data processing, training, and prediction. After finishing it, you should have a clear idea of how to implement a model yourself.
- Part 4: Hands-on project. Completing the tasks in this part verifies that you have a basic grasp of designing and implementing models.
- Other: some extracurricular learning resources for reference.
Please make good use of search engines as you study, and feel free to share comments and suggestions.
Happy studying!
Part 1: NLP Basics
- CS224n
- 《自然语言处理:基于预训练模型的方法》 (Natural Language Processing: Methods Based on Pre-trained Models)
- Complete the exercises at the end of each chapter
Part 2: Classic Papers
- For each paper, the information listed is, in order: authors, venue (conference/journal), and why it is recommended
- LA Group
- From static to dynamic word representations: a survey
- Yuxuan Wang, Yutai Hou, Wanxiang Che, Ting Liu
- IJMLC 2020
- A good overview of static word embeddings, dynamic word embeddings, and their evaluation and applications
- A Survey on Spoken Language Understanding: Recent Advances and New Frontiers
- Libo Qin, Tianbao Xie, Wanxiang Che, Ting Liu
- IJCAI 2021 (Survey Track)
- The first systematic survey of SLU in task-oriented dialogue systems
- A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding
- Libo Qin, Wanxiang Che, Yangming Li, Haoyang Wen, Ting Liu
- EMNLP 2019
- A highly effective joint model for task-oriented dialogue
- Knowledge Graph Grounded Goal Planning for Open-Domain Conversation Generation
- Jun Xu, Haifeng Wang, Zhengyu Niu, Hua Wu, Wanxiang Che
- AAAI 2020
- Proposes a framework and concrete methods for modeling proactive knowledge-grounded conversation
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP
- Libo Qin, Minheng Ni, Yue Zhang, Wanxiang Che
- IJCAI 2020
- The first work to build code-switching data from dictionaries to align multilingual representation spaces; followed up by Google, Microsoft, Facebook, and other major companies, with the authors invited by Google Brain and others to contribute core code
- Consistency Regularization for Cross-Lingual Fine-Tuning
- Bo Zheng, Li Dong, Shaohan Huang, Wenhui Wang, Zewen Chi, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, Furu Wei
- ACL 2021
- A fairly comprehensive, milestone result in data augmentation: consistency modeling at both the sample and model levels, with experiments on several common augmentation schemes and significant gains
- Sequence-to-Sequence Data Augmentation for Dialogue Language Understanding
- Yutai Hou, Yijia Liu, Wanxiang Che, Ting Liu
- COLING 2018
- A generation-based data augmentation approach for dialogue language understanding
- Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network
- Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, Ting Liu
- ACL 2020
- A representative few-shot learning work on slot tagging for dialogue
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
- Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
- ACL 2021
- Jointly models three modalities (text, image, and layout), achieving SOTA results on multiple document understanding datasets
- A Distributed Representation-Based Framework for Cross-Lingual Transfer Parsing
- Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, Ting Liu
- JAIR 2016
- Covers the main approaches to cross-lingual transfer; very helpful for getting to know the field
- Non-LA Group
- Learning Method
- Matching Networks for One Shot Learning
- Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra
- NIPS 2016
- One of the pioneering works that set off the "few-shot craze" in AI, and one of the most classic metric-learning-based meta-learning papers
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
- Chelsea Finn, Pieter Abbeel, Sergey Levine
- ICML 2017
- The most classic optimization-based meta-learning paper
- Learning Active Learning from Data
- Ksenia Konyushkova
- NIPS 2017
- A representative work on active learning
- Neural Transfer Learning for Natural Language Processing
- Sebastian Ruder
- Ph.D. thesis 2019
- Sebastian Ruder's Ph.D. thesis; Chapter 3 analyzes and compares the various transfer learning approaches in NLP in detail
- A Simple Framework for Contrastive Learning of Visual Representations
- Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
- ICML 2020
- Useful not only in vision but in NLP as well; features learned with contrastive learning perform very well on downstream tasks (a minimal loss sketch follows this subsection)
- Confident Learning: Estimating Uncertainty in Dataset Labels
- Curtis G. Northcutt
- JAIR 2021
- Quite practical; validated on CV tasks
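To make the contrastive objective from SimCLR concrete, here is a minimal NT-Xent loss sketch in PyTorch (an illustration, not the authors' implementation; `z1` and `z2` are assumed to be the projected representations of two augmented views of the same batch):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent: each example's positive is the other view of the same item."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit vectors
    sim = z @ z.t() / temperature                       # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

In SimCLR the gains come largely from the augmentation pipeline and the projection head; the loss itself is this simple.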
- Network Structure
- Auto-Encoding Variational Bayes
- Diederik P Kingma, Max Welling
- ICLR 2014
- Proposes the VAE; its use of variational inference for efficient latent-variable sampling became a classic technique, spawning a long line of latent-variable work on style transfer, controllable generation, representation learning, and more (see the reparameterization sketch after this subsection)
- Generative Adversarial Networks
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
- NIPS 2014
- The first GAN paper, laying the foundation for the whole family of adversarial generation methods
- Pointer Networks
- Oriol Vinyals, Meire Fortunato, Navdeep Jaitly
- NIPS 2015
- Pointer Network is a classic seq2seq model; its mechanism for copying from the input to the output became a standard operation
- Graph Attention Networks
- Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio
- ICLR 2018
- Yoshua Bengio's entry into graph neural networks
- How Powerful are Graph Neural Networks?
- Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka
- ICLR 2019
- Prompts reflection on what makes graph neural networks powerful
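As a concrete aid for the VAE entry above, a minimal sketch of the reparameterization trick and the Gaussian KL term (assuming a diagonal-Gaussian posterior; `mu` and `logvar` would come from an encoder network):

```python
import torch

def reparameterize(mu, logvar):
    """Draw z ~ N(mu, sigma^2) while keeping gradients w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)  # randomness isolated in a parameter-free noise term
    return mu + eps * std

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), the regularizer in the VAE objective."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```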
- Knowledge Distillation
- Distilling the Knowledge in a Neural Network
- Geoffrey Hinton, Oriol Vinyals, Jeff Dean
- NIPS 2014
- The seminal work on knowledge distillation (see the loss sketch after this subsection)
- Sequence-Level Knowledge Distillation
- Yoon Kim, Alexander M. Rush
- EMNLP 2016
- Successfully applies knowledge distillation to NMT
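A minimal sketch of the Hinton-style distillation objective described above (a classification setting is assumed; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix the teacher's softened distribution with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```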
- Cross-Lingual
- Word Translation Without Parallel Data
- Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou
- ICLR 2018
- Widely used unsupervised cross-lingual word embeddings
- Unsupervised Cross-lingual Representation Learning at Scale
- Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
- ACL 2020
- Proposes XLM-R, bringing cross-lingual pre-trained models close to monolingual performance
- Multilingual Denoising Pre-training for Neural Machine Translation
- Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer
- TACL 2020
- One of the representative works on multilingual pre-training; it also performs well on a range of dialogue tasks
- Representation Learning (Pre-Trained Models)
- A Neural Probabilistic Language Model
- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin
- JMLR 2003
- A look back at Bengio's classic paper: how to train word embeddings with a neural network
- Efficient Estimation of Word Representations in Vector Space
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
- ICLR 2013
- Word2Vec (a skip-gram sketch follows this subsection)
- Distributed Representations of Words and Phrases and their Compositionality
- Tomas Mikolov, Ilya Sutskever, Kai Chen
- NIPS 2013
- After this paper, word embeddings truly became a practical, real-world tool
- GloVe: Global Vectors for Word Representation
- Jeffrey Pennington, Richard Socher, Christopher D. Manning
- EMNLP 2014
- Word embeddings built from both global co-occurrence statistics and local context information
- Generating Sentences from a Continuous Space
- Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio
- CoNLL 2016
- Successfully applies the VAE to modeling natural language
- Deep contextualized word representations
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
- NAACL 2018
- NAACL 2018 Best Paper; introduces ELMo
- Language Models as Knowledge Bases?
- Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel
- EMNLP 2019
- On the knowledge stored in pre-trained language models
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- NAACL 2019
- Established pre-training + fine-tuning as the new paradigm in NLP
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
- NIPS 2019
- A brainstorm on pre-training methods
- Language Models are Unsupervised Multitask Learners
- Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
- OpenAI Technical Report 2019
- GPT-2
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
- ICLR 2020
- Cleverly designs a binary classification loss based on a generator-discriminator setup, substantially improving both the convergence speed and the final performance of pre-training
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
- JMLR 2020
- Proposes T5, casting every NLP problem as text-to-text; the paper documents the rationale behind its design decisions in detail
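For intuition about the word2vec papers above, a minimal PyTorch sketch of skip-gram with negative sampling (a simplification of the original C implementation; drawing `negatives` from the unigram distribution is assumed to happen elsewhere):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Skip-gram with negative sampling: pull (center, context) pairs together,
    push sampled negatives apart."""
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, dim)  # context-word vectors

    def forward(self, center, context, negatives):
        v = self.in_embed(center)                       # (B, d)
        u_pos = self.out_embed(context)                 # (B, d)
        u_neg = self.out_embed(negatives)               # (B, K, d)
        pos = (v * u_pos).sum(-1)                                  # (B,)
        neg = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)        # (B, K)
        # Maximize log sigma(pos) + sum log sigma(-neg)
        return (-F.logsigmoid(pos) - F.logsigmoid(-neg).sum(-1)).mean()
```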
- Prompt-based Learning
- Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference
- Timo Schick, Hinrich Schütze
- EACL 2021
- Proposes PET, a cloze-style prompt-based fine-tuning method (see the fill-mask sketch after this subsection)
- It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
- Timo Schick, Hinrich Schütze
- NAACL 2021
- The first demonstration of the power of prompt-based methods
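A toy illustration of the cloze-style prompting idea behind PET, using the Hugging Face `fill-mask` pipeline (the template and verbalizer here are illustrative choices, and PET itself additionally fine-tunes the model rather than only querying it):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was predictable and the acting felt flat."
prompt = f"{review} All in all, it was [MASK]."

# Verbalizer: map predicted filler words back to class labels.
verbalizer = {"great": "positive", "good": "positive",
              "bad": "negative", "terrible": "negative"}
for pred in unmasker(prompt, top_k=10):
    word = pred["token_str"].strip()
    if word in verbalizer:
        print(f"{word} -> {verbalizer[word]} (score={pred['score']:.3f})")
        break
```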
- Multimodal Machine Learning (MMML)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
- NIPS 2019
- A classic two-stream model in the vision-and-language field
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
- Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
- ECCV 2020
- Proposes Object-Semantics Aligned Pre-training, which uses detected objects as anchor points between visual and linguistic semantics to simplify learning image-text semantic alignment
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
- Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang
- ACL 2021
- Uses cross-modal contrastive learning, enhanced by text rewriting and text/image retrieval, to unify the visual and textual semantic spaces; takes multi-scenario, multimodal data as input and adapts well to both unimodal and multimodal understanding and generation tasks
- Lexical Analysis
- Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- John Lafferty, Andrew McCallum, Fernando Pereira
- ICML 2001
- The work that brought CRFs into NLP; its blend of theory and practice is well worth studying (a Viterbi decoding sketch follows this subsection)
- Neural Architectures for Named Entity Recognition
- Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer
- NAACL 2016
- A classic paper on neural sequence labeling; simple and easy to follow
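Neural sequence labelers usually pair the encoder with a CRF layer, decoded with the Viterbi algorithm; a minimal NumPy sketch (start/stop transition scores are omitted for brevity):

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, L) per-token tag scores; transitions[i, j]: score of tag i -> tag j."""
    T, L = emissions.shape
    score = emissions[0].copy()              # best score of a path ending in each tag
    back = np.zeros((T, L), dtype=int)       # backpointers
    for t in range(1, T):
        cand = score[:, None] + transitions  # cand[i, j]: extend a path ending in i with tag j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):            # follow backpointers to recover the path
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]
```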
- Syntactic Parsing
- A Fast and Accurate Dependency Parser using Neural Networks
- Danqi Chen, Christopher D. Manning
- EMNLP 2014
- The pioneering neural network parser; simple, accessible, and essential for getting started
- Transition-Based Dependency Parsing with Stack Long Short-Term Memory
- Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, Noah A. Smith
- ACL 2015
- Proposes the Stack-LSTM, which strongly influenced the development of transition-based dependency parsing
- Deep Biaffine Attention for Neural Dependency Parsing
- Timothy Dozat, Christopher D. Manning
- ICLR 2017
- The classic graph-based solution to dependency parsing; the biaffine model keeps demonstrating its power on other parsing tasks as well (see the scorer sketch after this subsection)
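The core of the biaffine parser above is a bilinear arc scorer; a minimal sketch of just that scoring step (the separate dep/head MLPs and the arc-label classifier from the paper are omitted):

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """scores[b, i, j] = score of an arc with dependent i and head j."""
    def __init__(self, dim):
        super().__init__()
        self.U = nn.Parameter(torch.empty(dim + 1, dim))  # extra row acts as a head-side bias
        nn.init.xavier_uniform_(self.U)

    def forward(self, h_dep, h_head):
        # h_dep, h_head: (B, N, d) token representations
        ones = torch.ones(*h_dep.shape[:-1], 1, device=h_dep.device)
        h_dep = torch.cat([h_dep, ones], dim=-1)           # (B, N, d+1)
        return h_dep @ self.U @ h_head.transpose(1, 2)     # (B, N, N)
```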
- Semantic Parsing
- Compositional Semantic Parsing on Semi-Structured Tables
- Panupong Pasupat, Percy Liang
- ACL 2015
- Introduces WikiTableQuestions, a classic dataset for table QA and semantic parsing
- A Syntactic Neural Model for General-Purpose Code Generation
- Pengcheng Yin, Graham Neubig
- ACL 2017
- The classic seq2tree paradigm for code generation: generating AST trees
- Coarse-to-Fine Decoding for Neural Semantic Parsing
- Li Dong, Mirella Lapata
- ACL 2018
- The classic seq2seq paradigm for code generation: coarse-grained first, then fine-grained decoding
- Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
- Emily M. Bender, Alexander Koller
- ACL 2020
- A reminder for NLP researchers to stay true to the field's original goals
- Grammatical Error Correction (GEC)
- Encode, Tag, Realize: High-Precision Text Editing
- Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, Aliaksei Severyn
- EMNLP 2019
- Creatively recasts generation as a text-editing task, well suited to GEC and similar tasks
- SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
- Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, Yuan Qi
- ACL 2020
- Uses a GCN to inject phonological and visual similarity information into BERT; a must-read for the CSC (Chinese Spelling Check) task
- Dialogue
- POMDP-based Statistical Spoken Dialogue Systems: a Review
- Steve Young, Milica Gasic, Blaise Thomson, Jason Williams
- Proceedings of the IEEE, 2013
- The seminal work on task-oriented dialogue
- MultiWOZ -- A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling
- Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, Milica Gašić
- EMNLP 2018
- The most influential dataset for task-oriented dialogue
- DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
- Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan
- ACL 2020
- A highly influential pre-trained model in the dialogue field
- Towards a Human-like Open-Domain Chatbot
- Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le
- arXiv preprint 2020
- A milestone for open-domain dialogue (the Meena chatbot)
- Task-Oriented Dialogue as Dataflow Synthesis
- The Microsoft Semantic Machines team
- TACL 2020
- A promising direction for semantic understanding in task-oriented dialogue
- Question Answering (QA)
- Bidirectional Attention Flow for Machine Comprehension
- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, Hannaneh Hajishirzi
- ICLR 2017
- A classic paper on extractive reading comprehension (e.g., SQuAD); many of its ideas are still in use today
- Neural Reading Comprehension and Beyond
- Danqi Chen
- Ph.D. thesis 2018
- Danqi Chen's Ph.D. thesis; recommended reading for anyone working on reading comprehension or open-domain QA
- Neural Machine Translation (NMT)
- Neural Machine Translation by Jointly Learning to Align and Translate
- Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
- ICLR 2015
- The work that opened the door for attention in NLP
- Neural Machine Translation of Rare Words with Subword Units
- Rico Sennrich, Barry Haddow, Alexandra Birch
- ACL 2016
- The classic approach (BPE) to the OOV problem; it has stood the test of time
- Attention Is All You Need
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- NIPS 2017
- Proposes the Transformer, an exceptionally effective feature extractor that underlies most subsequent pre-trained models (see the attention sketch after this subsection)
- Levenshtein Transformer
- Jiatao Gu, Changhan Wang, Jake Zhao
- NIPS 2019
- A classic model for non-autoregressive machine translation
- Vocabulary Learning via Optimal Transport for Machine Translation
- Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, Lei Li
- ACL 2021
- Resubmitted from ICLR to ACL 2021, where it won Best Paper; a lesson in how to revise a paper
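Since several entries above build on the Transformer, here is the scaled dot-product attention at its core, as a minimal single-head sketch (no linear projections or multi-head splitting):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # hide padded/future positions
    return F.softmax(scores, dim=-1) @ v
```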
- Sentiment Analysis
- Document Modeling with Gated Recurrent Neural Network for Sentiment Classification
- Duyu Tang, Bing Qin, Ting Liu
- EMNLP 2015
- A relatively classic paper on neural text classification; easy to understand and good for beginners
Part 3: Model Framework
- Task: sequence labeling
- Based on a BiLSTM
- Covers the complete pipeline: data processing, model construction, training, and testing
- Goal: get a general picture of a deep-learning-based NLP framework in preparation for Part 4 (a minimal sketch of such a model follows below)
See the corresponding directory for the detailed project content and requirements.
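For orientation before you open the annotated framework, here is a minimal sketch of what such a BiLSTM tagger boils down to (the dimensions and the dummy batch are illustrative; the real framework in the project directory handles vocabularies, padding, and evaluation properly):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Embed tokens -> BiLSTM -> per-token tag classifier."""
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=200, pad_idx=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):                # (B, T)
        h, _ = self.lstm(self.embed(token_ids))  # (B, T, hidden_dim)
        return self.out(h)                       # (B, T, num_tags)

# One training step on a dummy batch:
model = BiLSTMTagger(vocab_size=10000, num_tags=9)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 would mark padded positions

tokens = torch.randint(1, 10000, (2, 6))
tags = torch.randint(0, 9, (2, 6))
loss = loss_fn(model(tokens).reshape(-1, 9), tags.reshape(-1))
loss.backward(); optimizer.step(); optimizer.zero_grad()
```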
Other: Extracurricular Learning Resources
- Machine Learning Basics
- Andrew Ng (吴恩达)
- Some recommended blogs / GitHub repositories