- Chatbot Arena Leaderboard
- Open LLM Leaderboard
- Chain-of-Thought Hub: Measuring LLMs' Reasoning Performance, Existing results strongly suggest that if RLHF is done right on LLaMA, it may be close to ChatGPT-3.5
- AlpacaEval Leaderboard
- Beyond Scale: The Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data, the Task2Vec diversity coefficient is the expected cosine distance $d$ between pairs of Task2Vec embeddings of batches: $\hat{div}(D)=E_{B_1,B_2}[d(\overrightarrow{f}_{B_1}, \overrightarrow{f}_{B_2})]$, where $\overrightarrow{f}_{B_i}$ is the Task2Vec embedding of a batch $B_i$, taken as the diagonal of the Fisher Information Matrix (FIM): $\hat{F}_B = E[\nabla_w \log\hat{p}_w(\hat{x}\mid x)\,\nabla_w \log\hat{p}_w(\hat{x}\mid x)^T]$, $\overrightarrow{f}_B=\mathrm{Diag}(\hat{F}_B)$.
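A minimal sketch of the coefficient, assuming the Task2Vec batch embeddings (FIM diagonals) are already computed; the random vectors below only stand in for real embeddings:

```python
import numpy as np
from itertools import combinations

def diversity_coefficient(batch_embeddings):
    """Expected cosine distance between pairs of Task2Vec batch embeddings."""
    dists = []
    for f1, f2 in combinations(batch_embeddings, 2):
        cos = np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))
        dists.append(1.0 - cos)  # cosine distance d = 1 - cosine similarity
    return float(np.mean(dists))

# toy usage: 8 batches embedded into a 1000-dim FIM-diagonal space
rng = np.random.default_rng(0)
print(diversity_coefficient([rng.random(1000) for _ in range(8)]))
```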
- What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning, Task recognition (TR) captures the extent to which LLMs can recognize a task through demonstrations -- even without ground-truth labels -- and apply their pre-trained priors, whereas task learning (TL) is the ability to capture new input-label mappings unseen in pre-training.
- Faith and Fate: Limits of Transformers on Compositionality, We propose two hypotheses. First, Transformers solve compositional tasks by reducing multi-step compositional reasoning into linearized path matching. Second, due to error propagation, Transformers may have inherent limitations on solving high-complexity compositional tasks that exhibit novel patterns.
- Is in-context learning recall, query, and inference?
- Or a true learning algorithm?
- What learning algorithm is in-context learning? Investigations with linear models, transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context
- Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers, Transformer attention has a dual form of gradient descent.
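A toy numerical check of that dual form in the simplified linear-attention case (no softmax or scaling): attending over the demonstrations is the same as applying an implicit weight update $\Delta W = VK^T$ to the zero-shot weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W0 = rng.normal(size=(d, d))   # "zero-shot" weights
K = rng.normal(size=(d, 3))    # keys of 3 in-context demonstrations
V = rng.normal(size=(d, 3))    # values of 3 in-context demonstrations
q = rng.normal(size=d)         # query token

# linear attention over [demonstrations, query]
attn_out = W0 @ q + V @ (K.T @ q)

# identical to an explicit weight update dW = sum_i v_i k_i^T,
# i.e., the demonstrations act like a gradient-descent step applied to W0
dW = V @ K.T
assert np.allclose(attn_out, (W0 + dW) @ q)
```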
- Understanding In-Context Learning via Supportive Pretraining Data, use the similarity between gradients $\nabla_\theta L_\theta^{PT}(w)$ and $\nabla_\theta L_\theta^{ICL}(x,y)$ iteratively to find supportive pretraining data. Comparing the supportive subset contrastively with random subsets of pretraining data shows: (1) supportive pretraining data for ICL do not have higher domain relevance to downstream tasks; (2) they have a higher mass of rarely occurring, long-tail tokens; (3) they are challenging examples where the information gain from long-range context is below average, indicating that learning to incorporate difficult long-range context encourages ICL.
- Lost in the Middle: How Language Models Use Long Contexts, performance is often highest when relevant information occurs at the beginning or end of the input context, and degrades significantly when models must access relevant information in the middle of long contexts.
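A sketch of the position sweep behind the lost-in-the-middle finding; `ask_llm` is a hypothetical completion callback, not a real API:

```python
def accuracy_by_position(ask_llm, question, gold_doc, distractors, answer):
    """Move the gold document through the context, measure accuracy per slot."""
    scores = {}
    for pos in range(len(distractors) + 1):
        docs = distractors[:pos] + [gold_doc] + distractors[pos:]
        context = "\n\n".join(docs)
        prediction = ask_llm(f"{context}\n\nQuestion: {question}\nAnswer:")
        scores[pos] = int(answer.lower() in prediction.lower())
    return scores  # expect a U-shape: best at the ends, worst in the middle
```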
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Self-Consistency Improves Chain of Thought Reasoning in Language Models, a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths.
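A minimal self-consistency sketch; `sample_cot` and `extract_answer` are hypothetical stand-ins for a temperature-sampled LLM call and an answer parser:

```python
from collections import Counter

def self_consistent_answer(sample_cot, extract_answer, prompt, k=16):
    """Sample k diverse reasoning paths, then majority-vote the final answers
    (marginalizing out the reasoning paths)."""
    answers = [extract_answer(sample_cot(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```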
- ReAct: Synergizing Reasoning and Acting in Language Models, use LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information.
- Large Language Models Are Human-Level Prompt Engineers, Automatic Prompt Engineer (APE): 1) use an LLM to sample instruction proposals, 2) evaluate scores on a subset of the dataset, 3) keep the top-k instructions with the highest scores, 4) update the instructions, 5) go to 2).
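A rough sketch of that loop, assuming hypothetical `llm_propose` (samples an instruction, optionally seeded by a survivor) and `score` (evaluates it on a dataset subset) callbacks:

```python
def ape(llm_propose, score, k=5, rounds=3, n_candidates=20):
    """Automatic Prompt Engineer: propose, score, keep top-k, resample."""
    candidates = [llm_propose(None) for _ in range(n_candidates)]  # step 1
    for _ in range(rounds):
        top = sorted(candidates, key=score, reverse=True)[:k]      # steps 2-3
        # step 4: resample variants around the survivors, then re-score
        candidates = top + [llm_propose(seed) for seed in top]
    return max(candidates, key=score)
```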
- Toolformer: Language Models Can Teach Themselves to Use Tools, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
- Reflexion: an autonomous agent with dynamic memory and self-reflection, an approach that endows an agent with dynamic memory and self-reflection capabilities to enhance its existing reasoning trace and task-specific action choice abilities.
- Progressive-Hint Prompting Improves Reasoning in Large Language Models, a new prompting method, named Progressive-Hint Prompting (PHP), that enables automatic multiple interactions between users and LLMs by using previously generated answers as hints to progressively guide toward the correct answers.
- Automatic Prompt Optimization with "Gradient Descent" and Beam Search, textual gradient descent
- Teaching Large Language Models to Self-Debug, SELF-DEBUGGING with Simple Feedback, with Unit Tests and via Code Explanation.
- Natural Language to Code Translation with Execution, introduces execution result-based minimum Bayes risk decoding (MBR-EXEC) for program selection; the Bayes risk of a program is the sum of the losses between its execution results and those of the other candidates.
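A minimal MBR-EXEC sketch using a 0/1 loss over execution results; `execute` is a hypothetical sandboxed runner:

```python
def mbr_exec(programs, test_inputs, execute):
    """Return the index of the candidate whose execution results agree most
    with the other candidates (minimum Bayes risk)."""
    results = [[execute(p, x) for x in test_inputs] for p in programs]

    def risk(i):
        # Bayes risk: summed 0/1 disagreement with every other candidate
        return sum(
            sum(a != b for a, b in zip(results[i], results[j]))
            for j in range(len(programs)) if j != i
        )

    return min(range(len(programs)), key=risk)
```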
- LETI: Learning to Generate from Textual Interactions, LMs' potential to learn from textual interactions (LeTI) that not only check their correctness with binary labels, but also pinpoint and explain errors in their outputs through textual feedback.
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, introduce a framework called CRITIC that allows LLMs, which are essentially "black boxes", to validate and progressively amend their own outputs in a manner similar to human interaction with tools.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Large Language Model Guided Tree-of-Thought
- ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models, Given a question, Planner composes a comprehensive blueprint of interlinked plans prior to tool response. The blueprint instructs Worker to use external tools and collect evidence. Finally, plans and evidence are paired and fed to Solver for the answer.
- Introspective Tips: Large Language Model for In-Context Decision Making
- Grammar Prompting for Domain-Specific Language Generation with Large Language Models
- Deliberate then Generate: Enhanced Prompting Framework for Text Generation, a prompt of the form: the already [DOSOMETHING] is [INCORRECT CONTENT]; please first detect the error type, and then provide the refined informal sentence.
- Deductive Verification of Chain-of-Thought Reasoning, the Natural Program format allows individual reasoning steps and their corresponding minimal sets of premises to be easily extracted; this Natural Program-based deductive verification approach identifies and eliminates reasoning chains that contain errors in reasoning and grounding.
- PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts, given an LLM $f_{\theta}$, a dataset $D$, and a clean prompt $P$, the objective of a prompt attack is $\arg\max_{\delta \in C}E_{(x,y) \in D}\,\mathcal{L}[f_{\theta}([P+\delta, x]), y]$; attack levels: character, word, sentence, semantic (a toy attack-loop sketch follows the next entry).
- Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting, XLT is a generic template prompt that stimulates cross-lingual and logical reasoning skills to enhance task performance across languages.
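A toy character-level attack loop for the objective above; `loss_fn` is a hypothetical callback that runs the LLM over the dataset with the given prompt and returns the average loss:

```python
import random

def random_char_attack(prompt, loss_fn, n_trials=200,
                       charset="abcdefghijklmnopqrstuvwxyz "):
    """Greedy random search: keep any character perturbation that increases
    the task loss (a crude stand-in for the paper's attack suite)."""
    best, best_loss = prompt, loss_fn(prompt)
    for _ in range(n_trials):
        i = random.randrange(len(best))
        cand = best[:i] + random.choice(charset) + best[i + 1:]
        cand_loss = loss_fn(cand)
        if cand_loss > best_loss:  # maximize loss within the perturbation budget
            best, best_loss = cand, cand_loss
    return best
```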
- Demystifying GPT Self-Repair for Code Generation
- Language Models are Weak Learners, summaries of a collection of examples (hypothesis, prompt) + LLM = weak model(s); hierarchical agglomerative clustering, AdaBoost.
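A generic AdaBoost loop of the kind the paper builds on; the prompted-LLM weak learner is abstracted into a `fit_weak(X, y, w)` callback that returns a classifier `h` with outputs in {-1, +1}:

```python
import numpy as np

def adaboost(X, y, fit_weak, n_rounds=10):
    """Boost weak learners (here: any callback, e.g. an LLM prompted with
    weighted example summaries) into a stronger ensemble."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        h = fit_weak(X, y, w)
        pred = h(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 0.5:  # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)  # upweight the mistakes
        w /= w.sum()
        ensemble.append((alpha, h))
    return lambda X_: np.sign(sum(a * h(X_) for a, h in ensemble))
```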
- Supervised Pretraining Can Learn In-Context Reinforcement Learning
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
- Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration, Dissociative Identity Disorder.
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- Large Language Models as Tool Makers, a dispatcher, a tool maker and a tool user
- Gorilla: Large Language Model Connected with Massive APIs, LLaMA fine-tuned (SFT) on GPT-4-generated instruction data over massive APIs.
- OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities, The framework involves approximating different cognitive modules, including attention, memory, reasoning, learning, and corresponding scheduling and decision-making mechanisms.
- Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators, we enhance LLMs with three core capabilities, i.e., feasibility prediction, completeness verification and security protection.
- Natural Language Commanding via Program Synthesis, Office Domain Specific Language, an analysis-retrieval prompt engineering framework.
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models, a set of virtual APIs, a state machine-based planner (Observing, Proposing, Revising, and Acting).
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow, explore data by self-request; interface definition, merging, implementation; intent analysis, planning workflow, multi-form output
- WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences, 1) coarse-grained web search and fine-grained LLM-augmented dense retrieval (fine-tuned Contriever); 2) bootstrapped generator: the WebGLM-QA dataset (LLM in-context learning, bootstrapping, correction and filtering) used to fine-tune GLM; 3) a human preference-aware scorer built from massive user feedback (e.g., thumbs-ups) on online QA forums (high-quality feedback, length-bias mitigation, contrast augmentation), SFT, comparison training, 6B GLM.
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
- Large Language Models for Supply Chain Optimization
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, despite the small size of the models, we still observe the emergence of reasoning capabilities, knowledge of general facts, and the ability to follow certain instructions.
- QLoRA: Efficient Finetuning of Quantized LLMs, QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) Paged Optimizers to manage memory spikes.
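A simplified sketch of the NF4 idea: quantization levels taken as normalized quantiles of a standard normal, plus per-block absmax scaling (the paper's exact construction differs slightly, e.g. it uses asymmetric halves so that zero is exactly representable):

```python
import numpy as np
from scipy.stats import norm

probs = np.linspace(0.5 / 16, 1 - 0.5 / 16, 16)  # 16 evenly spaced quantiles
levels = norm.ppf(probs)
levels /= np.abs(levels).max()                   # normalize into [-1, 1]

def nf4_quantize(block):
    scale = np.abs(block).max()                  # per-block absmax scaling
    idx = np.abs(levels[None, :] - (block / scale)[:, None]).argmin(axis=1)
    return idx.astype(np.uint8), scale           # 4-bit codes + one fp scale

def nf4_dequantize(idx, scale):
    return levels[idx] * scale

w = np.random.randn(64)                          # one 64-weight block
codes, s = nf4_quantize(w)
print(np.abs(w - nf4_dequantize(codes, s)).mean())
```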
- Let's Verify Step by Step, process supervision (step-level feedback) outperforms outcome supervision (final-answer feedback); loosely, KPI vs. OKR.
- LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion, employ N different LLMs to get output candidates, pair all candidates and concatenate each pair with the input before feeding them to PAIRRANKER, then rank all candidates and take the top K for generative fusion.
- Textbooks Are All You Need, textbook quality training data
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models, Knowledge Probing via Negations: is->is not; Toxicity: F-bombing; Context (Long Range) Sensitivity: replacing the first two sentences with two random sentences from the corpus; Word Order: two random words are swapped in each sentence; Tokenization Sensitivity: randomly chop strings of raw input text.
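A sketch of the word-order probe, assuming sensitivity is then measured by comparing the model's scores (e.g., log-likelihood) on original vs. perturbed text:

```python
import random

def swap_two_words(sentence, rng=random):
    """Perturb word order by swapping two random words."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(swap_two_words("the quick brown fox jumps over the lazy dog"))
```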
- An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
- On the Exploitability of Instruction Tuning, content injection: prepend "Answer the following question and include [a key phrase] in your answer:"; over-refusal: "Tell me why you cannot answer the following question:"
- WizardLM: Empowering Large Language Models to Follow Complex Instructions, an LLM uses prompts to evolve instructions, with two types: in-depth evolving and in-breadth evolving; the same LLM is used to generate the corresponding responses for the evolved instructions.
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct
- Training Transformers with 4-bit Integers
- GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models, $L_{GKD}(\theta):=(1-\lambda)E_{(x,y)\sim(X,Y)}[\mathcal{D}(p_T\parallel p^{\theta}_S)(y|x)]+\lambda E_{x \sim X}\big[E_{y\sim p_S(\cdot|x)}[\mathcal{D}(p_T\parallel p^{\theta}_S)(y|x)]\big]$
- when approximating $P(\mathcal{C})$ using a parameterized distribution $Q_\theta(\mathcal{C})$, minimizing the forward and reverse KL under model under-specification results in mean-seeking and mode-seeking behavior, respectively: $D_{KL}(P\parallel Q)=\displaystyle \sum_{c \in \mathcal{C}}P(c)\log \frac{P(c)}{Q(c)}$, $D_{RKL}(P\parallel Q):=D_{KL}(Q\parallel P)$
- $D_{JSD[\beta]}(P\parallel Q)=\beta D_{KL}(P\parallel \beta P + (1-\beta)Q)+(1-\beta)D_{KL}(Q\parallel \beta P + (1-\beta)Q)$
- Supervised FT; Supervised KD; On-policy KD
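A minimal sketch of the $\beta$-interpolated JSD above, usable as the divergence $\mathcal{D}$ in the GKD objective:

```python
import torch

def generalized_jsd(logits_t, logits_s, beta=0.5):
    """Per-token D_{JSD[beta]}(P_T || P_S), matching the formula above."""
    p = logits_t.softmax(-1)                  # teacher token distribution
    q = logits_s.softmax(-1)                  # student token distribution
    m = beta * p + (1 - beta) * q             # beta-mixture
    kl = lambda a, b: (a * (a.clamp_min(1e-10).log()
                            - b.clamp_min(1e-10).log())).sum(-1)
    return beta * kl(p, m) + (1 - beta) * kl(q, m)

# GKD mixes this divergence over ground-truth sequences (weight 1 - lambda)
# and over sequences sampled from the student itself (weight lambda).
t = torch.randn(2, 5, 100)  # (batch, seq, vocab) teacher logits
s = torch.randn(2, 5, 100)  # student logits
print(generalized_jsd(t, s).mean())
```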
- Improving Language Plasticity via Pretraining with Active Forgetting, we introduce a simple active forgetting mechanism that resets the token embeddings at regular intervals, while leaving all other parameters untouched throughout pretraining.
- LangChain, in-context learning, prompt template, chain of thought, toolformer, ReAct, ToT
- LangFlow
- Flowise
- Chat UI, A chat interface using open source models, e.g., OpenAssistant.
- MOSS, An open-source tool-augmented conversational language model from Fudan University
- LLaMA
- Chinese-LLaMA-Alpaca
- Lit-LLaMA
- OpenLLaMA
- MLC LLM, MLC LLM is a universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications, plus a productive framework for everyone to further optimize model performance for their own use cases.
- GPT4ALL, Open-source assistant-style large language models that run locally on your CPU.
- Falcon
- Robin
- WizardLM
- ChatGLM2-6B
- Chinese legal large model (中文法律大模型)
- CodeGen
- Baichuan (百川)