index.json

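[{"content":"A recently popular approach that uses an alternative sampling method to improve LLM reasoning ability and reduce hallucination: Entropix\nBasic concepts\nMost popular LLM architectures today are based on the Transformer architecture. This architecture typically contains the following key parts:\nEmbedding Layer: converts the input tokens into vectors\nSelf-Attention Layers: let the network learn the relationships between all tokens in the input text\nFeed-Forward Layers: transform the output of the self-attention layers\nLayer Normalization: stabilizes training\n%%{init: {'theme': 'base', 'themeVariables': { 'fontFamily': 'arial'}}}%%\ngraph LR\nclassDef blue fill:#2374f7,stroke:#000,stroke-width:2px,color:#fff\nclassDef pink fill:#eb3dd6,stroke:#000,stroke-width:2px,color:#fff\nclassDef orange fill:#fc822b,stroke:#000,stroke-width:2px,color:#fff\nclassDef red fill:#ed2633,stroke:#000,stroke-width:2px,color:#fff\nclassDef green fill:#16b522,stroke:#000,stroke-width:2px,color:#fff\nA([Input]) --\u003e G(Embedding):::blue\nG --\u003e B(Self-Attention):::pink\nB --\u003e C(Layer Norm):::orange\nC --\u003e D(Feed Forward):::red\nD --\u003e E(Layer Norm):::green\nE --\u003e F([Output])\nHow an LLM generates text or a completion\nstep 1: Input processing. The input text is tokenized and then mapped into vector space by the embedding layer\nstep 2: Forward processing. After embedding, the input passes through self-attention, layer norm, feed-forward and so on, and the model finally produces logits for every possible next token\nstep 3: Sampling. This is the focus of the technique discussed in this article. The parameters that most commonly influence the sampling result today are temperature, top-p and top-k\nstep 4: Repeat. Once the next token has been sampled, it is appended to the end of the input text, so the input becomes input text + \"next token\", and the \"next next token\" is sampled\nThe role of logits\nLogits are not probabilities themselves: they are the unnormalized scores produced by the last layer, and the softmax function converts them into probabilities that sum to 1. That is,\n$ P(token_i) = \\frac{e^{logit_i}}{\\sum_j e^{logit_j}} $\nThe role of Self-Attention\nSelf-attention lets the LLM attend to the relationships between different parts of a text. Take the sentence \"Today I read three or four papers, all of them academic papers about LLM reasoning ability\": \"three or four papers\", \"reasoning ability\" and \"academic papers\" are closely related, and the self-attention mechanism lets the LLM attend to the relationships between these elements.\n💡 This suggests a hypothesis: if the relationships captured inside attention are complex (for example, the attention weights are widely spread), can that be read as a sign that the user's input text is complex (or that the task it describes is complex or hard)? This hypothesis is closely tied to the method discussed here.\nEntropy in language models\nEntropy\nEntropy is defined as\n$$ H = -\\sum_{i} p_i \\log_2(p_i) $$\n$p_i$ is the probability of the $i$-th token. Consider a case where the user types “我要去读论” (the beginning of “我要去读论文”, \"I am going to read a paper\") and the vocabulary size is 32000.\nIf the model predicts the next token “文” with probability 1 and gives each of the other 31999 tokens probability 0, the entropy is 0.\n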
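If the model predicts “文” with probability 0.1 and spreads the rest over the other 31999 tokens, for example $[0.01, 0.2, 0.0043, \\cdots, 0.21]$, the entropy computed from the definition is fairly large.\n💡 Intuition:\nLow entropy means one particular next token has a very high probability and all other tokens have very small probabilities, so the model is confident about its prediction.\nHigh entropy means several candidate tokens have comparably high probabilities, so the model is not confident about its prediction.\nVarentropy (variance of entropy)\nIntuitively, varentropy measures how much the surprisal ($-\\log_2 p_i$) of the candidate tokens at one position varies around the entropy.\nIt is computed as follows for the current position (“我要去读论” + current position):\nCompute the token probabilities at that position (the softmax of the logits) and the log probabilities\nCompute the entropy\nCompute the squared difference between each negative log probability and the entropy, then take the probability-weighted sum\n💡 Varentropy can be read as how unstable the model's uncertainty about the current token is: the larger the varentropy, the less consistent the model's prediction.\nThe code for entropy and varentropy is shown below (from the official repository; the imports are added here so the snippet stands alone):\nfrom typing import Tuple\n\nimport jax\nimport jax.numpy as jnp\n\nLN_2 = 0.69314718056  # ln(2) = 1.0 / LOG2_E\n\n@jax.jit\ndef calculate_varentropy_logsoftmax(logits: jnp.ndarray, axis: int = -1) -\u003e Tuple[jnp.ndarray, jnp.ndarray]:\n    \"\"\"Calculate the entropy and varentropy of the probability distribution using logsoftmax.\"\"\"\n    log_probs = jax.nn.log_softmax(logits, axis=axis)\n    probs = jnp.exp(log_probs)\n    entropy = -jnp.sum(probs * log_probs, axis=axis) / LN_2  # Convert to base-2\n    varentropy = jnp.sum(probs * (log_probs / LN_2 + entropy[..., None])**2, axis=axis)\n    return entropy, varentropy\nSummary of the four cases:\nLow Entropy, Low Varentropy: the model is highly confident and consistent about the next token. Greedy sampling is probably appropriate in this regime.\nHigh Entropy, Low Varentropy: the model is consistently uncertain about the next token. Clarification insertion or increased exploration is probably appropriate in this regime.\nLow Entropy, High Varentropy: the model has several different levels of confidence about the next token. Exploration sampling is a good fit in this regime.\n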
High Entropy, High Varentropy: the model is both uncertain and inconsistent about the next token. Because the uncertainty itself is so inconsistent in this regime, sampling parameters such as top-p, temperature and top-k may need to be adjusted.\nAttention in language models\nInside Self-Attention\n%%{init: {'theme':'base'}}%%\ngraph TD\nclassDef blue fill:#2374f7,stroke:#000,stroke-width:2px,color:#fff\nclassDef pink fill:#eb3dd6,stroke:#000,stroke-width:2px,color:#fff\nclassDef orange fill:#fc822b,stroke:#000,stroke-width:2px,color:#fff\nclassDef red fill:#ed2633,stroke:#000,stroke-width:2px,color:#fff\nclassDef green fill:#16b522,stroke:#000,stroke-width:2px,color:#fff\nA([Input Tokens]) --\u003e B[Multi-Head Attention]:::blue\nB --\u003e C[Attention Head 1]:::pink\nB --\u003e D[Attention Head 2]:::pink\nB --\u003e E[...]:::pink\nB --\u003e F[Attention Head N]:::pink\nC --\u003e G[Concatenate \u0026 Linear Transform]:::orange\nD --\u003e G\nE --\u003e G\nF --\u003e G\nG --\u003e H([Output])\nSelf-attention uses multiple heads, so two situations are worth considering:\nWithin a single attention head:\nIf the entropy of the attention weights is high, the model is attending to many different tokens\nIf the entropy of the attention weights is low, the model is attending to only a few specific tokens\nAcross multiple attention heads:\nIf the attention weights of the heads are very similar, the heads are all attending to the same few tokens\nIf the attention weights of the heads differ a lot, the heads are attending to different tokens\nThis leads to two concepts: Attention Entropy and Attention Agreement\nAttention Entropy:\nattention_probs = jax.nn.softmax(attention_scores, axis=-1)\nattn_entropy = -jnp.sum(attention_probs * jnp.log2(jnp.clip(attention_probs, 1e-10, 1.0)), axis=-1)\nAttention Agreement:\nmean_attention = jnp.mean(attention_probs, axis=1)\nagreement = jnp.mean(jnp.abs(attention_probs - mean_attention[:, None, :]), axis=(1, 2))\n💡 High attention entropy increases the role of exploration in sampling\nLow attention agreement calls for adjusting the top-p, temperature and top-k parameters\nSelf-Attention across layers: Interaction Strength\nInteraction Strength is defined as the mean of the absolute values of the attention scores over all layers (note: all of them)\n$$ \\text{Interaction Strength} = \\frac{1}{L \\cdot H \\cdot N^2} \\sum_{l=1}^L \\sum_{h=1}^H \\sum_{i=1}^N \\sum_{j=1}^N |A_{l,h,i,j}| $$\n$L$: number of layers\n$H$: number of attention heads\n$N$: length of the input text after tokenization\n$A_{l,h,i,j}$: in layer $l$ and attention head $h$, for position $i$ and position $j$,
the attention score between those positions.\n💡 Intuition: the higher the interaction strength, the stronger the relationships within the text, which suggests the sampling strategy should be adjusted accordingly.\nSteps to compute the interaction strength:\nStep 1: collect all attention scores (note: all of them)\nStep 2: take the absolute value of every attention score\nStep 3: compute the mean\nCode:\ninteraction_strength = jnp.mean(jnp.abs(attention_scores), axis=(1, 2, 3))\nAdjusting the sampling strategy\nBefore introducing the final sampling strategy, recall the metrics from the previous sections:\nlogits entropy: entropy of the predicted next-token distribution\nvarentropy (variance of logits entropy): variance of the entropy\nattention entropy: entropy of the attention scores\nattention agreement: consistency of attention across the attention heads\ninteraction strength: mean absolute attention score over all layers\nThe rough adjustments are shown in the diagram below\n%%{init: {'theme': 'base', 'themeVariables': { 'fontFamily': 'arial'}}}%%\ngraph LR\nclassDef blue fill:#2374f7,stroke:#000,stroke-width:2px,color:#fff\nclassDef pink fill:#eb3dd6,stroke:#000,stroke-width:2px,color:#fff\nclassDef orange fill:#fc822b,stroke:#000,stroke-width:2px,color:#fff\nclassDef red fill:#ed2633,stroke:#000,stroke-width:2px,color:#fff\nclassDef green fill:#16b522,stroke:#000,stroke-width:2px,color:#fff\nsubgraph Metrics\nsubgraph Uncertainty\nLU[Logits Uncertainty]:::blue\nAU[Attention Uncertainty]:::blue\nend\nA[Agreement]:::pink\nIS[Interaction Strength]:::orange\nend\nsubgraph Sampling Parameters\nT[Temperature]\nTP[TOP-P]\nMP[MIN-P]\nTK[TOP-K]\nend\nLU --\u003e|Increase With| T\nLU --\u003e|Decrease With| T\nLU --\u003e|Increase With| TP\nAU --\u003e|Increase With| TP\nA --\u003e|Decrease With| T\nA --\u003e|Decrease With| TP\nA --\u003e|Decrease With| MP\nA --\u003e|Increase With| MP\nIS --\u003e|Increase With| TP\nIS --\u003e|Increase With| TK\nstyle Uncertainty fill:#e6e6e6,stroke:#666,stroke-width:1px\n%% Color coding for increases and decreases\nlinkStyle 0,2,3,7,8,9 stroke:#FF0000,color:#FF0000\nlinkStyle 1,4,5,6 stroke:#0000FF,color:#0000FF,stroke-dasharray: 3 3\nTemperature adjustment\n%%{init: {'theme':'base'}}%%\ngraph TD\nclassDef blue fill:#2374f7,stroke:#000,stroke-width:2px,color:#fff\nclassDef pink fill:#eb3dd6,stroke:#000,stroke-width:2px,color:#fff\nclassDef orange fill:#fc822b,stroke:#000,stroke-width:2px,color:#fff\n
classDef red fill:#ed2633,stroke:#000,stroke-width:2px,color:#fff\nclassDef green fill:#16b522,stroke:#000,stroke-width:2px,color:#fff\nA([Logits Uncertainty]):::blue --\u003e D[Temperature]:::green\nB([Attention Uncertainty]):::blue --\u003e D\nC([Agreement]):::blue --\u003e D\nD --\u003e E([Final Temperature]):::orange\nThe temperature is adjusted as: $T = T_{base} * (1 + 0.3 * U_{logits} + 0.2 * U_{attn} - 0.2 * A)$\nwhere\n$T_{base}$ is the unadjusted (default) temperature\n$U_{logits}$ is entropy + varentropy\n$U_{attn}$ is attention entropy + attention varentropy\n$A$ is the attention agreement\nTOP-P and TOP-K adjustment\nTOP-K\ntop_k_adj = max(5, int(top_k * (1 + 0.3 * interaction_strength - 0.2 * agreement)))\nTOP-P\ntop_p_adj = jnp.clip(base_top_p * (1 + 0.1 * metrics[\"attn_varentropy\"]), 0.1, 1.0)\nMinimum Probability Threshold\nmin_p = jnp.clip(base_min_p * (1 - 0.5 * logits_uncertainty), 0.01, 0.5)\nImplementation\nBefore the final sampling step, entropix computes the metrics and adapts the sampler accordingly, changing or strengthening how deterministic the sampling is, as shown below\n%%{init: {'theme':'base'}}%%\ngraph TD\nclassDef blue fill:#2374f7,stroke:#000,stroke-width:2px,color:#fff\nclassDef pink fill:#eb3dd6,stroke:#000,stroke-width:2px,color:#fff\nclassDef orange fill:#fc822b,stroke:#000,stroke-width:2px,color:#fff\nclassDef red fill:#ed2633,stroke:#000,stroke-width:2px,color:#fff\nclassDef green fill:#16b522,stroke:#000,stroke-width:2px,color:#fff\nA([Calculate Metrics]):::blue --\u003e B{Evaluate Entropy and Varentropy}:::pink\nB --\u003e|Low E, Low V| C[Greedy Sampling]:::orange\nB --\u003e|High E, Low V| D[Clarification Insertion]:::orange\nB --\u003e|Low E, High V| E[Exploration Sampling]:::orange\nB --\u003e|High E, High V| F[High Uncertainty Sampling]:::orange\nB --\u003e|Moderate Values| G[Adaptive Sampling]:::orange\nC --\u003e H([Generate Token]):::green\nD --\u003e H\nE --\u003e H\nF --\u003e H\nG --\u003e H\nNote that E and V here are the entropy and varentropy computed from the logits of the next-token prediction. In other words, the logits entropy and varentropy trigger one of the strategies, and each strategy applies its own sampling adjustment.\nFor convenience, the earlier summary is repeated here\nSummary of the four cases:\nLow Entropy, Low Varentropy: 
the model is highly confident and consistent about the next token. Greedy sampling is probably appropriate in this regime.\nHigh Entropy, Low Varentropy: the model is consistently uncertain about the next token. Clarification insertion or increased exploration is probably appropriate in this regime.\nLow Entropy, High Varentropy: the model has several different levels of confidence about the next token. Exploration sampling is a good fit in this regime.\nHigh Entropy, High Varentropy: the model is both uncertain and inconsistent about the next token. Because the uncertainty itself is so inconsistent in this regime, sampling parameters such as top-p, temperature and top-k may need to be adjusted.\nMetrics computation\ndef calculate_metrics(logits: jnp.ndarray, attention_scores: jnp.ndarray) -\u003e Dict[str, jnp.ndarray]:\n    # Entropy and varentropy of the logits\n    entropy, varentropy = calculate_varentropy_logsoftmax(logits)\n    # Attention probabilities from the attention scores, used for attention entropy and attention varentropy\n    attention_probs = jax.nn.softmax(attention_scores, axis=-1)\n    attn_entropy = -jnp.sum(attention_probs * jnp.log2(jnp.clip(attention_probs, 1e-10, 1.0)), axis=-1)\n    attn_varentropy = jnp.var(attn_entropy, axis=-1)\n    # Mean of the attention probabilities over axis 1\n    mean_attention = jnp.mean(attention_probs, axis=1)\n    # Attention agreement\n    agreement = jnp.mean(jnp.abs(attention_probs - mean_attention[:, None, :]), axis=(1, 2))\n    # Interaction strength\n    interaction_strength = jnp.mean(jnp.abs(attention_scores), axis=(1, 2, 3))\n    return {\n        \"logits_entropy\": jnp.mean(entropy),\n        \"logits_varentropy\": jnp.mean(varentropy),\n        \"attn_entropy\": jnp.mean(attn_entropy),\n        \"attn_varentropy\": jnp.mean(attn_varentropy),\n        \"agreement\": jnp.mean(agreement),\n        \"interaction_strength\": interaction_strength\n    }\nSampling code\ndef sample(gen_tokens: jax.Array, logits: jax.Array, attention_scores: jax.Array, cfg: SamplerConfig, clarifying_question_token: int = 2564, key=jax.random.PRNGKey(1337)) 
-\u003e jax.Array:\n    metrics = calculate_metrics(logits, attention_scores)\n    ent, vent = metrics[\"logits_entropy\"], metrics[\"logits_varentropy\"]\n    attn_ent, attn_vent = metrics[\"attn_entropy\"], metrics[\"attn_varentropy\"]\n    agreement = metrics[\"agreement\"]\n    interaction_strength = metrics[\"interaction_strength\"]\n\n    # Low Entropy, Low Varentropy: \"flowing with unspoken intent\"\n    if ent \u003c cfg.low_ent_thresh and vent \u003c cfg.low_vent_thresh:  # Note: ent and vent decide which adaptation strategy is used\n        return jnp.argmax(logits[:, -1], axis=-1, keepdims=True).astype(jnp.int32)\n\n    # High Entropy, Low Varentropy: \"treading carefully, asking clarifying questions\"\n    elif ent \u003e cfg.high_ent_thresh and vent \u003c cfg.low_vent_thresh:  # Insert a clarification question here to pin down what the user is actually asking\n        # Insert a clarifying question token if not already present\n        if not jnp.isin(gen_tokens[:,-1], clarifying_question_token).any():\n            return jnp.array([[clarifying_question_token]])\n        else:\n            # If we've just asked a question, sample with slightly higher temperature\n            temp_adj = cfg.helv_attn_ent_offset + cfg.helv_attn_ent_coef * attn_ent  # Increase temperature based on attention entropy\n            return _sample(logits, temperature=min(1.5, cfg.temp * temp_adj), top_p=cfg.top_p, top_k=cfg.top_k, min_p=cfg.min_p, key=key)\n\n    # Low Entropy, High Varentropy: \"exploring forks in the path\"\n    elif ent \u003c cfg.high_ent_thresh and vent \u003e cfg.high_vent_thresh:\n        temp_adj = cfg.lehv_interaction_strength_offset + cfg.lehv_interaction_strength_coef * interaction_strength  # Increase temperature based on interaction strength\n        top_k_adj = max(5, int(cfg.top_k * (1 + 0.5 * (1 - agreement))))  # Increase top_k when agreement is low\n        return _sample(logits, temperature=min(1.5, cfg.temp * temp_adj), top_p=cfg.top_p, top_k=top_k_adj, min_p=cfg.min_p, key=key)\n\n    # High Entropy, 
High Varentropy: \"resampling in the mist\"\n    elif ent \u003e cfg.med_ent_thresh and vent \u003e cfg.high_vent_thresh:\n        # Use high temperature and adjusted top_p based on attention metrics\n        temp_adj = cfg.hehv_attn_vent_offset + cfg.hehv_attn_vent_coef * attn_vent  # Increase temperature based on attention varentropy\n        top_p_adj = max(0.5, cfg.top_p - cfg.hehv_attn_ent_coef * attn_ent)  # Decrease top_p when attention entropy is high\n        return _sample(logits, temperature=max(2.0, cfg.temp * temp_adj), top_p=top_p_adj, top_k=cfg.top_k, min_p=cfg.min_p, key=key)\n\n    # Middle ground: use adaptive sampling\n    else:\n        logits_uncertainty = metrics[\"logits_entropy\"] + metrics[\"logits_varentropy\"]\n        attn_uncertainty = metrics[\"attn_entropy\"] + metrics[\"attn_varentropy\"]\n\n        temperature = cfg.temp * (1 + cfg.ada_temp_logits * logits_uncertainty + cfg.ada_temp_attn * attn_uncertainty - cfg.ada_temp_agree * metrics[\"agreement\"])\n        top_p = jnp.clip(cfg.top_p * (1 + cfg.ada_top_p * metrics[\"attn_varentropy\"]), 0.1, 1.0)\n        top_k = int(jnp.clip(\n            jnp.round(cfg.top_k * (1 + cfg.ada_top_k_int * metrics[\"interaction_strength\"].item() - cfg.ada_top_k_agree * metrics[\"agreement\"].item())),\n            a_min=1,\n            a_max=100\n        ))\n        min_p = jnp.clip(cfg.min_p * (1 - cfg.ada_min_p * logits_uncertainty), 0.01, 0.5)\n\n        keys = jax.random.split(key, cfg.n_adaptive_samples)\n\n        samples = []\n        for sample_key in keys:\n            sample = _sample(logits, temperature=temperature, top_p=top_p, top_k=top_k, min_p=min_p, key=sample_key)\n            samples.append(sample)\n\n        def score_sample(sample):\n            log_prob = jnp.sum(jax.nn.log_softmax(logits) * jax.nn.one_hot(sample, logits.shape[-1]))\n            confidence_score = (\n                (1 - metrics[\"logits_entropy\"]) * cfg.ada_score_logits_ent +\n                (1 - metrics[\"attn_entropy\"]) * cfg.ada_score_attn_ent +\n                (1 - metrics[\"logits_varentropy\"]) * 
cfg.ada_score_logits_vent +\n                (1 - metrics[\"attn_varentropy\"]) * cfg.ada_score_attn_vent +\n                metrics[\"agreement\"] * cfg.ada_score_agree +\n                metrics[\"interaction_strength\"] * cfg.ada_score_int\n            )\n            return log_prob + confidence_score\n\n        sample_scores = [score_sample(sample) for sample in samples]\n        best_sample_idx = jnp.argmax(jnp.array(sample_scores))\n        return samples[best_sample_idx]\nClarification insertion may be the hardest part to grasp, so here is an example:\nFor example\nInput: \"The best programming language for\"\nThis input is highly ambiguous, but that does not mean the model fails to understand it. The continuation could be Java, Python and so on, and the model could answer any of them well. That is why this is the High Entropy, Low Varentropy case, and the best strategy is clarification insertion: insert a question that pins down what is actually being asked.\nOutput: \" [CLARIFY] What specific task or criteria are you considering?\"\n","permalink":"https://LiuChaoXD.github.io/posts/large-language-models/entropix/","summary":"\u003cp\u003eA recently popular approach that uses an alternative sampling method to improve LLM reasoning ability and reduce hallucination: \u003cstrong\u003eEntropix\u003c/strong\u003e\u003c/p\u003e\n\u003ch2 id=\"基本概念\"\u003eBasic concepts\u003c/h2\u003e\n\u003cp\u003eMost popular LLM architectures today are based on the Transformer architecture. This architecture typically contains the following key parts:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eEmbedding Layer: converts the input tokens into vectors\u003c/li\u003e\n\u003cli\u003eSelf-Attention Layers: let the network learn the relationships between all tokens in the input text\u003c/li\u003e\n\u003cli\u003eFeed-Forward Layers: transform the output of the self-attention layers\u003c/li\u003e\n\u003cli\u003eLayer Normalization: stabilizes training\u003c/li\u003e\n\u003c/ol\u003e\n\u003cdiv class=\"mermaid\"\u003e%%{init: {'theme': 'base', 'themeVariables': { 'fontFamily': 'arial'}}}%%\ngraph LR\nclassDef blue fill:#2374f7,stroke:#000,stroke-width:2px,color:#fff\nclassDef pink fill:#eb3dd6,stroke:#000,stroke-width:2px,color:#fff\nclassDef orange fill:#fc822b,stroke:#000,stroke-width:2px,color:#fff\nclassDef red fill:#ed2633,stroke:#000,stroke-width:2px,color:#fff\nclassDef green fill:#16b522,stroke:#000,stroke-width:2px,color:#fff\n\n    A([Input]) --\u003e  G(Embedding):::blue\n    G --\u003e 
B(Self-Attention):::pink\n    B --\u003e C(Layer Norm):::orange\n    C --\u003e D(Feed Forward):::red\n    D --\u003e E(Layer Norm):::green\n    E --\u003e F([Output])\u003c/div\u003e\n\n\u003ch3 id=\"llm如何generate-text或者completion\"\u003eHow an LLM generates text or a completion\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003estep 1:\u003c/strong\u003e Input processing. The input text is tokenized and then mapped into vector space by the embedding layer\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003estep 2\u003c/strong\u003e: Forward processing. After embedding, the input passes through self-attention, layer norm, feed-forward and so on, and the model finally produces logits for every possible next token\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003estep 3\u003c/strong\u003e: Sampling. This is the focus of the technique discussed in this article. The parameters that most commonly influence the sampling result today are temperature, top-p and top-k\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003estep 4\u003c/strong\u003e: Repeat. Once the next token has been sampled, it is appended to the end of the input text, so the input becomes input text + “next token”, and the “next next token” is sampled\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"logits的作用\"\u003eThe role of logits\u003c/h3\u003e\n\u003cp\u003eLogits are not probabilities themselves: they are the unnormalized scores produced by the last layer, and the softmax function converts them into probabilities that sum to 1. That is,\u003c/p\u003e","title":"Entropy Based Sampling and Parallel CoT Decoding"},{"content":"This is a tester\n","permalink":"https://LiuChaoXD.github.io/posts/others/test-copy/","summary":"\u003cp\u003eThis is a tester\u003c/p\u003e","title":"Test"},{"content":"This is a tester\n","permalink":"https://LiuChaoXD.github.io/posts/others/test/","summary":"\u003cp\u003eThis is a tester\u003c/p\u003e","title":"Test"},{"content":"Learning keeps you young...\nLong-time industry work in research, algorithm research and development, and software development.\nResearch areas:\nLarge Scale Image Retrieval\nHashing Learning\nComputer Vision\nLarge Language Models\nAgent/Workflow development\nAI-powered Software development\nWorking toward becoming an independent developer...\n
Independently developed web apps:\nPDF2MindMap: a tool that automatically parses and deconstructs scientific papers and organizes them into a mind map\nAutoPrompter: a tool that writes high-quality prompts automatically from a user-defined task and context\nFor questions about the blog or collaboration, feel free to get in touch.\nContact: pdf2mindmap@gmail.com\n","permalink":"https://LiuChaoXD.github.io/about/","summary":"\u003cp\u003eLearning keeps you young\u0026hellip;\u003c/p\u003e\n\u003chr\u003e\n\u003cul\u003e\n\u003cli\u003eLong-time industry work in research, algorithm research and development, and software development.\u003c/li\u003e\n\u003cli\u003eResearch areas:\n\u003cul\u003e\n\u003cli\u003eLarge Scale Image Retrieval\u003c/li\u003e\n\u003cli\u003eHashing Learning\u003c/li\u003e\n\u003cli\u003eComputer Vision\u003c/li\u003e\n\u003cli\u003eLarge Language Models\u003c/li\u003e\n\u003cli\u003eAgent/Workflow development\u003c/li\u003e\n\u003cli\u003eAI-powered Software development\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003chr\u003e\n\u003cul\u003e\n\u003cli\u003eWorking toward becoming an independent developer\u0026hellip;\u003c/li\u003e\n\u003cli\u003eIndependently developed web apps\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003ePDF2MindMap\u003c/strong\u003e: a tool that automatically parses and deconstructs scientific papers and organizes them into a mind map\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAutoPrompter\u003c/strong\u003e: a tool that writes high-quality prompts automatically from a user-defined task and context\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003chr\u003e\n\u003cp\u003eFor questions about the blog or collaboration, feel free to get in touch.\u003c/p\u003e\n\u003cp\u003eContact: pdf2mindmap@gmail.com\u003c/p\u003e","title":"About Me"}]