Commit 4c72efc (1 parent: de4555e) — showing 3 changed files with 43 additions and 18 deletions.
First changed file (notes on the llama2 config):

@@ -1,7 +1,6 @@
-not using padding, so pad_token_id not set
-use_cache - using default
-pretraining_tp - experimental parallelization we're not using, which is the default
-tie_word_embeddings - llama2 used False and this is better for interpretability, note that llama2.c is using True by default, which is probably more efficient use of parameters for very small models
-rope settings are widely used defaults
-attention_bias - no biases on QKV and output projection is the default and that's what we're using
-attention_dropout - this is the only dropout llama2 can use, it's set to prob=0 by default and that's what we're using
+- use_cache - using default
+- pretraining_tp - experimental parallelization we're not using, which is the default
+- tie_word_embeddings - llama2 used False and this is better for interpretability, note that llama2.c is using True by default, which is probably more efficient use of parameters for very small models
+- rope settings are widely used defaults
+- attention_bias - no biases on QKV and output projection is the default and that's what we're using
+- attention_dropout - this is the only dropout llama2 can use, it's set to prob=0 by default and that's what we're using
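Read together, these notes describe a llama2-style config that keeps the library defaults everywhere except for explicitly untied embeddings. A minimal sketch of what that looks like with a transformers-style LlamaConfig (v4.40-era parameter names; the small model dimensions are illustrative placeholders, not values taken from this commit):

```python
from transformers import LlamaConfig

config = LlamaConfig(
    # Illustrative small-model dimensions (placeholders, not from the commit).
    vocab_size=4096,
    hidden_size=512,
    intermediate_size=1376,
    num_hidden_layers=8,
    num_attention_heads=8,
    # pad_token_id is left unset: no padding is used.
    use_cache=True,             # library default
    pretraining_tp=1,           # experimental parallelization disabled (default)
    tie_word_embeddings=False,  # untied, as in llama2 (llama2.c ties them by default)
    rope_theta=10000.0,         # widely used RoPE default
    attention_bias=False,       # no biases on QKV/output projections (default)
    attention_dropout=0.0,      # the only dropout llama2 can use; kept at 0
)
```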
Second changed file (notes on the mamba config):

@@ -1,10 +1,8 @@
-pad_token_id - we're not using pad tokens, do we don't set it
-layer_norm_eps - different than rms norm eps in mamba
-initializer_range - different in mamba & llama
-residual_in_fp32 - mamba specific parameter
-time_step_* - mamba specific, sane defaults
-there is no way to untie embeddings and unembeddings in mamba, they're tied by default
-https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
-rescale_prenorm_residual was True in original paper, so we set it to True, despite HF default being false
-using default for use_cache
-state_size is default
+- layer_norm_eps - different than rms norm eps in llama
+- initializer_range - different in mamba & llama
+- residual_in_fp32 - mamba specific parameter
+- time_step_* - mamba specific, sane defaults
+- there is no way to untie embeddings and unembeddings in mamba, they're tied by default https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/mamba/modeling_mamba.py#L602-L610
+- rescale_prenorm_residual was True in original paper, so we set it to True, despite HF default being false
+- using default for use_cache
+- state_size is default
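And the mamba side, as a minimal sketch assuming the transformers v4.40 MambaConfig parameter names (where the notes' layer_norm_eps corresponds to layer_norm_epsilon); sizes are again illustrative placeholders, and embeddings stay tied because the linked modeling code offers no way to untie them:

```python
from transformers import MambaConfig

config = MambaConfig(
    # Illustrative small-model dimensions (placeholders, not from the commit).
    vocab_size=4096,
    hidden_size=512,
    num_hidden_layers=8,
    state_size=16,                  # library default
    layer_norm_epsilon=1e-5,        # mamba's norm eps; a different knob than llama's rms_norm_eps
    initializer_range=0.1,          # differs between the mamba and llama defaults
    residual_in_fp32=True,          # mamba-specific parameter
    time_step_rank="auto",          # time_step_* left at the sane defaults
    rescale_prenorm_residual=True,  # True in the original paper, False in HF by default
    use_cache=True,                 # library default
)
```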