[Question] Customizing special tokens #478

T145 · 2025-01-02T00:53:03Z

Let's say models A and B have their pad_token set to <|finetune_right_pad_id|>, while model C uses <|end_of_text|>. I'd like model C to use the same pad_token as A and B.

The MergeKit README has this example:

tokenizer:
  source: union
  tokens:
    # Use embedding from a specific model
    <|im_start|>:
      source: "path/to/chatml/model"

    # Force a specific embedding for all models
    <|special|>:
      source: "path/to/model"
      force: true

    # Map a token to another model's token embedding
    <|renamed_token|>:
      source:
        kind: "model_token"
        model: "path/to/model"
        token: "<|original_token|>"  # or use token_id: 1234

For my case, I'd interpret that as:

tokenizer:
  source: union
  tokens:
    # Force a specific embedding for all models
    <|finetune_right_pad_id|>:
      source: "A"
      force: true

    # Map a token to another model's token embedding
    <|end_of_text|>:
      source:
        kind: "model_token"
        model: "A"
        token: "<|finetune_right_pad_id|>"

Is that the right approach?
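
Whatever the answer is, I could at least check the result by comparing the pad_token entry each model directory reports. Here is a minimal sketch of that check, assuming every directory follows the Hugging Face layout and contains a tokenizer_config.json. The function names and the directory paths are hypothetical, and none of this is part of MergeKit.

```python
import json
from pathlib import Path


def read_pad_token(model_dir):
    """Return the pad token string from tokenizer_config.json, or None if unset."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    pad = config.get("pad_token")
    # Special tokens are sometimes serialized as objects like {"content": "..."}.
    if isinstance(pad, dict):
        return pad.get("content")
    return pad


def pad_tokens_match(model_dirs):
    """Return True if every directory reports the same, non-missing pad token."""
    pads = {read_pad_token(d) for d in model_dirs}
    return len(pads) == 1 and None not in pads
```

For example, pad_tokens_match(["A", "B", "merged-output"]) (placeholder paths) should return True only if the merged tokenizer reports the same pad token as A and B. This only inspects the tokenizer config; it says nothing about which embedding ended up assigned to the token.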

T145 closed this as completed Jan 3, 2025
