Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A few thoughts/questions: #3

Open
josephrocca opened this issue Apr 16, 2024 · 5 comments
Open

A few thoughts/questions: #3

josephrocca opened this issue Apr 16, 2024 · 5 comments

Comments

@josephrocca
Copy link

josephrocca commented Apr 16, 2024

Please excuse any ignorance here - I don't have a lot of experience in the lower levels / finer details of tokenizers and fine-tuning. Also, I don't expect a reply to each of these points, to be clear - just dropping some thoughts for you to skim in case they're helpful.

  • Seems like startofthought should be renamed to thought_start to be consistent with im_start? Or vice-versa. I'm also kind of curious what im means / stands for. A name like message_start would make more sense to me, in terms of clarity/consistency. Is there a reason to follow ChatML in this respect? fim_x standing for fill-in-middle x adds a little more confusion too, since I'm guessing im_start doesn't stand for in-middle start.
  • If it's possible to start pretty fresh here, why not use an explicit syntax for the concept of "closing tags" so it's clear that _start/_end aren't part of the name/semantics of the tag? Something XML-like would make nesting and grouping more natural and cleanly extensible I think. E.g. instead of <|thought_start|> <|thought_end|> it could be <|thought:|> <|:thought|> or <|thought:start|> <|thought:end|> - and ditto for message. XML also has self-closing tags to take inspiration from - could be used for separators like file_separator and fim_middle - for the example colon syntax above you'd just leave out the colon like <|my_self_closing_tag|>.
  • Are file_separators used within messages? If so, must they always come at the start or end of a message? If not, then it seems like you'd want to have enclosing-style tags rather than just a separator/delimiter? Otherwise you can't embed "files" in the middle of a message, if I'm understanding correctly. Also wondering whether it actually makes sense to add this as an spec-level abstraction, rather than just leaving it to userland - like the Claude-style semantic XML recommendation - e.g. <snippet>...</snippet> or <memory>...</memory>. Maybe I'm missing the point of what "file" refers to here though.
  • Is the choice of <s> and </s> somehow constrained by pretraining? Given that different base models use different BOS/EOS, I'm guessing not? (Or will BOS/EOS tag change depending on the base model, unlike the other tags?) If it's not constrained, then it seems less than ideal to use this since it very plainly clashes with HTML's strikethrough tag, which is still regularly used. I'm not sure how well escaping and unescaping is handled in training and inference libs - TGI doesn't seem to be able to generate strikethrough HTML with llama 2, at least - guessing it was a pretraining oversight, and also guessing unescaping isn't currently automatic in popular inference libs anyway, but would be happy to be wrong there.
  • For creative/entertainment applications, there are reasonable uses cases where you e.g. have multiple users interacting with multiple characters and where users want to temporarily assume the role of one of the characters, while letting the LLM act/speak for that character the rest of the time. I think there are some things to consider for this sort of use case - e.g. for a user to take the role of another character, it might pay to make it easy to stop generation before that character's response by ensuring that the spec is strict on name=foo always coming immediately after the role (assuming that there may be other metadata alongside name in the future), with a single space between them. That way a simple stop sequence can be used to interrupt generation right before the character in question was about to speak.
@electricazimuth
Copy link

Thinking about this in terms of a user interface and trying to reduce cognitive load for the non technical users that are typing these in, I think the full "word-y" versions that have _end and _start in the markup tags is probably best.

I do prefer having an underscore (start_of_thought) version rather than the compressed (startofthought) version. In terms of using colons or back slashes in an XML style, I think its really easy for a tired eyed or distracted user to mix up <|thought:|> with <|:thought|> , just looking at that now, and searching for where the ":" should go gives me an annoying amount of cognitive load!

In a hope for instant understandability I'd suggest changing the "im" prefix to just "message" eg im_start => message_start

I love the idea of having a rule to symantically annotate supplied data using something like "file_separator", currently I have to trial different "user land" solutions (using new lines with asterix / equals signs etc..) a lot of which fail, it would be a huge benefit for my workflow to have at least something in the spec to aim for. Although "file_separator" could be something more generic like "data_separator" or "info_snippet" in my case I wouldn't class the stuff I'm supplying as files.

@SamuelTallet
Copy link

SamuelTallet commented Apr 16, 2024

im should stand for input message

Source: https://community.openai.com/t/what-do-the-im-start-and-im-end-tokens-mean/145727/2

@electricazimuth
Copy link

This is the point; the terms should be semantic and no one should need to search to find out what they mean, using plain and direct terms helps everyone use it, in this case they'll all get tokenised so length isn't much of an issue, there's no reason to use a shortened, technical or unobvious term.

@josephrocca
Copy link
Author

josephrocca commented Apr 17, 2024

Agreed.

FWIW, I'd definitely change _start and _end to something which makes them distinct from the name of the tag - e.g. something like <|thought:start|> <|thought:end|> keeps syntactic clarity and avoids tired-eyes mistakes that you mentioned earlier. Choosing to mix start/end semantics with tag name syntax seems nice for the simplicity, but is something that I think could come back to bite if this needs to be extended - e.g. you can end up with stuff like <|foo_start_start|> or <|start_foo_start|>, and potential separators that have start/end (or synonyms) in their name could be confusing. If :end/:start isn't used, then separator tags should probably always end in _separator - which isn't too bad, but it could just be <|foo|> (i.e. no :end/:start implies self-closing tag).

This may all seem pedantic, but I see no downside in being explicit here, and indeterminate upside. This field is young enough that it may turn out we some weird stuff - e.g. tags with properties, at which point you'd of course not want to use underscores to delimit the properties too. I wouldn't over-complicate a spec for some completely unforeseen factors that might require it, of course (can always just shrug and write a new spec), but I don't see this as a complication - rather, I see _start, _end as the complication (i.e. overloading of _).

@WolframRavenwolf
Copy link

WolframRavenwolf commented Apr 17, 2024

Great arguments here from everyone involved! You've said everything that's on my mind at the moment. So I have nothing to add except a few thumbs up and encouragement for everyone else who comes here to read all the comments at length!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants