
grammars: x{min,max} repetition operator #6640

Merged
merged 36 commits into from
Jun 6, 2024

Conversation

ochafik
Collaborator

@ochafik ochafik commented Apr 12, 2024

Add bounded repetition operators x{n}, x{,n}, x{m,n}, x{m,} to GBNF (unified w/ + / * / ?), and update JSON schema converters to use them.

Also improved the parser test with support for pretty-printing expectations, for easier updates.

# git remote add ochafik https://github.com/ochafik/llama.cpp
# git fetch ochafik && git checkout ochafik/grammar-reps
# make clean && make -j LLAMA_CURL=1 main

./main \
  -mu https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf \
  --grammar 'root ::= [a-zA-Z.,: \n]{0,100}' \
  -p "Here is a haiku for you: " \
  --no-display-prompt --log-disable --seed 42

Cherry blossoms bloom

Softly whispering spring breeze

Nature paints beauty

This haiku reflects 

Notes:

Rewrite rules

Used to be:

S* --> S' ::= S S' |
S+ --> S' ::= S S' | S
S? --> S' ::= S |

Now it's:

S{m,n} --> S S S (m times) S'(n-m)
           S'(x)   ::= S S'(x-1) |
           (... n-m definitions of these S' rules ...)
           S'(1)   ::= S |
S{m,} -->  S S S (m times) S'
           S'     ::= S S' |
S*     --> S{0,}
S+     --> S{1,}
S?     --> S{0,1}

Which means in practice * / ? don't change but + does:

S*     --> S'     ::= S S' |
S+     --> S S'
           S'     ::= S S' |
S?     --> S'     ::= S |
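The new scheme can be sketched in Python. This is a simplified illustration of the rewrite above, not the actual C++ parser code; the rule-naming convention is invented here:

```python
def rewrite_repetition(sym, min_times, max_times=None):
    """Rewrite sym{min,max} into plain BNF rules, mirroring the scheme above.

    Returns (expansion, rules): `expansion` is the symbol sequence replacing
    sym{min,max}, and `rules` maps generated rule names to lists of
    alternatives (each alternative is itself a list of symbols; [] is epsilon).
    """
    rules = {}
    # Mandatory prefix: sym repeated min_times times.
    expansion = [sym] * min_times
    if max_times is None:
        # sym{m,} -> m copies of sym followed by one unbounded tail rule.
        tail = f"{sym}'"
        rules[tail] = [[sym, tail], []]  # S' ::= S S' |
        expansion.append(tail)
    else:
        # sym{m,n} -> a chain of n-m optional rules S'(1) .. S'(n-m).
        prev = None
        for i in range(1, max_times - min_times + 1):
            name = f"{sym}'({i})"
            body = [sym] if prev is None else [sym, prev]
            rules[name] = [body, []]  # S'(i) ::= S S'(i-1) |
            prev = name
        if prev is not None:
            expansion.append(prev)
    return expansion, rules
```

With this, `rewrite_repetition("S", 1)` reproduces the new `+` expansion (`S S'` with `S' ::= S S' |`), and `rewrite_repetition("S", 0, 1)` reproduces `S? --> S' ::= S |`.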

TODO before undrafting:

cc/ @HanClinto (thanks for casting doubts on the rules rewrite in #4218 (comment) !)
cc/ @ejones

grammars/README.md Outdated Show resolved Hide resolved
@HanClinto
Collaborator

HanClinto commented Apr 12, 2024

I like the pretty-print options you've put in here. One thing we might want to consider is porting your pretty-print functions to gbnf-validator.cpp so that people can pretty-print arbitrary grammars?

@HanClinto
Collaborator

cc/ @HanClinto (thanks for casting doubts on the rules rewrite in #4218 (comment) !)

haha -- for sure! If nothing else, hopefully I'm good at casting doubt. :)

Daydreaming about this problem, and I'm pondering tackling it a completely different way.

An alternate approach I was dreaming about is adding a new operator type to llama_gretype that is something like LLAMA_GRETYPE_REPEAT_N, and the value of the llama_grammar_element wouldn't hold a rule ID or a unicode code point, but instead the number of times that the previous rule should be repeated.

So something like "a"{2,5} would hydrate to something like: (essentially "a"{2} ("a"{3} | ))

{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 2},
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 3},
{LLAMA_GRETYPE_ALT, 0},
{LLAMA_GRETYPE_END, 0}

Or something like that.

Then when building the stacks in llama_advance_grammar, the REPEAT_N operator would essentially function like a rule ref, except that it would hydrate the previous rule and, if the repeat .value is > 1, append another REPEAT_N operator with value - 1.

Not sure if I'm making sense or not, but what I like about this is that it (hopefully) wouldn't need to hydrate specific rules for repeat = 1000 -- we could just set .value to 1000 and let it advance a bit more intelligently.

No idea if this would work or not, but this is a very loose daydream of an idea that has been flitting around in my head for a few days, and I wanted to write it down before it escaped again. :)

I've not made any progress on doing a POC for this or anything -- I'm just wondering about the feasibility of this approach right now.
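The idea above might be modeled roughly like this -- a toy sketch of a stack that carries repeat counters instead of hydrated rule copies. All the names here are hypothetical illustrations of the concept, not llama.cpp code:

```python
# Toy model of the proposed REPEAT_N idea: the grammar stack carries
# (element, remaining_repeats) pairs, so "a"{1000} needs a single entry
# with a counter of 1000 instead of 1000 hydrated rule copies.

def accept_char(stack, ch):
    """Try to consume `ch` against the top of the stack.

    Returns the new stack, or None if `ch` is rejected.
    """
    if not stack:
        return None
    expected, remaining = stack[-1]
    if ch != expected:
        return None
    if remaining > 1:
        # Decrement the repeat counter instead of popping the entry.
        return stack[:-1] + [(expected, remaining - 1)]
    return stack[:-1]

def matches(pattern_char, n, text):
    """Check whether `text` is exactly `pattern_char` repeated n times."""
    stack = [(pattern_char, n)]
    for ch in text:
        stack = accept_char(stack, ch)
        if stack is None:
            return False
    return stack == []  # accept only if the stack is fully consumed
```

The point of the sketch is the memory behavior: accepting `"a"{100000}` mutates one counter per character rather than building 100000 stack entries up front.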

@ochafik
Collaborator Author

ochafik commented Apr 12, 2024

llama_grammar_element wouldn't hold a rule ID or a unicode code point, but instead the number of times that the previous rule should be repeated.

@HanClinto I did wonder about something like this, I think it would work and the stack might just need to be made of (edit) a vector of element references + the number of times each has been repeated.

I wonder how much better it would perform, maybe it would allow breaking the new 100k repetition barrier?

On my side I'm probably done for this week, but keen to explore the head set optimization route later (cf. #4218 (comment) ; incubating here, which is mostly painful refactoring & prep work, next step is to update the stack with head-plausible alternatives only when accepting a char).

Edit: half of me is hoping you'll have implemented this alternative approach by the end of the weekend, the other half is hoping my silly head set code won't have bitrotten too much with fancy new alternative logic :-p

@ochafik
Collaborator Author

ochafik commented Apr 12, 2024

So something like "a"{2,5} would hydrate to something like: (essentially "a"{2} ("a"{3} | ))
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 2},
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 3},
{LLAMA_GRETYPE_ALT, 0},
{LLAMA_GRETYPE_END, 0}

@HanClinto Might want to add different types for repeat exactly N, repeat unbounded, and repeat up to N times.

@HanClinto
Collaborator

llama_grammar_element wouldn't hold a rule ID or a unicode code point, but instead the number of times that the previous rule should be repeated.

@HanClinto I did wonder about something like this, I think it would work and the stack might just need to be made of an element reference + times it's been repeated.

Okay -- I think that's probably enough of an encouragement for me to at least try taking a stab at it this weekend.

I wonder how much better it would perform, maybe it would allow breaking the new 100k repetition barrier?

Yeah, that's what I'm dreaming of! :)

On my side I'm probably done for this week, but keen to explore the head set optimization route later (cf. #4218 (comment) ; incubating here, which is mostly painful refactoring & prep work, next step is to update the stack with head-plausible alternatives only when accepting a char).

This sounds really hopeful! I don't think I'm as familiar with classical grammar optimizations as you are, but this sounds really reasonable. Looking through the code, I think that there is a LOT of possible optimization work to be done in llama_grammar_reject_candidates / llama_grammar_reject_candidates_for_stack, which I think is the sort of thing that you're talking about? Would that be akin to taking the call to llama_grammar_advance_stack that's inside llama_grammar_reject_candidates_for_stack and passing in the list of candidates, so that we only advance grammar with stacks that could match the candidates...?

I could be seriously off base, but regardless, I'm excited to see what you have in store -- especially if it's a generally proven optimization commonly applied to parsers. :)

Edit: half of me is hoping you'll have implemented this alternative approach by the end of the weekend, the other half is hoping my silly head set code won't have bitrotten too much with fancy new alternative logic :-p

haha -- fair enough. :D I'll take a stab at it and see how far I get. If I can't get something working in a weekend, it might be a deadend, so we'll see what Monday brings.

BTW, hacking on this with you has been a TON of fun -- thank you!!

@HanClinto
Collaborator

HanClinto commented Apr 12, 2024

So something like "a"{2,5} would hydrate to something like: (essentially "a"{2} ("a"{3} | ))
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 2},
{LLAMA_GRETYPE_CHAR, 'a'},
{LLAMA_GRETYPE_REPEAT_N, 3},
{LLAMA_GRETYPE_ALT, 0},
{LLAMA_GRETYPE_END, 0}

@HanClinto Might want to add different types for repeat exactly N, repeat unbounded, and repeat up to N times.

The way I'm imagining it, LLAMA_GRETYPE_REPEAT_N would mean "repeat exactly N times" (with a possible special case where value == 0 means to repeat unbounded).

I was originally imagining that repeating up to N times would be handled with LLAMA_GRETYPE_REPEAT_N with value == N, and a blank alternate rule to follow (similar to how S* --> S' ::= S S' | so S{0,N} --> S' ::= S{N} |) -- but now that you say this, I realize that would mean "a"{2,5} would equate to exactly 2 copies or exactly 5 copies -- but not 3 or 4. You're right -- I might need a second token that means "up to N".
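The gap can be checked by enumerating both languages, using regexes as a quick stand-in for the grammar:

```python
import re

# "a"{2,5}: what the grammar is supposed to accept.
intended = re.compile(r"a{2,5}\Z")
# "Exactly 2" followed by an optional "exactly 3" -- the flawed scheme
# described above, i.e. "a"{2} ("a"{3} | ).
exact_with_blank_alt = re.compile(r"a{2}(a{3})?\Z")

candidates = ["a" * n for n in range(7)]
intended_set = {s for s in candidates if intended.match(s)}
flawed_set = {s for s in candidates if exact_with_blank_alt.match(s)}
# intended_set accepts 2..5 repetitions; flawed_set accepts only 2 or 5,
# missing "aaa" and "aaaa" -- hence the need for a separate "up to N" token.
```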

Thank you for helping me think this through! I'm still not confident that it's going to work, but I'm increasingly confident that it would be worth exploring.

Regardless, unless I'm really off base with my understanding of what the head set optimization involves, I don't imagine that this new token would step on those toes at all, and I think that's worth exploring independently.

@ejones
Collaborator

ejones commented Apr 26, 2024

Love this work @ochafik! I plan to take a look soon, need to wrap my head around the parser changes and the new JSON schema converter (now in C++?).

@ochafik
Collaborator Author

ochafik commented Apr 26, 2024

Love this work @ochafik!

@ejones thanks!!

need to wrap my head around the parser changes

Love your work on grammar sampling btw! Took me a while to wrap my own head around it, it's smart!

(Some related changes you might have missed: #6616, #6609, and upcoming #6644)

and the new JSON schema converter (now in C++?).

Hehe yeah, as I was improving the schema support in #5978 I realised I had to port the changes to JS, and thought it would be cool to have it in the server (like llama-cpp-python & Together AI do, w/ param "response_format": {"type": "json_object", "schema": ...}). There's a test that keeps the 3 versions in sync but I reckon we'll want to ditch the JS & Python versions at some point (replacing them w/ a C++ cli, once the C++ version can use libcurl to fetch remote $refs).
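For illustration, a request body following that convention might look like this. The schema and message content here are hypothetical examples, and the exact endpoint and parameter shape accepted by llama.cpp's server may differ:

```python
import json

# Hypothetical request body using the "response_format" convention mentioned
# above (as in llama-cpp-python / Together AI): a JSON schema constrains the
# model's output, which the server compiles down to a GBNF grammar.
payload = {
    "messages": [{"role": "user", "content": "Describe a person as JSON."}],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
}
body = json.dumps(payload)  # serialized request body to POST to the server
```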

(other json updates: #6555, #6232, #6661, #6659)

@mofosyne mofosyne added enhancement New feature or request Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 10, 2024
@ochafik
Collaborator Author

ochafik commented May 18, 2024

@ejones @HanClinto lemme know if I can help clarify any changes :-)

@github-actions github-actions bot added testing Everything test related examples python python script changes server labels May 18, 2024
Collaborator

@HanClinto HanClinto left a comment


I'm sorry for taking so long to get back to this. This all looks good to me, and it will be great to get this merged in!

@ochafik ochafik merged commit 55b2d08 into ggerganov:master Jun 6, 2024
66 checks passed
@ShelbyJenkins

Howdy, {0,N} seems broken. It only ever returns 0.

{"timestamp":"2024-06-06T23:07:01.780027Z","level":"INFO","request":"LlamaCompletionsRequest { prompt: \"instructions: For each discrete topic in this text, provide a short ELI5 sentence describing the topic.\\nuser input: In computer science, Backus–Naur form (/ˌbækəs ˈnaʊər/) (BNF or Backus normal form) is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.\\n\\nOver time, many extensions and variants of the original Backus–Naur notation have been created; some are exactly defined, including extended Backus–Naur form (EBNF) and augmented Backus–Naur form (ABNF). Invented in 1976.\", grammar: Some(\"root ::= item{0,7}\\nitem ::= \\\"- \\\" [^\\\\r\\\\n\\\\x0b\\\\x0c\\\\x85\\\\u2028\\\\u2029]+ \\\"\\\\n\\\"\\n\"), cache_prompt: None, frequency_penalty: Some(0.0), logit_bias: None, n_predict: Some(321), presence_penalty: Some(0.0), stop: None, stream: None, temperature: Some(1.0), top_p: Some(1.0) }","target":"llm_client::llm_backends::llama_cpp"}
{"timestamp":"2024-06-06T23:07:01.932964Z","level":"INFO","completion":"LlamaCompletionResponse { content: \"\", model: \"/root/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/blobs/d7efa06c16dc522fe7e6a48e36d17cc42dcadfe581ae1cdf7be9f51734eaf85d\", prompt: \"instructions: For each discrete topic in this text, provide a short ELI5 sentence describing the topic.\\nuser input: In computer science, Backus–Naur form (/ˌbækəs ˈnaʊər/) (BNF or Backus normal form) is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.\\n\\nOver time, many extensions and variants of the original Backus–Naur notation have been created; some are exactly defined, including extended Backus–Naur form (EBNF) and augmented Backus–Naur form (ABNF). Invented in 1976.\", generation_settings: LlamaCompletionGenerationSettings { n_ctx: 2048, frequency_penalty: 0.0, presence_penalty: 0.0, temperature: 1.0, top_p: 1.0, n_predict: -1, logit_bias: [], grammar: \"root ::= item{0,7}\\nitem ::= \\\"- \\\" [^\\\\r\\\\n\\\\x0b\\\\x0c\\\\x85\\\\u2028\\\\u2029]+ \\\"\\\\n\\\"\\n\", stop: [] }, stop: true, stopped_eos: true, stopped_limit: false, stopped_word: false, stopping_word: \"\", timings: {\"prompt_per_token_ms\": 0.48912728, \"predicted_ms\": 0.069, \"predicted_per_second\": 14492.754, \"predicted_per_token_ms\": 0.069, \"prompt_n\": 220.0, \"predicted_n\": 1.0, \"prompt_ms\": 107.608, \"prompt_per_second\": 2044.4576}, tokens_cached: 220, tokens_evaluated: 220, truncated: false }","target":"llm_client::llm_backends::llama_cpp"}

Also, with a range of numbers, it always seems to try to return as many items as possible, even if it means repeating the last one!

@HanClinto
Collaborator

Howdy, {0,N} seems broken. It only ever returns 0.

Out of curiosity, what if you try this with a different model? I wonder if we're running into odd edge-case tokenization bugs.

Also, I noticed different responses whether I had a space at the end of my input prompt or not. Maybe play with that as well?

@ochafik
Collaborator Author

ochafik commented Jun 9, 2024

Howdy, {0,N} seems broken. It only ever returns 0.


Also, with a range of numbers, it always seems to try to return as many as possible even if it means repeating the last!

@ShelbyJenkins Depending on the model & task, I've sometimes had more success using greedy sampling (--samplers temperature --temp 0) and/or better prompts.

If you are able to share a self-contained CLI invocation of a pathological case, that would help! The example from this PR doesn't consume all the repetitions allowed even when cranked up to 10k, for instance:

./main \
  -mu https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf \
  --grammar 'root ::= [a-zA-Z.,: \n]{0,10000}' \
  -p "Here is a haiku for you: " \
  --no-display-prompt --log-disable --seed 42
Amidst the bustling crowd,
A single, crimson leaf falls down,
Whispering goodbyes.

This haiku captures the essence of the changing seasons, using the image of a single leaf falling among a busy crowd to represent the transient nature of life. The use of the color crimson adds a touch of warmth and vibrancy, while the whispering goodbyes evoke a sense of melancholy and nostalgia. Overall, this haiku creates a serene and contemplative mood.<|end|>

@ShelbyJenkins

ShelbyJenkins commented Jun 10, 2024

If you are able to share a self-contained CLI invocation of a pathological case, that would help!

./main \
  -m /root/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/blobs/d7efa06c16dc522fe7e6a48e36d17cc42dcadfe581ae1cdf7be9f51734eaf85d \
  --grammar 'root ::= item{0,7}
  item ::= "- " [^\r\n\x0b\x0c\x85\u2028\u2029]+ "\n"' \
  -p "instructions: For each discrete topic in this text, provide a short ELI5 sentence describing the topic.\\nuser input: In computer science, Backus–Naur form (/ˌbækəs ˈnaʊər/) (BNF or Backus normal form) is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.\\n\\nOver time, many extensions and variants of the original Backus–Naur notation have been created; some are exactly defined, including extended Backus–Naur form (EBNF) and augmented Backus–Naur form (ABNF). Invented in 1976." \

For {0,n}:
Here is the call from the log I shared, in CLI form. Same result with Llama-3b. Actually, this is really weird: via the CLI it's generating completely random (but coherent) outputs, formatted in the grammar. It's possible that the way I'm structuring the call through the server API isn't correct?

grammar: Some(\"root ::= item{0,7}\\nitem ::= \\\"- \\\" [^\\\\r\\\\n\\\\x0b\\\\x0c\\\\x85\\\\u2028\\\\u2029]+ \\\"\\\\n\\\"\\n\")

For {n,n}:
RE: the issue of the model attempting to use all available repetitions in {n,n}, one thing I've figured out is to use a stop word to terminate generation.

@ochafik
Collaborator Author

ochafik commented Jun 11, 2024

@ShelbyJenkins Oh this one is weirding me out indeed.

A workaround seems to be root ::= "\n" item{0,7}, but it feels like the totally-empty case is broken in grammar-constrained inference, even though it somehow works in the integration tests (where I cannot repro the issue, cc/ @HanClinto).

Could you please open a bug so we can keep tracking this?

Tested w/ Llama3 and Phi3, even tried adding things like Format the result as a bullet list (starting with '- '), to no avail

./main --log-disable --no-display-prompt \
  -mu https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf \
  --grammar 'root ::= item{0,7}
  item ::= "- " [^\r\n\x0b\x0c\x85\u2028\u2029]+ "\n"' \
  -p "instructions: For each discrete topic in this text, provide a short ELI5 sentence describing the topic.\\nuser input: In computer science, Backus–Naur form (/ˌbækəs ˈnaʊər/) (BNF or Backus normal form) is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.\\n\\nOver time, many extensions and variants of the original Backus–Naur notation have been created; some are exactly defined, including extended Backus–Naur form (EBNF) and augmented Backus–Naur form (ABNF). Invented in 1976." \

./main --log-disable --no-display-prompt \
  -mu https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF/resolve/main/Phi-3-mini-4k-instruct-Q8_0.gguf \
  --grammar 'root ::= item{0,7}
  item ::= "- " [^\r\n\x0b\x0c\x85\u2028\u2029]+ "\n"' \
  -p "instructions: For each discrete topic in this text, provide a short ELI5 sentence describing the topic.\\nuser input: In computer science, Backus–Naur form (/ˌbækəs ˈnaʊər/) (BNF or Backus normal form) is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.\\n\\nOver time, many extensions and variants of the original Backus–Naur notation have been created; some are exactly defined, including extended Backus–Naur form (EBNF) and augmented Backus–Naur form (ABNF). Invented in 1976." \

@ExtReMLapin
Contributor

I'm trying to port this to the Python wrapper but I did something wrong, and god knows why the author decided to re-implement everything in Python.

If someone's ready to help, feel free to add a review there: abetlen/llama-cpp-python#1637
