support for llguidance grammars #10224
base: master
Conversation
Awesome. Is this a llama.cpp implementation of dottxt-ai/outlines?
No. The approach of llguidance is more like the current llama.cpp grammars, in that both compute the token mask on the fly. Outlines pre-computes the token masks for all states of the automaton resulting from compiling the constraint. This limits the expressiveness of constraints and has high startup costs (though of course near-zero sampling costs). I added some notes on this to the llguidance readme.
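To illustrate the difference, here is a toy, purely hypothetical sketch (not the API of Outlines, llguidance, or llama.cpp): an Outlines-style engine precomputes the allowed-token mask for every state of the compiled automaton at startup, while an on-the-fly engine computes the mask only for the state the generation is currently in.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy character-level automaton standing in for a compiled constraint.
// advance() returns -1 when the byte is not allowed in the given state.
struct ToyAutomaton {
    int num_states = 2;
    int advance(int state, char c) const {
        // Accepts strings of the toy pattern 'a'* 'b' 'c'*
        if (state == 0) return c == 'a' ? 0 : c == 'b' ? 1 : -1;
        if (state == 1) return c == 'c' ? 1 : -1;
        return -1;
    }
};

// A token is allowed in a state if the automaton can consume all of its bytes.
static bool token_allowed(const ToyAutomaton & a, int state, const std::string & tok) {
    for (char c : tok) {
        state = a.advance(state, c);
        if (state < 0) return false;
    }
    return true;
}

// Precomputed style: build a mask for every state up front.
// High startup cost, near-zero per-step cost.
static std::vector<std::vector<bool>> precompute_masks(
        const ToyAutomaton & a, const std::vector<std::string> & vocab) {
    std::vector<std::vector<bool>> masks(a.num_states, std::vector<bool>(vocab.size()));
    for (int s = 0; s < a.num_states; s++)
        for (size_t t = 0; t < vocab.size(); t++)
            masks[s][t] = token_allowed(a, s, vocab[t]);
    return masks;
}

// On-the-fly style: compute the mask only for the current state at each step.
// No startup cost, but work is done at every sampling step.
static std::vector<bool> mask_on_the_fly(
        const ToyAutomaton & a, int state, const std::vector<std::string> & vocab) {
    std::vector<bool> mask(vocab.size());
    for (size_t t = 0; t < vocab.size(); t++)
        mask[t] = token_allowed(a, state, vocab[t]);
    return mask;
}
```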
Really excited to see this -- great work!
Out of curiosity, how does llguidance compare to llama.cpp's current grammar engine? Current JSON schema -> GBNF limitations are documented here: https://github.com/ggerganov/llama.cpp/tree/master/grammars#json-schemas--gbnf

I would love to do some speed tests with similarly complex grammars to see how llguidance compares against our stock implementation. Have you happened to run any of those tests yourself?
I believe llguidance JSON schema coverage will be more comprehensive once guidance-ai/llguidance#48 is merged; we should have more data on this soon. As for performance, on the example I ran there doesn't seem to be much difference between llama.cpp grammars, llguidance, and unconstrained generation. I think this is because llama.cpp grammars use rejection sampling (resampling), while llguidance currently always computes the full mask (though I want to also allow rejection sampling). @HanClinto, can you point me to some examples where the current llama.cpp grammars have performance issues?
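To make that distinction concrete, here is a rough, purely illustrative sketch (not llama.cpp's or llguidance's actual code), with a hypothetical accepts() predicate standing in for the grammar engine. Rejection-style sampling touches only the drawn token on the happy path, while the full-mask approach queries the grammar for every vocabulary token before sampling.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Hypothetical predicate standing in for a grammar engine: may the grammar
// accept this token next? One call is cheap; calling it for the whole
// vocabulary is the expensive "full mask" case.
using accepts_fn = std::function<bool(int token)>;

// Plain softmax sampling over raw logits (illustrative; no temperature/top-k).
// Assumes at least one logit is finite.
static int sample(const std::vector<float> & logits, std::mt19937 & rng) {
    std::vector<float> probs(logits.size());
    float max = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); i++) sum += probs[i] = std::exp(logits[i] - max);
    std::uniform_real_distribution<float> dist(0.0f, sum);
    float r = dist(rng);
    for (size_t i = 0; i < probs.size(); i++) { r -= probs[i]; if (r <= 0) return (int) i; }
    return (int) probs.size() - 1;
}

// Rejection style: sample unconstrained first; only if the drawn token is
// invalid, mask the whole vocabulary and sample again.
static int sample_rejection(std::vector<float> logits, const accepts_fn & accepts, std::mt19937 & rng) {
    int tok = sample(logits, rng);
    if (accepts(tok)) return tok;
    for (size_t i = 0; i < logits.size(); i++)
        if (!accepts((int) i)) logits[i] = -INFINITY;
    return sample(logits, rng);
}

// Full-mask style: query the grammar for every token up front, then sample once.
static int sample_full_mask(std::vector<float> logits, const accepts_fn & accepts, std::mt19937 & rng) {
    for (size_t i = 0; i < logits.size(); i++)
        if (!accepts((int) i)) logits[i] = -INFINITY;
    return sample(logits, rng);
}
```

Which approach costs more per step depends on how often the unconstrained sample is rejected and how expensive a single accepts() check is, which may be why the two engines look similar on easy grammars.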
#7810 is one issue where the stacks get so large that we SIGSEGV out (the GBNF is available here). But that's CFG, so I'm not sure how useful it would be to you. Sadly, while I've collected a number of particularly brutal GBNF grammars, I don't have any particularly juicy JSON schema specs for you to try. #7703 has some tricky bits re:

That said, I'm interested in learning how I can best help put this through its paces -- I'm excited by the option of having competing grammar engines in llama.cpp!
@ochafik, do you have any JSON schemas available that caused the GBNF engine to bog down?
I tried the grammar from #7810 converted to Lark, see xml-lark.txt. I needed to change ... I run it with ...
The generation with and without grammar is the same (35 t/s). The grammar without the TEXT optimization is quite slow (20 t/s) and eventually fails due to Earley sets growing too large. This is actually a nice test case - we can try to optimize it a bit, but mainly I just want it to fail at runtime instead of taking this long.
Very nice!!! Is there not support for hex characters in LLG grammars? It's relatively common in GBNF grammars to use \x00 for null (you can see it in use in json.gbnf as an example), or to use it to specify Unicode ranges. If I were looking to dive into llguidance, I suppose that feature might make a good first contribution.
Fascinating. I guess I don't know enough about the vagaries of grammar parsers, but I wonder if making this same change would let the GBNF parser successfully work through this grammar as well? Or is this a quirk of how llguidance handles things?
Very nice!! How does the speed compare if you make the same
Agreed. Fundamentally, I'm not sure that a stack-based parser is going to be able to parse a grammar like this (though neither do I think a precompiled grammar engine like Outlines would be able to handle it). I wonder if the only way to handle this would be an engine that uses backtracking (and so would have some wasteful token generation), but at least it would be able to work its way through it.

The other option I've considered for grammars that grow this large is to simply do random culling of the tree once it grows above a certain limit (possibly with some heuristic that prioritizes shorter / simpler stacks over longer / more complex ones). You would lose the advantage of generating across every conceivable valid path, but at least what you generate would still be grammatically correct, and it could actually finish without crashing. But that's an optimization for another day.

For now, I'm interested in seeing how LLG performance stacks up against the GBNF grammar engine (so far I'm really optimistic!) and, secondly, seeing what we can do to make the coupling / separation cleaner. One issue is the compilation / linking issue, and the second is how to specify LLG vs. GBNF. I actually like the way that you use "llg:" as a special prefix. The alternative would be to specify

A little bit of duct tape is acceptable (especially because I don't currently feel strongly about which path is best), but I would like us to be thinking ahead about the ideal scenario for incorporating multiple grammar engines alongside each other. I'm very interested in hearing others' thoughts on this topic.
I currently use JSON string parsing for string literals; it's easy to fix (it's just a lark frontend issue). You can use ... Created issue guidance-ai/llguidance#54.

TLDR: you can't do that TEXT trick with llama.cpp grammars.

Back in the 1970s, when computers were slow, people figured out that you can first deal with words (also called tokens (not to be confused with LLM tokens) or lexemes) and only then deal with syntax. This is because splitting text into words is cheaper than parsing it. And so regular expressions were used for "lexing" (splitting into words or lexemes), and context-free grammars were used for the higher-level parsing. Now, this is theoretically unnecessary, since regular languages are a subset of context-free languages. It's just that doing lexing first and parsing on top of the larger items happens to be quite a bit faster.

Llama.cpp grammars do not have this separation between lexing and parsing, which is definitely cleaner and easier to understand. However, while computers are much faster now, token masking is this specific problem where you have to do lots of parsing in a very short time. I don't think you can do it fast enough without a lexer. Also, typically the LLM tokens are somewhat aligned with lexemes, meaning that when you walk the prefix tree of all tokens, you do lexer operations 99.5+% of the time, and they are much cheaper. On the plus side, virtually all programming language definitions (including JSON) have this lexer/parser separation.

Oh, and BTW, if you have a lexer (and you mostly do), the size of the linked grammar is absolutely not a problem. We've been successfully running 4MB JSON schemas through LLG.
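A toy illustration of the lexer/parser split described above (this is not llguidance code, just a sketch): the lexer turns raw characters into a short stream of lexemes with cheap regular-language matching, and the parser then only has to reason about lexemes.

```cpp
#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

// Toy lexeme kinds for a tiny "list of numbers" language, e.g. [1, 23, 456]
enum class Lex { LBRACK, RBRACK, COMMA, NUMBER };

// Lexer: cheap character-level work (regular-language territory).
// The parser never sees individual characters.
static bool lex(const std::string & s, std::vector<Lex> & out) {
    size_t i = 0;
    while (i < s.size()) {
        char c = s[i];
        if      (isspace((unsigned char) c)) { i++; }
        else if (c == '[')                   { out.push_back(Lex::LBRACK); i++; }
        else if (c == ']')                   { out.push_back(Lex::RBRACK); i++; }
        else if (c == ',')                   { out.push_back(Lex::COMMA);  i++; }
        else if (isdigit((unsigned char) c)) { // NUMBER := [0-9]+
            while (i < s.size() && isdigit((unsigned char) s[i])) i++;
            out.push_back(Lex::NUMBER);
        } else return false;
    }
    return true;
}

// Parser: works on the (much shorter) lexeme stream.
// Grammar: list := LBRACK (NUMBER (COMMA NUMBER)*)? RBRACK
static bool parse_list(const std::vector<Lex> & toks) {
    size_t i = 0;
    if (i >= toks.size() || toks[i++] != Lex::LBRACK) return false;
    if (i < toks.size() && toks[i] == Lex::NUMBER) {
        i++;
        while (i + 1 < toks.size() && toks[i] == Lex::COMMA && toks[i + 1] == Lex::NUMBER) i += 2;
    }
    return i < toks.size() && toks[i++] == Lex::RBRACK && i == toks.size();
}

int main() {
    std::vector<Lex> toks;
    const std::string input = "[1, 23, 456]";
    printf("%s -> %s\n", input.c_str(),
           lex(input, toks) && parse_list(toks) ? "accepted" : "rejected");
}
```

The relevance to token masking is the point made above: when walking the prefix tree of all LLM tokens, the vast majority of steps stay inside a single lexeme and only exercise the cheap lexer-level machinery; the Earley parser is consulted only at lexeme boundaries.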
I really look forward to seeing this merged. So far I've used llama.cpp grammars and guidance, but I found it redundant that we have to convert EBNF grammars to guidance grammar-function syntax; with complex grammars this gets ugly quickly.
This is an experimental, very draft PR which adds support for the Rust-based llguidance constrained sampling library. This is mostly meant to elicit comments from users and maintainers on if and how it could be integrated.

llguidance provides features similar to llama.cpp grammars (full context-free grammar parsing and JSON schemas), however it takes a somewhat different approach to parsing: it splits the allowed-token-set (mask) computation between a lexer and a parser. The lexer uses derivatives of regular expressions, while the parser uses the Earley algorithm. Due to the lexer usage and lots of low-level optimizations, llguidance can compute the token mask for 100k tokens in about 1 ms for all typical JSON schemas and most other grammars as well. Just as with llama.cpp grammars, there is no significant pre-computation at startup.

llguidance can also "fast-forward" tokens; for example, in the case of a JSON schema, after `{"` is generated, the full key name (consisting of a few tokens) can be processed in a parallel prefill step. This is, however, not yet hooked up to llama.cpp in this patch. If you're interested, this is hooked up in Guidance via the llama.cpp Python bindings.

This patch adds `llama_sampler_init_llg()`, which takes two strings: the grammar type and the grammar. The following types are supported:

- `"regex"` - regular expressions (following Rust regex crate syntax)
- `"json"` or `"json_schema"` - a large subset of JSON schemas (but see issue)
- `"lark"` - context-free grammars in (a subset of) Lark format
- `"llguidance"` or `"guidance"` - internal (JSON-based) format

Supporting llama.cpp grammars as they currently are would be difficult, since they do not distinguish between lexer and parser. The `lark` format is conceptually very similar, though.
I also hacked `common/sampling.cpp` to recognize the `"llg:"` prefix in the grammar string; you can, for example, pass `"llg:regex:[A-Z ]+"`. I also hacked `common/json-schema-to-grammar.cpp` to just return `"llg:json_schema:" + schema`, so the `-j` option to `llama-cli` and the JSON mode options to `llama-server` use llguidance (when enabled).
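As a rough sketch of how this could be wired up from C/C++ (illustrative only; `llama_sampler_init_llg()` is shown below with just the two strings described above, but the actual signature in the patch may take additional arguments, e.g. a vocabulary handle):

```cpp
#include "llama.h"

// Illustrative sketch: build a sampler chain whose output is constrained by an
// llguidance regex grammar and finished with a distribution sampler.
// The llama_sampler_init_llg() call mirrors the two-string interface described
// in this PR; its exact signature may differ.
static llama_sampler * make_llg_sampler() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_llg("regex", "[A-Z ]+"));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    return chain;
}
```

From the command line, the same constraint should be reachable through the existing grammar string, e.g. passing "llg:regex:[A-Z ]+" wherever a grammar is normally given (for instance via --grammar), since `common/sampling.cpp` is patched to recognize the "llg:" prefix.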
Trying it out