support for llguidance grammars #10224
base: master
Conversation
Awesome. Is this a llama.cpp implementation of dottxt-ai/outlines?
No. The approach of llguidance is more like the current llama.cpp grammars, in that both compute the token mask on the fly. Outlines pre-computes the token masks for all states of the automaton resulting from compiling the constraint. This limits the expressiveness of constraints and has high startup costs (though of course near-zero sampling costs). I added some notes on this to the llguidance readme.
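To illustrate the difference, here is a toy, purely hypothetical sketch (not the API of Outlines, llguidance, or llama.cpp): an Outlines-style engine precomputes the allowed-token mask for every state of the compiled automaton at startup, while an on-the-fly engine computes the mask only for the state the generation is currently in.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy character-level automaton standing in for a compiled constraint.
// advance() returns -1 when the byte is not allowed in the given state.
struct ToyAutomaton {
    int num_states = 2;
    int advance(int state, char c) const {
        // Accepts strings of the toy pattern 'a'* 'b' 'c'*
        if (state == 0) return c == 'a' ? 0 : c == 'b' ? 1 : -1;
        if (state == 1) return c == 'c' ? 1 : -1;
        return -1;
    }
};

// A token is allowed in a state if the automaton can consume all of its bytes.
static bool token_allowed(const ToyAutomaton & a, int state, const std::string & tok) {
    for (char c : tok) {
        state = a.advance(state, c);
        if (state < 0) return false;
    }
    return true;
}

// Precomputed style: build a mask for every state up front.
// High startup cost, near-zero per-step cost.
static std::vector<std::vector<bool>> precompute_masks(
        const ToyAutomaton & a, const std::vector<std::string> & vocab) {
    std::vector<std::vector<bool>> masks(a.num_states, std::vector<bool>(vocab.size()));
    for (int s = 0; s < a.num_states; s++)
        for (size_t t = 0; t < vocab.size(); t++)
            masks[s][t] = token_allowed(a, s, vocab[t]);
    return masks;
}

// On-the-fly style: compute the mask only for the current state at each step.
// No startup cost, but work is done at every sampling step.
static std::vector<bool> mask_on_the_fly(
        const ToyAutomaton & a, int state, const std::vector<std::string> & vocab) {
    std::vector<bool> mask(vocab.size());
    for (size_t t = 0; t < vocab.size(); t++)
        mask[t] = token_allowed(a, state, vocab[t]);
    return mask;
}
```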
Really excited to see this -- great work!
Out of curiosity, how does llguidance compare to llama.cpp's current grammar engine? Current JSON schema -> GBNF limitations are documented here: https://github.com/ggerganov/llama.cpp/tree/master/grammars#json-schemas--gbnf

I would love to do some speed tests with similarly complex grammars to see how llguidance compares against our stock implementation. Have you happened to run any of those tests yourself?
I believe llguidance JSON schema coverage will be more comprehensive once guidance-ai/llguidance#48 is merged; we should have more data on this soon. As for performance, on the example I ran there doesn't seem to be much difference between llama.cpp grammars, llguidance, and unconstrained generation. I think this is because llama.cpp grammars use rejection sampling (resampling), while llguidance currently always computes the full mask (though I want to also allow rejection sampling). @HanClinto, can you point me to some examples where the current llama.cpp grammars have performance issues?
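To make that distinction concrete, here is a rough, purely illustrative sketch (not llama.cpp's or llguidance's actual code), with a hypothetical accepts() predicate standing in for the grammar engine. Rejection-style sampling touches only the drawn token on the happy path, while the full-mask approach queries the grammar for every vocabulary token before sampling.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Hypothetical predicate standing in for a grammar engine: may the grammar
// accept this token next? One call is cheap; calling it for the whole
// vocabulary is the expensive "full mask" case.
using accepts_fn = std::function<bool(int token)>;

// Plain softmax sampling over raw logits (illustrative; no temperature/top-k).
// Assumes at least one logit is finite.
static int sample(const std::vector<float> & logits, std::mt19937 & rng) {
    std::vector<float> probs(logits.size());
    float max = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); i++) sum += probs[i] = std::exp(logits[i] - max);
    std::uniform_real_distribution<float> dist(0.0f, sum);
    float r = dist(rng);
    for (size_t i = 0; i < probs.size(); i++) { r -= probs[i]; if (r <= 0) return (int) i; }
    return (int) probs.size() - 1;
}

// Rejection style: sample unconstrained first; only if the drawn token is
// invalid, mask the whole vocabulary and sample again.
static int sample_rejection(std::vector<float> logits, const accepts_fn & accepts, std::mt19937 & rng) {
    int tok = sample(logits, rng);
    if (accepts(tok)) return tok;
    for (size_t i = 0; i < logits.size(); i++)
        if (!accepts((int) i)) logits[i] = -INFINITY;
    return sample(logits, rng);
}

// Full-mask style: query the grammar for every token up front, then sample once.
static int sample_full_mask(std::vector<float> logits, const accepts_fn & accepts, std::mt19937 & rng) {
    for (size_t i = 0; i < logits.size(); i++)
        if (!accepts((int) i)) logits[i] = -INFINITY;
    return sample(logits, rng);
}
```

Which approach costs more per step depends on how often the unconstrained sample is rejected and how expensive a single accepts() check is, which may be why the two engines look similar on easy grammars.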
#7810 is one issue where the stacks get so large that we SIGSEGV out (the GBNF is available here). But that's CFG, so I'm not sure how useful it would be to you. Sadly, while I've collected a number of particularly brutal GBNF grammars, I don't have any particularly juicy JSON schema specs for you to try. #7703 has some tricky bits re:

That said, I'm interested in learning how I can best help put this through its paces -- I'm excited by the option of having competing grammar engines in llama.cpp!
@ochafik, do you have any JSON schemas available that caused the GBNF engine to bog down?
I tried the grammar from #7810 converted to Lark, see xml-lark.txt. I needed to change ... I run it with ...
The generation with and without grammar is the same (35 t/s). The grammar without the TEXT optimization is quite slow (20 t/s) and eventually fails due to Earley sets growing too large. This is actually a nice test case - we can try to optimize it a bit, but mainly I just want it to fail at runtime instead of taking this long.
Very nice!!! Is there not support for hex characters in LLG grammars? It's relatively common in GBNF grammars to use \x00 for null (you can see it in use in json.gbnf as an example), or to use it to specify Unicode ranges. If I were looking to dive into llguidance, I suppose that feature might make a good first contribution.
Fascinating. I guess I don't know enough about the vagaries of grammar parsers, but I wonder if making this same change would let the GBNF parser successfully work through this grammar as well? Or is this a quirk of how llguidance handles things?
Very nice!! How does the speed compare if you make the same
Agreed. Fundamentally, I'm not sure that a stack-based parser is going to be able to parse a grammar like this (though neither do I think a precompiled grammar engine like Outlines would be able to handle it). I wonder if the only way to handle this would be an engine that uses backtracking (and so would have some wasteful token generation), but at least it would be able to work its way through it.

The other option I've considered for grammars that grow this large is to simply do random culling of the tree once it grows above a certain limit (possibly with some heuristic that prioritizes shorter / simpler stacks over longer / more complex ones). You would lose the advantage of generating across every conceivable valid path, but at least what you generate would still be grammatically correct, and it could actually finish without crashing. But that's an optimization for another day.

For now, I'm interested in seeing how LLG performance stacks up against the GBNF grammar engine (so far I'm really optimistic!) and, secondly, seeing what we can do to make the coupling / separation cleaner. One issue is the compilation / linking issue, and the second is how to specify LLG vs. GBNF. I actually like the way that you use "llg:" as a special prefix. The alternative would be to specify

A little bit of duct tape is acceptable (especially because I don't currently feel strongly about which path is best), but I would like us to be thinking ahead about the ideal scenario for incorporating multiple grammar engines alongside each other. I'm very interested in hearing others' thoughts on this topic.
I currently use JSON string parsing for string literals; it's easy to fix (it's just a lark frontend issue). You can use ... Created issue guidance-ai/llguidance#54.

TLDR: you can't do that TEXT trick with llama.cpp grammars.

Back in the 1970s, when computers were slow, people figured out that you can first deal with words (also called tokens (not to be confused with LLM tokens) or lexemes) and only then deal with syntax. This is because splitting text into words is cheaper than parsing it. And so regular expressions were used for "lexing" (splitting into words or lexemes), and context-free grammars were used for the higher-level parsing. Now, this is theoretically unnecessary, since regular languages are a subset of context-free languages. It's just that doing lexing first and parsing on top of the larger items happens to be quite a bit faster.

Llama.cpp grammars do not have this separation between lexing and parsing, which is definitely cleaner and easier to understand. However, while computers are much faster now, token masking is this specific problem where you have to do lots of parsing in a very short time. I don't think you can do it fast enough without a lexer. Also, typically the LLM tokens are somewhat aligned with lexemes, meaning that when you walk the prefix tree of all tokens, you do lexer operations 99.5+% of the time, and they are much cheaper. On the plus side, virtually all programming language definitions (including JSON) have this lexer/parser separation.

Oh, and BTW, if you have a lexer (and you mostly do), the size of the linked grammar is absolutely not a problem. We've been successfully running 4MB JSON schemas through LLG.
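A toy illustration of the lexer/parser split described above (this is not llguidance code, just a sketch): the lexer turns raw characters into a short stream of lexemes with cheap regular-language matching, and the parser then only has to reason about lexemes.

```cpp
#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

// Toy lexeme kinds for a tiny "list of numbers" language, e.g. [1, 23, 456]
enum class Lex { LBRACK, RBRACK, COMMA, NUMBER };

// Lexer: cheap character-level work (regular-language territory).
// The parser never sees individual characters.
static bool lex(const std::string & s, std::vector<Lex> & out) {
    size_t i = 0;
    while (i < s.size()) {
        char c = s[i];
        if      (isspace((unsigned char) c)) { i++; }
        else if (c == '[')                   { out.push_back(Lex::LBRACK); i++; }
        else if (c == ']')                   { out.push_back(Lex::RBRACK); i++; }
        else if (c == ',')                   { out.push_back(Lex::COMMA);  i++; }
        else if (isdigit((unsigned char) c)) { // NUMBER := [0-9]+
            while (i < s.size() && isdigit((unsigned char) s[i])) i++;
            out.push_back(Lex::NUMBER);
        } else return false;
    }
    return true;
}

// Parser: works on the (much shorter) lexeme stream.
// Grammar: list := LBRACK (NUMBER (COMMA NUMBER)*)? RBRACK
static bool parse_list(const std::vector<Lex> & toks) {
    size_t i = 0;
    if (i >= toks.size() || toks[i++] != Lex::LBRACK) return false;
    if (i < toks.size() && toks[i] == Lex::NUMBER) {
        i++;
        while (i + 1 < toks.size() && toks[i] == Lex::COMMA && toks[i + 1] == Lex::NUMBER) i += 2;
    }
    return i < toks.size() && toks[i++] == Lex::RBRACK && i == toks.size();
}

int main() {
    std::vector<Lex> toks;
    const std::string input = "[1, 23, 456]";
    printf("%s -> %s\n", input.c_str(),
           lex(input, toks) && parse_list(toks) ? "accepted" : "rejected");
}
```

The relevance to token masking is the point made above: when walking the prefix tree of all LLM tokens, the vast majority of steps stay inside a single lexeme and only exercise the cheap lexer-level machinery; the Earley parser is consulted only at lexeme boundaries.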
I really look forward to seeing this merged. So far I've used llama.cpp grammars and guidance, but I found it redundant that we have to convert EBNF grammars to guidance grammar-function syntax; with complex grammars this gets ugly quickly.
This is an experimental, very draft PR which adds support for the Rust-based llguidance constrained sampling library. This is mostly meant to elicit comments from users and maintainers on if and how it could be integrated.

llguidance provides features similar to llama.cpp grammars (full context-free grammar parsing and JSON schemas), however it takes a somewhat different approach to parsing: it splits the allowed-token-set (mask) computation between a lexer and a parser. The lexer uses derivatives of regular expressions, while the parser uses the Earley algorithm. Due to the lexer usage and lots of low-level optimizations, llguidance can compute the token mask for 100k tokens in about 1 ms for all typical JSON schemas and most other grammars as well. Just as with llama.cpp grammars, there is no significant pre-computation at startup.

llguidance can also "fast-forward" tokens; for example, in the case of a JSON schema, after `{"` is generated, the full key name (consisting of a few tokens) can be processed in a parallel prefill step. This is, however, not yet hooked up to llama.cpp in this patch. If you're interested, this is hooked up in Guidance via the llama.cpp Python bindings.

This patch adds `llama_sampler_init_llg()`, which takes two strings: the grammar type and the grammar. The following types are supported:

- `"regex"` - regular expressions (following Rust regex crate syntax)
- `"json"` or `"json_schema"` - a large subset of JSON schemas (but see issue)
- `"lark"` - context-free grammars in (a subset of) Lark format
- `"llguidance"` or `"guidance"` - internal (JSON-based) format

Supporting llama.cpp grammars as they currently are would be difficult, since they do not distinguish between lexer and parser. The `lark` format is conceptually very similar, though.
I also hacked `common/sampling.cpp` to recognize the `"llg:"` prefix in the grammar string; you can, for example, pass `"llg:regex:[A-Z ]+"`. I also hacked `common/json-schema-to-grammar.cpp` to just return `"llg:json_schema:" + schema`, so the `-j` option to `llama-cli` and the JSON mode options to `llama-server` use llguidance (when enabled).
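As a rough sketch of how this could be wired up from C/C++ (illustrative only; `llama_sampler_init_llg()` is shown below with just the two strings described above, but the actual signature in the patch may take additional arguments, e.g. a vocabulary handle):

```cpp
#include "llama.h"

// Illustrative sketch: build a sampler chain whose output is constrained by an
// llguidance regex grammar and finished with a distribution sampler.
// The llama_sampler_init_llg() call mirrors the two-string interface described
// in this PR; its exact signature may differ.
static llama_sampler * make_llg_sampler() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_llg("regex", "[A-Z ]+"));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    return chain;
}
```

From the command line, the same constraint should be reachable through the existing grammar string, e.g. passing "llg:regex:[A-Z ]+" wherever a grammar is normally given (for instance via --grammar), since `common/sampling.cpp` is patched to recognize the "llg:" prefix.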
Trying it out