Is your feature request related to a problem? Please describe.
I have a dataset that I'm using for RAG. The user's question is used to look up the top N most relevant documents, and a prompt is then built that looks like:
<system>
You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>
<user>
Document 1: Some information that is relevant
Document 2: Other information
Document 3: Final information
User question: "How do I do <y>?"<eot>
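For concreteness, a prompt of that shape might be assembled with something like the snippet below (the build_prompt helper and the literal tag strings are just illustrative, not part of any library):

```python
def build_prompt(system_text: str, documents: list[str], question: str) -> str:
    # Illustrative only: reproduces the prompt shape shown above.
    doc_lines = "\n".join(
        f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        f"<system>\n{system_text}<eot>\n"
        f"<user>\n{doc_lines}\n"
        f'User question: "{question}"<eot>'
    )


prompt = build_prompt(
    "You are an assistant for domain <x>, you summarize information and blah blah blah.",
    ["Some information that is relevant", "Other information", "Final information"],
    "How do I do <y>?",
)
```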
In order to minimize latency, I've developed a "static" disk cache that contains the system prompt + the first document as context, for every document in my dataset. (An example script for doing this, albeit an old one, is also in my branch.)
This way, I only need to ingest the remaining documents + user question when doing prompt processing, so I save a lot of time in time-to-first-token for this use case.
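Concretely, the intended workflow is roughly what's sketched below. Llama() and set_cache() are the existing llama-cpp-python API; the LlamaStaticDiskCache import path, the build_cache() factory name, and the constructor signature are placeholders for what's in my branch, not settled names.

```python
from llama_cpp import Llama
# Placeholder import path: LlamaStaticDiskCache lives in my branch, and the
# build_cache()/constructor names below are illustrative, not final.
from llama_cpp.llama_cache import LlamaStaticDiskCache

SYSTEM_PROMPT = "<system>\nYou are an assistant for domain <x>, ...<eot>\n<user>\n"
documents = ["Some information that is relevant", "Other information", "Final information"]

llm = Llama(model_path="model.gguf", n_ctx=4096)

# Build time (offline, once per model + context size + batch size): evaluate
# "system prompt + first document" for every document and persist the
# resulting states to disk.
prefixes = [SYSTEM_PROMPT + f"Document 1: {doc}\n" for doc in documents]
cache = LlamaStaticDiskCache.build_cache(llm, prefixes, cache_dir="prefix_cache")

# Inference time: reload the cache and attach it. Any prompt whose tokenized
# prefix matches a stored entry skips re-ingesting that prefix, which is
# where the time-to-first-token saving comes from.
llm.set_cache(LlamaStaticDiskCache(cache_dir="prefix_cache"))
```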
Describe the solution you'd like
I'd like to upstream the LlamaStaticDiskCache class in my branch. It's very similar to the existing LlamaDiskCache, but:
- The cache is not mutable once built (it does not pop entries in __getitem__)
- It uses a trie to find the longest matching prefix (if any) in the cache
- It has a convenience factory method for building the cache from a list of prompts
So it's well-suited for use cases where you want to build the cache once (for a given model + context size + batch size) and then reload at inference time based on matching the prefix of the prompt.
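The trie lookup is the part that differs most from LlamaDiskCache's behaviour. A minimal, self-contained sketch of longest-prefix matching over token sequences (independent of my actual implementation) looks like this:

```python
from typing import Optional, Sequence


class TokenPrefixTrie:
    """Maps stored token-sequence prefixes to cache keys. A lookup returns
    the key for the longest stored prefix of the query, if any."""

    def __init__(self) -> None:
        self._root: dict = {}

    def insert(self, tokens: Sequence[int], key: str) -> None:
        node = self._root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["_key"] = key  # terminal marker: this node ends a stored prefix

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[str]:
        node, best = self._root, None
        for tok in tokens:
            if "_key" in node:
                best = node["_key"]
            if tok not in node:
                break
            node = node[tok]
        else:  # consumed the whole query without falling off the trie
            if "_key" in node:
                best = node["_key"]
        return best


# The second prompt extends the first prompt's tokens, so its lookup hits
# the entry stored for that prefix.
trie = TokenPrefixTrie()
trie.insert([1, 15, 22, 8], key="system+doc0")
assert trie.longest_prefix([1, 15, 22, 8, 99, 100]) == "system+doc0"
assert trie.longest_prefix([2, 3]) is None
```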
Complications / details with this approach
I've found that when running locally (macOS + Metal GPU) and deploying on different hardware (Linux + CPU), I have had to make a minor change to llama.cpp to avoid serializing / deserializing the RNG state.
I.e., skip loading it and set the seed for reproducibility: tc-wolf/llama.cpp@ea43d92
I don't think that this will be a factor anymore because ggml-org/llama.cpp#9294 has removed serializing / deserializing the RNG when saving.
Describe alternatives you've considered
- Use lower-level state-saving functions (rather than pickling the output of llama.save_state()) to store less on disk than a full model file per entry.
- Use a more efficient strategy for saving: right now, if every key shares the same system prompt (for example), that prefix is saved independently for every stored prompt. A lot of space could be saved by deduplicating and only saving each shared prefix once, but it complicates the saving/loading logic (see the sketch after this list).
- Allow partial matches when checking the cache: right now a key has to be a full prefix of the input tokens, but one could look for a partial match to allow for more graceful fallback. This also complicates the __getitem__ logic.
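On the dedup point above, the general direction would be content-addressed storage of the shared prefix state, something like the sketch below. This only illustrates the on-disk layout; whether (and where) the serialized llama.cpp state can actually be split into a shared-prefix part and a per-entry remainder is exactly the complication mentioned.

```python
import hashlib
import pickle
from pathlib import Path


def store_entry(cache_dir: Path, prefix_state: bytes, suffix_state: bytes, entry_id: str) -> None:
    """Write the shared prefix blob once (content-addressed by hash) and have
    each entry reference it, instead of re-serializing it per stored prompt."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    prefix_hash = hashlib.sha256(prefix_state).hexdigest()
    prefix_path = cache_dir / f"{prefix_hash}.blob"
    if not prefix_path.exists():  # dedupe: each distinct prefix is written at most once
        prefix_path.write_bytes(prefix_state)
    (cache_dir / f"{entry_id}.entry").write_bytes(
        pickle.dumps({"prefix": prefix_hash, "suffix": suffix_state})
    )


def load_entry(cache_dir: Path, entry_id: str) -> tuple[bytes, bytes]:
    meta = pickle.loads((cache_dir / f"{entry_id}.entry").read_bytes())
    return (cache_dir / f"{meta['prefix']}.blob").read_bytes(), meta["suffix"]
```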