
llama : save downloaded models to local cache #7252

Closed · ggerganov opened this issue May 13, 2024 · 8 comments
Labels: enhancement, examples, good first issue

ggerganov (Member) commented May 13, 2024

We've recently introduced the --hf-repo and --hf-file helper args to common in #6234:

ref #4735 #5501 #6085 #6098

Sample usage:

./bin/main \
  --hf-repo TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF \
  --hf-file ggml-model-q4_0.gguf \
  -m tinyllama-1.1-v0.2-q4_0.gguf \
  -p "I believe the meaning of life is" -n 32

./bin/main \
  --hf-repo TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  -m tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  -p "I believe the meaning of life is" -n 32

The first invocation downloads `https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf` and saves it to `tinyllama-1.1-v0.2-q4_0.gguf`.

Requires a build with `LLAMA_CURL` enabled.

Currently, the files downloaded via curl are stored at a destination based on the --model CLI arg.

If --model is not provided, we would like to auto-store the downloaded model files in a local cache, similar to what other frameworks like HF/transformers do.

Here is the documentation of this functionality in HF for convenience and reference:

URL: https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup

### Cache setup

Pretrained models are downloaded and locally cached at: ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is given by C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:

1. Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
2. Shell environment variable: HF_HOME.
3. Shell environment variable: XDG_CACHE_HOME + /huggingface.

🤗 Transformers will use the shell environment variables PYTORCH_TRANSFORMERS_CACHE or PYTORCH_PRETRAINED_BERT_CACHE if you are coming from an earlier iteration of this library and have set those environment variables, unless you specify the shell environment variable TRANSFORMERS_CACHE.

The goal of this issue is to implement similar functionality in llama.cpp. The environment variables should be named according to llama.cpp conventions, and the local cache should be used only when the --model CLI argument is not explicitly provided in commands like main and server.
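
For illustration, here is a minimal sketch of what such a lookup order could look like; the LLAMA_CACHE variable name and the ~/.cache/llama.cpp fallback are assumptions for the sketch, not settled naming:

#include <cstdlib>
#include <string>

// Resolve the directory used to store downloaded models, checked in priority
// order (sketch only - all names below are placeholders, not decided yet):
//   1. LLAMA_CACHE      - explicit override
//   2. XDG_CACHE_HOME   - XDG base directory, plus a "llama.cpp" subfolder
//   3. HOME             - fallback to ~/.cache/llama.cpp
static std::string get_cache_directory() {
    if (const char * env = std::getenv("LLAMA_CACHE")) {
        return env;
    }
    if (const char * env = std::getenv("XDG_CACHE_HOME")) {
        return std::string(env) + "/llama.cpp";
    }
    if (const char * env = std::getenv("HOME")) {
        return std::string(env) + "/.cache/llama.cpp";
    }
    return ".cache"; // last resort: relative to the current working directory
}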

P.S. I'm interested in exercising "Copilot Workspace" to see if it would be capable of implementing this task by itself.

P.S.2: So CW is quite useless at this point for llama.cpp - it cannot handle files with a few thousand lines of code:

CW snapshot: https://copilot-workspace.githubnext.com/ggerganov/llama.cpp/issues/7252?shareId=379fdaa0-3580-46ba-be68-cb061518a38c

julien-c (Contributor) commented:

FWIW, the HF cache layout is quite nice and it is git-aware: @LysandreJik and I implemented it a while ago and it has been working well.

For instance, this is the layout for one model repo with two revisions and two files inside of it:

    [  96]  .
    └── [ 160]  models--julien-c--EsperBERTo-small
        ├── [ 160]  blobs
        │   ├── [321M]  403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
        │   ├── [ 398]  7cb18dc9bafbfcf74629a4b760af1b160957a83e
        │   └── [1.4K]  d7edf6bd2a681fb0175f7735299831ee1b22b812
        ├── [  96]  refs
        │   └── [  40]  main
        └── [ 128]  snapshots
            ├── [ 128]  2439f60ef33a0d46d85da5001d52aeda5b00ce9f
            │   ├── [  52]  README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
            │   └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
            └── [ 128]  bbc77c8132af1cc5cf678da3f1ddf2de43606d48
                ├── [  52]  README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
                └── [  76]  pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
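
For reference, a rough sketch of how a client could resolve a file in this layout; the helper below is hypothetical and only mirrors the directory naming shown above:

#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Given e.g. ("julien-c", "EsperBERTo-small", "main", "README.md"), build the
// path inside the HF-style cache: refs/<ref> stores the snapshot commit, and
// snapshots/<commit>/<file> is a symlink into blobs/.
static fs::path resolve_cached_file(const fs::path & cache_root,
                                    const std::string & org,
                                    const std::string & repo,
                                    const std::string & ref,
                                    const std::string & file) {
    const fs::path repo_dir = cache_root / ("models--" + org + "--" + repo);

    std::string commit;
    std::ifstream(repo_dir / "refs" / ref) >> commit;   // e.g. "bbc77c8132af..."

    return repo_dir / "snapshots" / commit / file;      // symlink -> ../../blobs/<hash>
}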

ngxson (Collaborator) commented May 13, 2024

We can probably take advantage of the Hub API. For example, to list all files in a repo: https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main

This could potentially remove the need for --hf-file and for etag checking.
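
As a sketch of what that would involve (llama.cpp already depends on libcurl for the --hf-repo download path; the JSON parsing of the response is left out here):

#include <curl/curl.h>
#include <cstdio>
#include <string>

// Append the response body into a std::string.
static size_t collect(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    // Lists all files in the repo; gated repos would additionally need an
    // Authorization header with a user token.
    const char * url = "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main";

    std::string body;
    CURL * curl = curl_easy_init();
    if (!curl) {
        return 1;
    }
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    const CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);

    if (res != CURLE_OK) {
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
        return 1;
    }
    printf("%s\n", body.c_str()); // JSON array of files; pick the wanted .gguf from it
    return 0;
}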

amirzia (Contributor) commented May 17, 2024

Hi, this is my first contribution to this project.

I made a PR with a basic implementation of the cache mechanism. The downloaded files are stored in the directory specified by the LLAMA_CACHE env variable. If the env variable is not provided, the models are stored in the default cache directory: .cache/.

Let me know if I'm going in the right direction.

ggerganov (Member, Author) commented:

@amirzia I think the proposed changes are good - pretty much what I imagined as a first step.

I'm not sure what the benefits of having a git-aware cache similar to HF's are, but if we think there are reasonable advantages, we can work on that to improve the functionality further. Maybe for now it's fine to merge the PR as it is.

julien-c (Contributor) commented:

Organic community demand for a shared cache between all local ML apps: https://x.com/filipviz/status/1792981186446274625

amirzia (Contributor) commented May 22, 2024

Should we agree on a common standard (layout and path)?

There is already this proposal for a standard path: https://filip.world/post/modelpath/. We also have the HF git-aware layout (which Julien seems to really like 😄).

Although I'm not sure if llama.cpp and other applications benefit from having the history of models.

ggerganov (Member, Author) commented:

Ah, I see now. A shared location seems reasonable so that different apps can share the same model data.

Although I'm not sure if llama.cpp and other applications benefit from having the history of models.

I also don't think that llama.cpp has use cases for the git-aware structure, and it might not be trivial to implement in C++. Filesystem operations are a real pain in C++.
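
A minimal sketch of the directory creation involved, assuming C++17's std::filesystem is available:

#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <system_error>

int main() {
    namespace fs = std::filesystem;

    const char * home = std::getenv("HOME");
    const fs::path cache_dir = fs::path(home ? home : ".") / ".cache" / "llama.cpp";

    // Create the (possibly nested) cache directory if it does not exist yet.
    std::error_code ec;
    fs::create_directories(cache_dir, ec);
    if (ec) {
        fprintf(stderr, "failed to create %s: %s\n", cache_dir.string().c_str(), ec.message().c_str());
        return 1;
    }
    printf("cache directory: %s\n", cache_dir.string().c_str());
    return 0;
}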

ngxson (Collaborator) commented Dec 13, 2024

I'm closing this issue since it has already been implemented.
