
Granite three support #608

Open

gabe-l-hart wants to merge 3 commits into main from GraniteThreeSupport

Conversation

@gabe-l-hart commented Nov 4, 2024

Description

This PR adds support for the "granite" and "granitemoe" architectures in order to support IBM's Granite 3.0. The changes mirror those added in llama.cpp upstream (ggerganov/llama.cpp#9412 and ggerganov/llama.cpp#9438).

These models are currently available via HuggingFace and Ollama.
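For example, fetching a quantized build could look like this (the Ollama tag and HF repo below are illustrative placeholders, not official names from this PR):

# Pull a community quantization via Ollama (tag name assumed)
ollama pull granite3-dense:2b

# Or download a community GGUF from HuggingFace (repo name is a placeholder)
huggingface-cli download <user>/granite-3.0-2b-instruct-GGUF \
    granite-3.0-2b-instruct.Q4_K_M.gguf --local-dir .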

Testing

I did my development on a Mac M3 without gmake natively installed. To avoid a system-level install, I wrapped my dev environment in Docker with the following two scripts:

build_dockerized.sh
#!/usr/bin/env bash

# Run from the directory this script lives in (the repo root)
cd "$(dirname "${BASH_SOURCE[0]}")"

# Build the builder image, then drop into an interactive shell inside it
# with the repo mounted at /src and local models at /models
docker buildx build . -t llamafile-builder:latest --load
docker run --rm -it --entrypoint bash -w /src \
    -v "$PWD:/src" -v "$HOME/models:/models" \
    llamafile-builder:latest
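The Dockerfile referenced by the build above isn't shown here; a minimal sketch of what the builder image might contain (the base image and package list are my assumptions, not part of this PR) is:

# Hypothetical builder image: just enough to fetch the cosmocc toolchain
# and run the llamafile Makefile
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        make curl ca-certificates unzip zip \
    && rm -rf /var/lib/apt/lists/*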
build_in_docker.sh
#!/usr/bin/env bash

# Usage: build_in_docker.sh <gguf_file> [model_name]
gguf_file="$1"
if [ $# -ge 2 ]
then
    model_name="$2"
else
    model_name="$(basename "$gguf_file" | cut -d'.' -f 1)"
fi
echo "Model Name: $model_name"

# Build (NOTE: First build may fail due to the need to download tools)
make -j || make -j

# Install the built binaries
make install PREFIX=/usr/local

# Make a temp dir to work in
start_dir="$PWD"
temp_dir="$(mktemp -d)"
cd "$temp_dir"

# Copy over the model and base binary
echo "Copying source materials..."
cp "$gguf_file" .
cp "$(which llamafile)" "$model_name.llamafile"

# Make the .args file (the trailing "..." is llamafile's placeholder that
# lets extra command-line arguments pass through at runtime)
echo "Making .args file..."
echo "-m
$(basename "$gguf_file")
--host
0.0.0.0
-ngl
9999
..." > .args

# Pack it all together
echo "Packing with zipalign..."
zipalign -j0 "$model_name.llamafile" "$(basename "$gguf_file")" .args

# Move it back to the root dir
mv "$model_name.llamafile" "$start_dir/"
echo "DONE"

With these scripts, my workflow was:

  1. Download pre-quantized versions of the models (e.g. ollama pull, then grab the $HOME/.ollama/models/blobs/... blob for the GGUF file; see the sketch after this list)
    • NOTE: IBM does not currently host official quantized versions, but there are many community quantizations available on HF (dense, moe)
  2. Launch the docker build shell (./build_dockerized.sh)
  3. Build the llamafile inside (./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b)
  4. Run the llamafile outside the docker shell (./granite3-dense-2b.llamafile -p "tell me a story")
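For step 1, finding which blob holds the GGUF can be scripted by reading the pulled model's manifest. A rough sketch, assuming a default Ollama install with jq available (the model/tag names and on-disk layout are assumptions about current Ollama, not anything in this PR):

#!/usr/bin/env bash
# Print the path of the GGUF weights blob for a pulled Ollama model
model=granite3-dense
tag=2b
manifest="$HOME/.ollama/models/manifests/registry.ollama.ai/library/$model/$tag"

# The weights layer has mediaType "application/vnd.ollama.image.model";
# its digest (sha256:<hex>) maps to a blob file named sha256-<hex>
digest="$(jq -r '.layers[] | select(.mediaType == "application/vnd.ollama.image.model") | .digest' "$manifest")"
echo "$HOME/.ollama/models/blobs/${digest/:/-}"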

Open Questions

Solved! I found the PR that landed in llama.cpp after mine, updating the chat template to support "granite": ggerganov/llama.cpp#10013

When running in interactive mode, the chat template seems to use special tokens different from those defined in the chat_template metadata in the GGUF file. I haven't dug in enough yet to understand whether this is something that can be pulled automatically from the GGUF, or whether there's an additional place where the Granite architectures will need to explicitly indicate their chat templates.
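For anyone poking at this, the template actually baked into the GGUF can be dumped with the gguf-dump tool from llama.cpp's gguf Python package (the tool and metadata key come from upstream gguf-py, not from this PR):

# Install the gguf package, then look for the embedded Jinja template
# stored under the metadata key "tokenizer.chat_template"
pip install gguf
gguf-dump /models/granite-3.0-2b-instruct.Q4_K_M.gguf | grep chat_template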

@DK013 commented Nov 6, 2024

I was waiting for this. Thanks a lot for your hard work, mate @gabe-l-hart

@BradHutchings

Thanks for doing this @gabe-l-hart. And thanks for the link @DK013. I appreciate you both!

-Brad

@BradHutchings

I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!

@gabe-l-hart (Author)

Hi @jart! I wanted to check in and see if this PR is something you would consider for upstream merging. I see that you use llama.cpp/README.llamafile to track the version of llama.cpp being used and the list of local modifications on top. I didn't see a clean way to re-bump the commit and apply those deltas, but I'd be happy to re-do this change set to be a full llama.cpp bump if that's preferred.

@DK013 commented Nov 24, 2024

> I did my own llamafile build with this branch and was able to use IBM Granite 3.0 8B Instruct. Thank you again @gabe-l-hart!

I have been wanting to try it but haven't had enough time to sit down and resolve the errors on a Windows machine. @BradHutchings, would you mind sharing your build so I can run some tests as well? Thanks in advance!

@pawel665j commented Nov 24, 2024 via email

@BradHutchings

@DK013 My llamafile builds are here: https://huggingface.co/bradhutchings/DemoMachine-LLMs

Commits (3)

1. This is a port of the work done in llama.cpp directly:
   ggerganov/llama.cpp#9412

   Branch: GraniteThreeSupport

   Signed-off-by: Gabe Goodhart <[email protected]>

2. This is a port of the work done in llama.cpp directly:
   ggerganov/llama.cpp#9438

   Branch: GraniteThreeSupport

   Signed-off-by: Gabe Goodhart <[email protected]>

3. This is a port of the work done in llama.cpp with a slight tweak for the
   tool call response:
   ggerganov/llama.cpp#10013

   Branch: GraniteThreeSupport

   Signed-off-by: Gabe Goodhart <[email protected]>