Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compilade/bitnet ternary #294

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
bd80749
ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b
compilade Jun 19, 2024
7ef4254
ggml-quants : faster 1.625 bpw AVX2 vec_dot
compilade Jun 19, 2024
48b73b8
ggml-quants : substract 1 when back in epi8
compilade Jun 19, 2024
ef1e345
ggml-quants : Q2_2 now faster than Q4_K on with AVX2
compilade Jun 20, 2024
638ad52
ggml-quants : cleanup Q1_3 code formatting
compilade Jun 23, 2024
9465ec6
ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
compilade Jun 25, 2024
89dc3b2
ggml-quants : use ceiling division when quantizing q1_3
compilade Jun 26, 2024
961e293
convert-hf : simplify BitNet pre-quantization
compilade Jun 26, 2024
0996149
convert-hf : allow converting the weird BitNet 1.3B
compilade Jun 27, 2024
bfd2f21
bitnet : replace 1.58b with b1.58, as in the paper
compilade Jun 29, 2024
ec50944
ggml-quants : fix build failure on Windows
compilade Jun 29, 2024
8fbd593
ggml-quants : attempt to fix Arm 32-bit support
compilade Jun 29, 2024
dd3e62a
ggml : add some informative comments in q1_3 vec_dot
compilade Jul 29, 2024
79a278e
Merge branch 'master' into compilade/bitnet-ternary
compilade Jul 29, 2024
77b8f84
ggml : add TQ1_0 and TQ2_0 ternary quantization types
compilade Jul 30, 2024
560873f
ggml : even faster TQ2_0
compilade Jul 31, 2024
e971957
ggml : also faster TQ1_0
compilade Jul 31, 2024
a6dd699
ggml : fix build issues in certain environments
compilade Aug 1, 2024
5417089
ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
compilade Aug 1, 2024
45719a2
ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
compilade Aug 1, 2024
04eec58
ggml : remove q1_3 and q2_2
compilade Aug 2, 2024
f034aa1
ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency
compilade Aug 3, 2024
96b3d41
ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
compilade Aug 7, 2024
7c5bfd5
Optimize Vulkan backend for better CPU performance and less GPU synch…
mtavenrath Aug 11, 2024
33309f6
llama : check all graph nodes when searching for result_embd_pooled (…
fairydreaming Aug 11, 2024
a21c6fd
update guide (#8909)
arthw Aug 11, 2024
8cd1bcf
flake.lock: Update (#8979)
ggerganov Aug 11, 2024
4134999
gguf-py : Numpy dequantization for most types (#8939)
compilade Aug 11, 2024
d911cd1
Merge branch 'master' into compilade/bitnet-ternary
compilade Aug 11, 2024
3a0bf17
gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
compilade Aug 12, 2024
5ef07e2
server : handle models with missing EOS token (#8997)
ggerganov Aug 12, 2024
d3ae0ee
py : fix requirements check '==' -> '~=' (#8982)
ggerganov Aug 12, 2024
2589292
Fix a spelling mistake (#9001)
Septa2112 Aug 12, 2024
df5478f
ggml: fix div-by-zero (#9003)
DavidKorczynski Aug 12, 2024
1262e7e
grammar-parser : fix possible null-deref (#9004)
DavidKorczynski Aug 12, 2024
84eb2f4
docs: introduce gpustack and gguf-parser (#8873)
thxCode Aug 12, 2024
0fd93cd
llama : model-based max number of graph nodes calculation (#8970)
nicoboss Aug 12, 2024
1f67436
ci : enable RPC in all of the released builds (#9006)
rgerganov Aug 12, 2024
fc4ca27
ci : fix github workflow vulnerable to script injection (#9008)
diogoteles08 Aug 12, 2024
828d6ff
export-lora : throw error if lora is quantized (#9002)
ngxson Aug 13, 2024
06943a6
ggml : move rope type enum to ggml.h (#8949)
danbev Aug 13, 2024
895004f
convert : allow direct conversion to TQ1_0 and TQ2_0
compilade Aug 13, 2024
69f7726
ggml-quants : allow using ARM dot product instructions for TQ1_0
compilade Aug 13, 2024
82b2404
Merge branch 'master' into compilade/bitnet-ternary
compilade Aug 13, 2024
35cc556
ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
compilade Aug 13, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion .github/workflows/bench.yml
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,8 @@ jobs:

- name: Server bench
id: server_bench
env:
HEAD_REF: ${{ github.head_ref || github.ref_name }}
run: |
set -eux

Expand All @@ -137,7 +139,7 @@ jobs:
python bench.py \
--runner-label ${{ env.RUNNER_LABEL }} \
--name ${{ github.job }} \
--branch ${{ github.head_ref || github.ref_name }} \
--branch $HEAD_REF \
--commit ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha }} \
--scenario script.js \
--duration ${{ github.event.inputs.duration || env.DURATION }} \
Expand Down
22 changes: 10 additions & 12 deletions .github/workflows/build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ jobs:
sysctl -a
mkdir build
cd build
cmake -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL_EMBED_LIBRARY=ON -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=OFF ..
cmake -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL_EMBED_LIBRARY=ON -DLLAMA_CURL=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF ..
cmake --build . --config Release -j $(sysctl -n hw.logicalcpu)

- name: Test
Expand Down Expand Up @@ -105,7 +105,7 @@ jobs:
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggerganov/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL=OFF -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=OFF
cmake -B build -DLLAMA_FATAL_WARNINGS=ON -DGGML_METAL=OFF -DLLAMA_CURL=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)

- name: Test
Expand Down Expand Up @@ -222,7 +222,7 @@ jobs:
run: |
mkdir build
cd build
cmake .. -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=OFF
cmake .. -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF
cmake --build . --config Release -j $(nproc)

- name: Test
Expand Down Expand Up @@ -696,22 +696,20 @@ jobs:
strategy:
matrix:
include:
- build: 'rpc-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=ON'
- build: 'noavx-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DBUILD_SHARED_LIBS=ON'
- build: 'avx2-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=ON'
- build: 'avx-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_AVX2=OFF -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX2=OFF -DBUILD_SHARED_LIBS=ON'
- build: 'avx512-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_AVX512=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_AVX512=ON -DBUILD_SHARED_LIBS=ON'
- build: 'openblas-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_BLAS=ON -DBUILD_SHARED_LIBS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BLAS=ON -DBUILD_SHARED_LIBS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
- build: 'kompute-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_KOMPUTE=ON -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_KOMPUTE=ON -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON -DBUILD_SHARED_LIBS=ON'
- build: 'vulkan-x64'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=ON'
defines: '-DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=ON'
- build: 'llvm-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DBUILD_SHARED_LIBS=ON'
- build: 'msvc-arm64'
Expand Down
6 changes: 2 additions & 4 deletions .github/workflows/python-check-requirements.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,15 +6,13 @@ on:
- '.github/workflows/python-check-requirements.yml'
- 'scripts/check-requirements.sh'
- 'convert*.py'
- 'requirements.txt'
- 'requirements/*.txt'
- '**/requirements*.txt'
pull_request:
paths:
- '.github/workflows/python-check-requirements.yml'
- 'scripts/check-requirements.sh'
- 'convert*.py'
- 'requirements.txt'
- 'requirements/*.txt'
- '**/requirements*.txt'

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
Expand Down
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -186,10 +186,12 @@ Unless otherwise noted these projects are open-source with permissive licensing:

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage

**Infrastructure:**

- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs

**Games:**
- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
Expand Down
3 changes: 3 additions & 0 deletions common/grammar-parser.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -369,6 +369,9 @@ namespace grammar_parser {
}
// Validate the state to ensure that all rules are defined
for (const auto & rule : state.rules) {
if (rule.empty()) {
throw std::runtime_error("Undefined rule");
}
for (const auto & elem : rule) {
if (elem.type == LLAMA_GRETYPE_RULE_REF) {
// Ensure that the rule at that location exists
Expand Down
47 changes: 33 additions & 14 deletions convert_hf_to_gguf.py
Original file line number Diff line number Diff line change
Expand Up @@ -301,6 +301,20 @@ def prepare_tensors(self):
):
data_qtype = gguf.GGMLQuantizationType.F32

if data_qtype is False and any(
self.match_model_tensor_name(new_name, key, bid)
for key in (
gguf.MODEL_TENSOR.TOKEN_EMBD,
gguf.MODEL_TENSOR.OUTPUT,
)
):
if self.ftype in (
gguf.LlamaFileType.MOSTLY_TQ1_0,
gguf.LlamaFileType.MOSTLY_TQ2_0,
):
# TODO: use Q4_K and Q6_K
data_qtype = gguf.GGMLQuantizationType.F16

# No override (data_qtype is False), or wants to be quantized (data_qtype is True)
if isinstance(data_qtype, bool):
if self.ftype == gguf.LlamaFileType.ALL_F32:
Expand All @@ -311,6 +325,10 @@ def prepare_tensors(self):
data_qtype = gguf.GGMLQuantizationType.BF16
elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0:
data_qtype = gguf.GGMLQuantizationType.Q8_0
elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ1_0:
data_qtype = gguf.GGMLQuantizationType.TQ1_0
elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ2_0:
data_qtype = gguf.GGMLQuantizationType.TQ2_0
else:
raise ValueError(f"Unknown file type: {self.ftype.name}")

Expand Down Expand Up @@ -1606,15 +1624,16 @@ def set_gguf_parameters(self):
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.LINEAR)
self.gguf_writer.add_rope_scaling_factor(1.0)

def weight_quant(self, weight):
def weight_quant(self, weight: Tensor) -> Tensor:
dtype = weight.dtype
weight = weight.float()
s = 1 / weight.abs().mean().clamp(min=1e-5)
weight = (weight * s).round().clamp(-1, 1) / s
scale = weight.abs().max().unsqueeze(0)
weight = torch.where(weight.abs().less(1e-6), 0, weight).type(dtype)
weight = torch.sign(weight).type(dtype)
return weight.type(dtype), scale.type(torch.float32)
scale = weight.abs().mean().clamp(min=1e-5)
iscale = 1 / scale
# TODO: multiply by the scale directly instead of inverting it twice
# (this is also unnecessarily doubly inverted upstream)
# ref: https://huggingface.co/1bitLLM/bitnet_b1_58-3B/blob/af89e318d78a70802061246bf037199d2fb97020/utils_quant.py#L10
result = (weight * iscale).round().clamp(-1, 1) / iscale
return result.type(dtype)

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
new_name = self.map_tensor_name(name)
Expand All @@ -1629,11 +1648,9 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
gguf.MODEL_TENSOR.FFN_GATE,
]):
# transform weight into 1/0/-1 (in fp32)
weight_torch, scale_torch = self.weight_quant(data_torch)
yield (new_name, weight_torch)
yield (new_name.removesuffix(".weight") + ".scale", scale_torch)
else:
yield (new_name, data_torch)
data_torch = self.weight_quant(data_torch)

yield (new_name, data_torch)


@Model.register("GrokForCausalLM")
Expand Down Expand Up @@ -3815,8 +3832,8 @@ def parse_args() -> argparse.Namespace:
help="path to write to; default: based on input. {ftype} will be replaced by the outtype.",
)
parser.add_argument(
"--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "auto"], default="f16",
help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
"--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "auto"], default="f16",
help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
)
parser.add_argument(
"--bigendian", action="store_true",
Expand Down Expand Up @@ -3903,6 +3920,8 @@ def main() -> None:
"f16": gguf.LlamaFileType.MOSTLY_F16,
"bf16": gguf.LlamaFileType.MOSTLY_BF16,
"q8_0": gguf.LlamaFileType.MOSTLY_Q8_0,
"tq1_0": gguf.LlamaFileType.MOSTLY_TQ1_0,
"tq2_0": gguf.LlamaFileType.MOSTLY_TQ2_0,
"auto": gguf.LlamaFileType.GUESSED,
}

Expand Down
Loading
Loading