Compilade/fix mpt pretok #231

Merged · 28 commits · Jul 11, 2024

Commits
db2ffd5
llama : fix mpt and olmo pre-tokenizer
compilade Jun 30, 2024
ac0f33c
Merge branch 'master' into compilade/fix-mpt-pretok
compilade Jul 7, 2024
d5d30b2
llama : pre-tokenize non-special user-defined tokens first
compilade Jul 7, 2024
6b961e3
Merge branch 'master' into compilade/fix-mpt-pretok
compilade Jul 7, 2024
56df1fc
llama : fix detection of control-like user-defined tokens
compilade Jul 7, 2024
6e351e0
convert_hf : identify which user-defined tokens are control tokens
compilade Jul 7, 2024
f9d42c5
convert_hf : identify more added control tokens for SPM tokenizers
compilade Jul 8, 2024
31a1b0e
llama : fix Viking pre-tokenizer regex
compilade Jul 8, 2024
d6fe269
llama : fix command-r detokenization
compilade Jul 8, 2024
d4df785
convert_hf : reduce usages of the UNKNOWN token type
compilade Jul 9, 2024
98edea6
llama : add UNKNOWN tokens in the special tokens cache
compilade Jul 9, 2024
5b0b8d8
sycl : Reenabled mmvq path for the SYCL Nvidia Backend (#8372)
Alcpz Jul 9, 2024
a03e8dd
make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392)
JohannesGaessler Jul 9, 2024
e500d61
Deprecation warning to assist with migration to new binary names (#8283)
HanClinto Jul 9, 2024
fd560fe
Update README.md to fix broken link to docs (#8399)
andysalerno Jul 9, 2024
a59f8fd
Server: Enable setting default sampling parameters via command-line (…
HanClinto Jul 9, 2024
8f0fad4
py : fix extra space in convert_hf_to_gguf.py (#8407)
laik Jul 10, 2024
e4dd31f
py : fix converter for internlm2 (#8321)
RunningLeon Jul 10, 2024
a8be1e6
llama : add assert about missing llama_encode() call (#8400)
fairydreaming Jul 10, 2024
7a80710
msvc : silence codecvt c++17 deprecation warnings (#8395)
iboB Jul 10, 2024
cc61948
llama : C++20 compatibility for u8 strings (#8408)
iboB Jul 10, 2024
83321c6
gguf-py rel pipeline (#8410)
monatis Jul 10, 2024
0f1a39f
ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
Dibakar Jul 10, 2024
6b2a849
ggml : move sgemm sources to llamafile subfolder (#8394)
ggerganov Jul 10, 2024
f4444d9
[SYCL] Use multi_ptr to clean up deprecated warnings (#8256)
Jul 10, 2024
dd07a12
Name Migration: Build the deprecation-warning 'main' binary every tim…
HanClinto Jul 10, 2024
afa6119
Merge branch 'master' into compilade/fix-mpt-pretok
compilade Jul 10, 2024
1caa20f
convert_hf : reduce usages of UNKNOWN for InternLM2
compilade Jul 10, 2024
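
A note on the central change in this series: commit d5d30b2 makes llama.cpp match non-special user-defined tokens against the raw input text before the model's pre-tokenizer regex runs, so those tokens survive intact instead of being split apart. The sketch below illustrates the idea only; the Fragment type and the function name are invented for this example and are not llama.cpp's actual internals, which operate on an internal fragment buffer during tokenization.

// Illustrative sketch (not llama.cpp's real API): partition the input around
// user-defined tokens so the regex pre-tokenizer never sees them.
#include <string>
#include <utility>
#include <vector>

struct Fragment {
    std::string text;
    bool        is_user_token; // true: emit as a single token, skip the regex
};

static std::vector<Fragment> split_on_user_tokens(
        const std::string & text,
        const std::vector<std::string> & user_tokens) {
    std::vector<Fragment> fragments = { { text, false } };
    for (const std::string & tok : user_tokens) {
        std::vector<Fragment> next;
        for (const Fragment & frag : fragments) {
            if (frag.is_user_token) { next.push_back(frag); continue; } // already atomic
            size_t pos = 0;
            size_t hit;
            while ((hit = frag.text.find(tok, pos)) != std::string::npos) {
                if (hit > pos) {
                    next.push_back({ frag.text.substr(pos, hit - pos), false });
                }
                next.push_back({ tok, true });
                pos = hit + tok.size();
            }
            if (pos < frag.text.size()) {
                next.push_back({ frag.text.substr(pos), false });
            }
        }
        fragments = std::move(next);
    }
    return fragments; // raw-text fragments then go through the pre-tokenizer regex
}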
4 changes: 0 additions & 4 deletions CMakeLists.txt
@@ -50,9 +50,6 @@ endif()
# option list
#

# general
option(LLAMA_CCACHE "llama: use ccache if available" ON)

# debug
option(LLAMA_ALL_WARNINGS "llama: enable all compiler warnings" ON)
option(LLAMA_ALL_WARNINGS_3RD_PARTY "llama: enable all compiler warnings in 3rd party libs" OFF)
@@ -77,7 +74,6 @@ option(LLAMA_CURL "llama: use libcurl to download model from an URL" OFF)
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/build-info.cmake)

# override ggml options
set(GGML_CCACHE ${LLAMA_CCACHE})
set(GGML_SANITIZE_THREAD ${LLAMA_SANITIZE_THREAD})
set(GGML_SANITIZE_ADDRESS ${LLAMA_SANITIZE_ADDRESS})
set(GGML_SANITIZE_UNDEFINED ${LLAMA_SANITIZE_UNDEFINED})
100 changes: 88 additions & 12 deletions Makefile
@@ -64,10 +64,14 @@ TEST_TARGETS = \
tests/test-tokenizer-1-spm

# Legacy build targets that were renamed in #7809, but should still be removed when the project is cleaned
LEGACY_TARGETS = main quantize quantize-stats perplexity imatrix embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml \
LEGACY_TARGETS_CLEAN = main quantize quantize-stats perplexity imatrix embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml \
simple batched batched-bench save-load-state server gguf gguf-split eval-callback llama-bench libllava.a llava-cli baby-llama \
retrieval speculative infill tokenize benchmark-matmult parallel finetune export-lora lookahead lookup passkey gritlm

# Legacy build targets that were renamed in #7809, but for which we still want to build binaries that output a deprecation warning if people try to use them.
# We don't want to clutter things too much, so we only build replacements for the most commonly used binaries.
LEGACY_TARGETS_BUILD = main quantize perplexity embedding server finetune

# Deprecation aliases
ifdef LLAMA_CUBLAS
$(error LLAMA_CUBLAS is removed. Use GGML_CUDA instead.)
@@ -193,7 +197,7 @@ ifdef GGML_RPC
BUILD_TARGETS += rpc-server
endif

default: $(BUILD_TARGETS)
default: $(BUILD_TARGETS) $(LEGACY_TARGETS_BUILD)

test: $(TEST_TARGETS)
@failures=0; \
@@ -228,7 +232,7 @@ test: $(TEST_TARGETS)
fi
@echo 'All tests passed.'

all: $(BUILD_TARGETS) $(TEST_TARGETS)
all: $(BUILD_TARGETS) $(TEST_TARGETS) $(LEGACY_TARGETS_BUILD)

ifdef RISCV_CROSS_COMPILE
CC := riscv64-unknown-linux-gnu-gcc
@@ -245,17 +249,22 @@ MK_CFLAGS = -std=c11 -fPIC
MK_CXXFLAGS = -std=c++11 -fPIC
MK_NVCCFLAGS = -std=c++11

ifndef LLAMA_NO_CCACHE
ifdef LLAMA_NO_CCACHE
GGML_NO_CCACHE := 1
DEPRECATE_WARNING := 1
endif

ifndef GGML_NO_CCACHE
CCACHE := $(shell which ccache)
ifdef CCACHE
export CCACHE_SLOPPINESS = time_macros
$(info I ccache found, compilation results will be cached. Disable with LLAMA_NO_CCACHE.)
$(info I ccache found, compilation results will be cached. Disable with GGML_NO_CCACHE.)
CC := $(CCACHE) $(CC)
CXX := $(CCACHE) $(CXX)
else
$(info I ccache not found. Consider installing it for faster compilation.)
endif # CCACHE
endif # LLAMA_NO_CCACHE
endif # GGML_NO_CCACHE

# clock_gettime came in POSIX.1b (1993)
# CLOCK_MONOTONIC came in POSIX.1-2001 / SUSv3 as optional
@@ -545,7 +554,7 @@ endif # GGML_BLIS

ifndef GGML_NO_LLAMAFILE
MK_CPPFLAGS += -DGGML_USE_LLAMAFILE
OBJ_GGML += ggml/src/sgemm.o
OBJ_GGML += ggml/src/llamafile/sgemm.o
endif

ifdef GGML_RPC
@@ -826,7 +835,8 @@ OBJ_GGML += \
ggml/src/ggml.o \
ggml/src/ggml-alloc.o \
ggml/src/ggml-backend.o \
ggml/src/ggml-quants.o
ggml/src/ggml-quants.o \
ggml/src/ggml-aarch64.o

OBJ_LLAMA = \
src/llama.o \
@@ -926,6 +936,7 @@ $(info - LLAMA_NO_LLAMAFILE)
$(info - LLAMA_NO_ACCELERATE)
$(info - LLAMA_NO_OPENMP)
$(info - LLAMA_NO_METAL)
$(info - LLAMA_NO_CCACHE)
$(info )
endif

@@ -959,15 +970,22 @@ ggml/src/ggml-quants.o: \
ggml/src/ggml-common.h
$(CC) $(CFLAGS) -c $< -o $@

ggml/src/ggml-aarch64.o: \
ggml/src/ggml-aarch64.c \
ggml/include/ggml.h \
ggml/src/ggml-aarch64.h \
ggml/src/ggml-common.h
$(CC) $(CFLAGS) -c $< -o $@

ggml/src/ggml-blas.o: \
ggml/src/ggml-blas.cpp \
ggml/include/ggml-blas.h
$(CXX) $(CXXFLAGS) -c $< -o $@

ifndef GGML_NO_LLAMAFILE
ggml/src/sgemm.o: \
ggml/src/sgemm.cpp \
ggml/src/sgemm.h \
ggml/src/llamafile/sgemm.o: \
ggml/src/llamafile/sgemm.cpp \
ggml/src/llamafile/sgemm.h \
ggml/include/ggml.h
$(CXX) $(CXXFLAGS) -c $< -o $@
endif # GGML_NO_LLAMAFILE
@@ -1092,7 +1110,7 @@ clean:
rm -vrf ggml/src/ggml-cuda/template-instances/*.o
rm -rvf $(BUILD_TARGETS)
rm -rvf $(TEST_TARGETS)
rm -rvf $(LEGACY_TARGETS)
rm -rvf $(LEGACY_TARGETS_CLEAN)
find examples pocs -type f -name "*.o" -delete

#
@@ -1488,3 +1506,61 @@ llama-q8dot: pocs/vdot/q8dot.cpp ggml/src/ggml.o \
$(OBJ_GGML)
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)

#
# Deprecated binaries that we want to keep around long enough for people to migrate to the new filenames; after that, these can be removed.
#
# Mark legacy binary targets as .PHONY so that they are always checked.
.PHONY: main quantize perplexity embedding server finetune

# NOTE: We currently always build the deprecation-warning `main` and `server` binaries to help users migrate.
# Eventually we will want to stop building these targets by default.
main: examples/deprecation-warning/deprecation-warning.cpp
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
@echo "NOTICE: The 'main' binary is deprecated. Please use 'llama-cli' instead."

server: examples/deprecation-warning/deprecation-warning.cpp
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
@echo "NOTICE: The 'server' binary is deprecated. Please use 'llama-server' instead."

quantize: examples/deprecation-warning/deprecation-warning.cpp
ifneq (,$(wildcard quantize))
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
@echo "#########"
@echo "WARNING: The 'quantize' binary is deprecated. Please use 'llama-quantize' instead."
@echo " Remove the 'quantize' binary to remove this warning."
@echo "#########"
endif

perplexity: examples/deprecation-warning/deprecation-warning.cpp
ifneq (,$(wildcard perplexity))
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
@echo "#########"
@echo "WARNING: The 'perplexity' binary is deprecated. Please use 'llama-perplexity' instead."
@echo " Remove the 'perplexity' binary to remove this warning."
@echo "#########"
endif

embedding: examples/deprecation-warning/deprecation-warning.cpp
ifneq (,$(wildcard embedding))
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
@echo "#########"
@echo "WARNING: The 'embedding' binary is deprecated. Please use 'llama-embedding' instead."
@echo " Remove the 'embedding' binary to remove this warning."
@echo "#########"
endif

finetune: examples/deprecation-warning/deprecation-warning.cpp
ifneq (,$(wildcard finetune))
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
@echo "#########"
@echo "WARNING: The 'finetune' binary is deprecated. Please use 'llama-finetune' instead."
@echo " Remove the 'finetune' binary to remove this warning."
@echo "#########"
endif
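
All of the deprecation targets above compile the same single source file, examples/deprecation-warning/deprecation-warning.cpp (added in #8283). A minimal sketch of what such a stub can look like follows; the message wording and the old-to-new name mapping are assumptions for illustration, not the file's verbatim contents.

// Illustrative deprecation stub (assumed behavior, not the verbatim source).
#include <cstdio>
#include <string>

int main(int /*argc*/, char ** argv) {
    // Recover the binary name the user actually invoked, without its path.
    std::string name = argv[0] ? argv[0] : "main";
    const size_t slash = name.find_last_of("/\\");
    if (slash != std::string::npos) {
        name = name.substr(slash + 1);
    }
    // Assumed mapping: binaries were renamed with a "llama-" prefix in #7809,
    // except "main", which became "llama-cli".
    const std::string replacement = (name == "main") ? "llama-cli" : "llama-" + name;
    fprintf(stderr, "WARNING: The binary '%s' is deprecated.\n", name.c_str());
    fprintf(stderr, "         Please use '%s' instead.\n", replacement.c_str());
    // Fail loudly so scripts still calling the old name notice the migration.
    return 1;
}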
1 change: 1 addition & 0 deletions Package.swift
@@ -10,6 +10,7 @@ var sources = [
"ggml/src/ggml-alloc.c",
"ggml/src/ggml-backend.c",
"ggml/src/ggml-quants.c",
"ggml/src/ggml-aarch64.c",
]

var resources: [Resource] = []
2 changes: 1 addition & 1 deletion README.md
@@ -453,7 +453,7 @@ To learn more how to measure perplexity using llama.cpp, [read this documentatio
- [How to build](./docs/build.md)
- [Running on Docker](./docs/docker.md)
- [Build on Android](./docs/android.md)
- [Performance troubleshooting](./docs/token_generation_performance_tips.md)
- [Performance troubleshooting](./docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)

**Seminal papers and background on the models**
4 changes: 4 additions & 0 deletions common/common.cpp
@@ -1,3 +1,7 @@
#if defined(_MSC_VER)
#define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING
#endif

#include "common.h"
// Change JSON_ASSERT from assert() to GGML_ASSERT:
#define JSON_ASSERT GGML_ASSERT
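
The common/common.cpp hunk above silences MSVC's C++17 deprecation warning for <codecvt>: the macro _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING must be defined before the header is first pulled in, which is why the #define sits above every #include. A standalone sketch of the same pattern; the conversion helper here is a hypothetical example, not a function from common.cpp.

// The macro must precede the first include that drags in <codecvt>.
#if defined(_MSC_VER)
#define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING
#endif

#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert UTF-8 to a wide string using the
// C++17-deprecated codecvt facility.
std::wstring utf8_to_wstring(const std::string & s) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(s);
}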