
b2586 #103

Merged: 118 commits, Apr 3, 2024
Commits (118)

47cc7a7
Server: Handle n_keep parameter in the request (#6174)
jkarthic Mar 20, 2024
f8c4e74
llava : add a MobileVLM_V2-1.7B backup (#6152)
ZiangWu-77 Mar 20, 2024
d795988
Revert "llava : add a MobileVLM_V2-1.7B backup (#6152)"
ggerganov Mar 20, 2024
bc0baab
server : allow to override -ngl in tests (#6170)
ggerganov Mar 20, 2024
6b7e76d
gitignore : ignore curl-related files
ggerganov Mar 20, 2024
91f8ad1
Server: version bump for httplib and json (#6169)
ngxson Mar 20, 2024
ccf58aa
cuda : refactor to remove global resources (#6170)
slaren Mar 20, 2024
272935b
llava : add MobileVLM_V2 backup (#6175)
ZiangWu-77 Mar 20, 2024
f9c7ba3
llava : update MobileVLM-README.md (#6180)
ZiangWu-77 Mar 20, 2024
1c51f98
cuda : print the returned error when CUDA initialization fails (#6185)
slaren Mar 20, 2024
42e21c6
cuda : fix conflict with std::swap (#6186)
slaren Mar 21, 2024
c5b8595
Add nvidia and amd backends (#6157)
AidanBeltonS Mar 21, 2024
76aa30a
Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (#6183)
ikawrakow Mar 21, 2024
5e43ba8
build : add mac pre-build binaries (#6182)
Vaibhavs10 Mar 21, 2024
1943c01
ci : fix indentation error (#6195)
Vaibhavs10 Mar 21, 2024
5b7b0ac
json-schema-to-grammar improvements (+ added to server) (#5978)
ochafik Mar 21, 2024
cfd3be7
ggml : same IQ4_NL quantization for CPU/CUDA/Metal (#6196)
ikawrakow Mar 21, 2024
03a8f8f
cuda : fix LLAMA_CUDA_F16 build (#6197)
slaren Mar 21, 2024
924ce1d
tests : disable system() calls (#6198)
ggerganov Mar 21, 2024
f372c49
Corrected typo to wrong file (#6199)
semidark Mar 21, 2024
d0a7123
cuda : disable host register by default (#6206)
slaren Mar 21, 2024
be07a03
server : update readme doc from `slot_id` to `id_slot` (#6213)
kaetemi Mar 21, 2024
fa046ea
Fix params underscore convert to dash. (#6203)
dranger003 Mar 22, 2024
59c17f0
add blog link (#6222)
NeoZhangJianyu Mar 22, 2024
95d576b
metal : pad n_ctx by 32 (#6177)
ggerganov Mar 22, 2024
b2075fd
ci : add CURL flag for the mac builds (#6214)
Vaibhavs10 Mar 22, 2024
b3e94f2
metal : proper assert for mat-mat memory alignment (#6225)
ggerganov Mar 22, 2024
68e210b
server : enable continuous batching by default (#6231)
ggerganov Mar 22, 2024
6b8bb3a
server : fix n_keep always showing as 0 in response (#6211)
kaetemi Mar 22, 2024
29ab270
readme : add RecurseChat to the list of UIs (#6219)
xyc Mar 22, 2024
2f0e81e
cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy…
slaren Mar 22, 2024
72114ed
json-schema-to-grammar : fix order of props + non-str const/enum (#6232)
ochafik Mar 22, 2024
f77a8ff
tests : conditional python & node json schema tests (#6207)
ochafik Mar 22, 2024
e80f06d
llama : correction of the attn.v.weight quantization for IQ3_XS (#6209)
Nexesenex Mar 22, 2024
80bd33b
common : add HF arg helpers (#6234)
ggerganov Mar 22, 2024
ee804f6
ci: apply concurrency limit for github workflows (#6243)
mscheong01 Mar 22, 2024
dba1af6
llama_model_loader: support multiple split/shard GGUFs (#6187)
phymbert Mar 22, 2024
1d0331c
quantize: options for output and token embedding tensors qtype (#6239)
ikawrakow Mar 22, 2024
92397d8
convert-llama2c-to-ggml : enable conversion of GQA models (#6237)
fraxy-v Mar 22, 2024
56a00f0
common : default --hf-file to --model (#6234)
ggerganov Mar 22, 2024
50ccaf5
lookup: complement data from context with general text statistics (#5…
JohannesGaessler Mar 23, 2024
1b26aeb
server: flush stdout after logging in both text and json layout (#6253)
phymbert Mar 23, 2024
21cad01
split: add gguf-split in the make build target (#6262)
phymbert Mar 23, 2024
476b025
llama : add grok-1 support (#6204)
arki05 Mar 23, 2024
1997577
server: docs: `--threads` and `--threads`, `--ubatch-size`, `--log-di…
phymbert Mar 23, 2024
f482bb2
common: llama_load_model_from_url split support (#6192)
phymbert Mar 23, 2024
9556217
gitignore : gguf-split
ggerganov Mar 23, 2024
94d1b3b
use _wfopen instead of fopen on Windows (#6248)
cebtenzzre Mar 23, 2024
d03224a
Support build win release for SYCL (#6241)
NeoZhangJianyu Mar 24, 2024
ddf6568
[SYCL] offload op (#6217)
airMeng Mar 24, 2024
586e7bc
sampling : deduplicated code for probability distribution access (#6240)
mscheong01 Mar 24, 2024
ea279d5
ci : close inactive issue, increase operations per run (#6270)
phymbert Mar 24, 2024
7aed0ff
Fixed lookup compilation issues on Windows (#6273)
JohannesGaessler Mar 24, 2024
a0e584d
imatrix : fix wname for mul_mat_id ops (#6271)
ggerganov Mar 24, 2024
a32b77c
Fix heap corruption from wmode out-of-bound writes on windows (#6272)
TheFlipbook Mar 24, 2024
7733f0c
ggml : support AVX512VNNI (#6280)
jart Mar 25, 2024
64e7b47
examples : add "retrieval" (#6193)
mscheong01 Mar 25, 2024
95ad616
[SYCL] fix SYCL backend build on windows is break by LOG() error (#6290)
NeoZhangJianyu Mar 25, 2024
ad3a050
Server: clean up OAI params parsing function (#6284)
ngxson Mar 25, 2024
ae1f211
cuda : refactor into multiple files (#6269)
slaren Mar 25, 2024
2f34b86
cuda : fix LLAMA_CUDA_F16 build (#6298)
slaren Mar 25, 2024
43139cc
flake.lock: Update (#6266)
ggerganov Mar 25, 2024
1f2fd4e
tests : include IQ2_XXS and IQ2_XS in test-quantize-fns (#6303)
ikawrakow Mar 25, 2024
b06c16e
nix: fix blas support (#6281)
ck3d Mar 25, 2024
2803459
cuda : rename build flag to LLAMA_CUDA (#6299)
slaren Mar 26, 2024
e190f1f
nix: make `xcrun` visible in Nix sandbox for precompiling Metal shade…
josephst Mar 26, 2024
3d032ec
server : add `n_discard` parameter (#6300)
kaetemi Mar 26, 2024
deb7240
embedding : adjust `n_ubatch` value (#6296)
mscheong01 Mar 26, 2024
d25b1c3
quantize : be able to override metadata by key (#6321)
ikawrakow Mar 26, 2024
e097633
convert-hf : fix exception in sentencepiece with added tokens (#6320)
pcuenca Mar 26, 2024
55c1b2a
IQ1_M: 1.75 bpw quantization (#6302)
ikawrakow Mar 26, 2024
557410b
llama : greatly reduce output buffer memory usage (#6122)
compilade Mar 26, 2024
32c8486
wpm : portable unicode tolower (#6305)
cebtenzzre Mar 26, 2024
a4f569e
[SYCL] fix no file in win rel (#6314)
NeoZhangJianyu Mar 27, 2024
0642b22
server: public: use relative routes for static files (#6325)
EZForever Mar 27, 2024
1740d6d
readme : add php api bindings (#6326)
mcharytoniuk Mar 27, 2024
2ab4f00
llama2c : open file as binary (#6332)
ggerganov Mar 27, 2024
e562b97
common : change --no-penalize-nl to --penalize-nl (#6334)
CISC Mar 27, 2024
cbc8343
Make IQ1_M work for QK_K = 64 (#6327)
ikawrakow Mar 27, 2024
e82f9e2
[SYCL] Fix batched impl for NVidia GPU (#6164)
AidanBeltonS Mar 27, 2024
1e13987
embedding : show full embedding for single prompt (#6342)
howlger Mar 27, 2024
3a03459
make : whitespace
ggerganov Mar 27, 2024
e5b89a4
ggml : fix bounds checking of zero size views (#6347)
slaren Mar 27, 2024
53c7ec5
nix: ci: dont test cuda and rocm (for now)
SomeoneSerge Mar 27, 2024
a016026
server: continuous performance monitoring and PR comment (#6283)
phymbert Mar 27, 2024
25f4a61
[SYCL] fix set main gpu crash (#6339)
NeoZhangJianyu Mar 28, 2024
d0e2f64
doc: fix typo in MobileVLM-README.md (#6181)
ZiangWu-77 Mar 28, 2024
f6a0f5c
nix: .#widnows: init
hutli Feb 15, 2024
22a462c
nix: package: don't introduce the dependency on python
SomeoneSerge Mar 26, 2024
e9f17dc
nix: .#windows: proper cross-compilation set-up
SomeoneSerge Mar 26, 2024
dbb03e2
only using explicit blas if hostPlatform is allowed
hutli Mar 27, 2024
c873976
using blas.meta.available to check host platform
hutli Mar 27, 2024
d39b308
nix: moved blas availability check to package inputs so it is still o…
hutli Mar 27, 2024
d2d8f38
nix: removed unnessesary indentation
hutli Mar 27, 2024
6902cb7
server : stop gracefully on SIGTERM (#6348)
EZForever Mar 28, 2024
cfc4d75
doc: fix outdated default value of batch size (#6336)
Sunt-ing Mar 28, 2024
28cb9a0
ci: bench: fix master not schedule, fix commit status failed on exter…
phymbert Mar 28, 2024
0308f5e
llama : fix command-r inference when omitting outputs (#6367)
compilade Mar 28, 2024
66ba560
llava : fix MobileVLM (#6364)
ZiangWu-77 Mar 28, 2024
be55134
convert : refactor vocab selection logic (#6355)
cebtenzzre Mar 28, 2024
5106ef4
[SYCL] Revisited & updated SYCL build documentation (#6141)
OuadiElfarouki Mar 28, 2024
bfe7daf
readme : add notice for UI list
ggerganov Mar 28, 2024
b75c381
convert : allow conversion of Mistral HF models (#6144)
pcuenca Mar 29, 2024
057400a
llama : remove redundant reshape in build_kv_store (#6369)
danbev Mar 29, 2024
8093987
cmake : add explicit metal version options (#6370)
mattjcly Mar 29, 2024
b910287
readme : add project (#6356)
zhouwg Mar 29, 2024
cfde806
ci : fix BGE wget (#6383)
ggerganov Mar 29, 2024
0695747
[Model] Add support for xverse (#6301)
hxer7963 Mar 29, 2024
d48ccf3
sync : ggml (#6351)
ggerganov Mar 29, 2024
ba0c7c7
Vulkan k-quant mmq and ggml-backend offload functionality (#6155)
0cc4m Mar 29, 2024
f7fc5f6
split: allow --split-max-size option (#6343)
ngxson Mar 29, 2024
c342d07
Fedora build update (#6388)
Man2Dev Mar 29, 2024
37e7854
ci: bench: fix Resource not accessible by integration on PR event (#6…
phymbert Mar 30, 2024
c50a82c
readme : update hot topics
ggerganov Mar 31, 2024
226e819
ci: server: verify deps are coherent with the commit (#6409)
phymbert Apr 1, 2024
33a5244
compare-llama-bench.py: fix long hexsha args (#6424)
JohannesGaessler Apr 1, 2024
f87f7b8
flake.lock: Update (#6402)
ggerganov Apr 1, 2024
5260486
[SYCL] Disable iqx on windows as WA (#6435)
airMeng Apr 3, 2024
Diff view

1 change: 1 addition & 0 deletions .clang-tidy
@@ -12,6 +12,7 @@ Checks: >
-readability-implicit-bool-conversion,
-readability-magic-numbers,
-readability-uppercase-literal-suffix,
-readability-simplify-boolean-expr,
clang-analyzer-*,
-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,
performance-*,
4 changes: 2 additions & 2 deletions .devops/full-cuda.Dockerfile
@@ -26,8 +26,8 @@ COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable cuBLAS
ENV LLAMA_CUBLAS=1
# Enable CUDA
ENV LLAMA_CUDA=1

RUN make

2 changes: 1 addition & 1 deletion .devops/llama-cpp-clblast.srpm.spec
@@ -1,5 +1,5 @@
# SRPM for building from source and packaging an RPM for RPM-based distros.
# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
# Built and maintained by John Boero - [email protected]
# In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

.devops/llama-cpp-cublas.srpm.spec → .devops/llama-cpp-cuda.srpm.spec
@@ -1,5 +1,5 @@
# SRPM for building from source and packaging an RPM for RPM-based distros.
# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
# Built and maintained by John Boero - [email protected]
# In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

@@ -12,7 +12,7 @@
# 4. OpenCL/CLBLAST support simply requires the ICD loader and basic opencl libraries.
# It is up to the user to install the correct vendor-specific support.

Name: llama.cpp-cublas
Name: llama.cpp-cuda
Version: %( date "+%%Y%%m%%d" )
Release: 1%{?dist}
Summary: CPU Inference of LLaMA model in pure C/C++ (no CUDA/OpenCL)
@@ -32,24 +32,24 @@ CPU inference for Meta's Lllama2 models using default options.
%setup -n llama.cpp-master

%build
make -j LLAMA_CUBLAS=1
make -j LLAMA_CUDA=1

%install
mkdir -p %{buildroot}%{_bindir}/
cp -p main %{buildroot}%{_bindir}/llamacppcublas
cp -p server %{buildroot}%{_bindir}/llamacppcublasserver
cp -p simple %{buildroot}%{_bindir}/llamacppcublassimple
cp -p main %{buildroot}%{_bindir}/llamacppcuda
cp -p server %{buildroot}%{_bindir}/llamacppcudaserver
cp -p simple %{buildroot}%{_bindir}/llamacppcudasimple

mkdir -p %{buildroot}/usr/lib/systemd/system
%{__cat} <<EOF > %{buildroot}/usr/lib/systemd/system/llamacublas.service
%{__cat} <<EOF > %{buildroot}/usr/lib/systemd/system/llamacuda.service
[Unit]
Description=Llama.cpp server, CPU only (no GPU support in this build).
After=syslog.target network.target local-fs.target remote-fs.target nss-lookup.target

[Service]
Type=simple
EnvironmentFile=/etc/sysconfig/llama
ExecStart=/usr/bin/llamacppcublasserver $LLAMA_ARGS
ExecStart=/usr/bin/llamacppcudaserver $LLAMA_ARGS
ExecReload=/bin/kill -s HUP $MAINPID
Restart=never

@@ -67,10 +67,10 @@ rm -rf %{buildroot}
rm -rf %{_builddir}/*

%files
%{_bindir}/llamacppcublas
%{_bindir}/llamacppcublasserver
%{_bindir}/llamacppcublassimple
/usr/lib/systemd/system/llamacublas.service
%{_bindir}/llamacppcuda
%{_bindir}/llamacppcudaserver
%{_bindir}/llamacppcudasimple
/usr/lib/systemd/system/llamacuda.service
%config /etc/sysconfig/llama

%pre
2 changes: 1 addition & 1 deletion .devops/llama-cpp.srpm.spec
@@ -1,5 +1,5 @@
# SRPM for building from source and packaging an RPM for RPM-based distros.
# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
# Built and maintained by John Boero - [email protected]
# In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

4 changes: 2 additions & 2 deletions .devops/main-cuda.Dockerfile
@@ -20,8 +20,8 @@ COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable cuBLAS
ENV LLAMA_CUBLAS=1
# Enable CUDA
ENV LLAMA_CUDA=1

RUN make

48 changes: 35 additions & 13 deletions .devops/nix/package.nix
@@ -4,13 +4,14 @@
config,
stdenv,
mkShell,
runCommand,
cmake,
ninja,
pkg-config,
git,
python3,
mpi,
openblas, # TODO: Use the generic `blas` so users could switch between alternative implementations
blas,
cudaPackages,
darwin,
rocmPackages,
@@ -23,7 +24,7 @@
useOpenCL
useRocm
useVulkan
],
] && blas.meta.available,
useCuda ? config.cudaSupport,
useMetalKit ? stdenv.isAarch64 && stdenv.isDarwin && !useOpenCL,
useMpi ? false, # Increases the runtime closure size by ~700M
@@ -35,7 +36,8 @@
# It's necessary to consistently use backendStdenv when building with CUDA support,
# otherwise we get libstdc++ errors downstream.
effectiveStdenv ? if useCuda then cudaPackages.backendStdenv else stdenv,
enableStatic ? effectiveStdenv.hostPlatform.isStatic
enableStatic ? effectiveStdenv.hostPlatform.isStatic,
precompileMetalShaders ? false
}@inputs:

let
@@ -65,10 +67,15 @@ let
strings.optionalString (suffices != [ ])
", accelerated with ${strings.concatStringsSep ", " suffices}";

executableSuffix = effectiveStdenv.hostPlatform.extensions.executable;

# TODO: package the Python in this repository in a Nix-like way.
# It'd be nice to migrate to buildPythonPackage, as well as ensure this repo
# is PEP 517-compatible, and ensure the correct .dist-info is generated.
# https://peps.python.org/pep-0517/
#
# TODO: Package up each Python script or service appropriately, by making
# them into "entrypoints"
llama-python = python3.withPackages (
ps: [
ps.numpy
@@ -87,6 +94,11 @@ let
]
);

xcrunHost = runCommand "xcrunHost" {} ''
mkdir -p $out/bin
ln -s /usr/bin/xcrun $out/bin
'';

# apple_sdk is supposed to choose sane defaults, no need to handle isAarch64
# separately
darwinBuildInputs =
@@ -150,13 +162,18 @@ effectiveStdenv.mkDerivation (
postPatch = ''
substituteInPlace ./ggml-metal.m \
--replace '[bundle pathForResource:@"ggml-metal" ofType:@"metal"];' "@\"$out/bin/ggml-metal.metal\";"

# TODO: Package up each Python script or service appropriately.
# If we were to migrate to buildPythonPackage and prepare the `pyproject.toml`,
# we could make those *.py into setuptools' entrypoints
substituteInPlace ./*.py --replace "/usr/bin/env python" "${llama-python}/bin/python"
substituteInPlace ./ggml-metal.m \
--replace '[bundle pathForResource:@"default" ofType:@"metallib"];' "@\"$out/bin/default.metallib\";"
'';

# With PR#6015 https://github.com/ggerganov/llama.cpp/pull/6015,
# `default.metallib` may be compiled with Metal compiler from XCode
# and we need to escape sandbox on MacOS to access Metal compiler.
# `xcrun` is used find the path of the Metal compiler, which is varible
# and not on $PATH
# see https://github.com/ggerganov/llama.cpp/pull/6118 for discussion
__noChroot = effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders;

nativeBuildInputs =
[
cmake
@@ -173,6 +190,8 @@
]
++ optionals (effectiveStdenv.hostPlatform.isGnu && enableStatic) [
glibc.static
] ++ optionals (effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders) [
xcrunHost
];

buildInputs =
Expand All @@ -181,6 +200,7 @@ effectiveStdenv.mkDerivation (
++ optionals useMpi [ mpi ]
++ optionals useOpenCL [ clblast ]
++ optionals useRocm rocmBuildInputs
++ optionals useBlas [ blas ]
++ optionals useVulkan vulkanBuildInputs;

cmakeFlags =
Expand All @@ -191,7 +211,7 @@ effectiveStdenv.mkDerivation (
(cmakeBool "CMAKE_SKIP_BUILD_RPATH" true)
(cmakeBool "LLAMA_BLAS" useBlas)
(cmakeBool "LLAMA_CLBLAST" useOpenCL)
(cmakeBool "LLAMA_CUBLAS" useCuda)
(cmakeBool "LLAMA_CUDA" useCuda)
(cmakeBool "LLAMA_HIPBLAS" useRocm)
(cmakeBool "LLAMA_METAL" useMetalKit)
(cmakeBool "LLAMA_MPI" useMpi)
@@ -216,14 +236,16 @@
# Should likely use `rocmPackages.clr.gpuTargets`.
"-DAMDGPU_TARGETS=gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102"
]
++ optionals useMetalKit [ (lib.cmakeFeature "CMAKE_C_FLAGS" "-D__ARM_FEATURE_DOTPROD=1") ]
++ optionals useBlas [ (lib.cmakeFeature "LLAMA_BLAS_VENDOR" "OpenBLAS") ];
++ optionals useMetalKit [
(lib.cmakeFeature "CMAKE_C_FLAGS" "-D__ARM_FEATURE_DOTPROD=1")
(cmakeBool "LLAMA_METAL_EMBED_LIBRARY" (!precompileMetalShaders))
];

# TODO(SomeoneSerge): It's better to add proper install targets at the CMake level,
# if they haven't been added yet.
postInstall = ''
mv $out/bin/main $out/bin/llama
mv $out/bin/server $out/bin/llama-server
mv $out/bin/main${executableSuffix} $out/bin/llama${executableSuffix}
mv $out/bin/server${executableSuffix} $out/bin/llama-server${executableSuffix}
mkdir -p $out/include
cp $src/llama.h $out/include/
'';
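
For context on the `precompileMetalShaders` input added above: it defaults to false, and enabling it pulls `xcrunHost` into the build, turns off `LLAMA_METAL_EMBED_LIBRARY`, and marks the derivation with `__noChroot` so the Metal compiler can be reached from outside the Nix sandbox. Below is a minimal consumer-side sketch of turning it on; calling `callPackage` directly on this file and the `pkgs` variable are illustrative assumptions rather than part of this PR, and the build typically needs `sandbox = relaxed` (or the sandbox disabled) in nix.conf on macOS for `__noChroot` to take effect.

# Sketch only: evaluate the package definition above with the new flag enabled.
# `pkgs` is assumed to be an aarch64-darwin nixpkgs instance; additional arguments
# may be needed depending on the nixpkgs revision.
let
  llama-cpp = pkgs.callPackage ./.devops/nix/package.nix { };
in
llama-cpp.override {
  useMetalKit = true;             # Metal backend, the default on aarch64-darwin
  precompileMetalShaders = true;  # build default.metallib via xcrun instead of embedding the Metal source
}

Left at its default, the flag keeps `LLAMA_METAL_EMBED_LIBRARY` enabled and no sandbox escape is needed, so ordinary builds are unaffected.
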
4 changes: 2 additions & 2 deletions .devops/server-cuda.Dockerfile
@@ -20,8 +20,8 @@ COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable cuBLAS
ENV LLAMA_CUBLAS=1
# Enable CUDA
ENV LLAMA_CUDA=1

RUN make
