GPU Runner crash in Ollama when offloading multiple layers #12513

Open
pauleseifert opened this issue Dec 8, 2024 · 5 comments

Comments

@pauleseifert

Hi,

I experience crashes of the gpu runner when offloading multiple layers to the gpu.

time=2024-12-09T00:58:03.646+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server not responding"
time=2024-12-09T00:58:04.348+08:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: bus error (core dumped)"
[GIN] 2024/12/09 - 00:58:04 | 500 |  1.520528721s |      172.16.6.3 | POST     "/api/chat"
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:459 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff

It seems to work for one layer. The error message is not really helpful. The GPU is small (4 GB A310), but so is the model (Llama 3.2 3B, 3.21B params, 1.87 GiB model size), so VRAM shouldn't be the problem.
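For reference, the number of offloaded layers can also be set per request through Ollama's num_gpu option instead of the OLLAMA_NUM_GPU env var (a sketch; the model tag llama3.2:3b and localhost:11434 are assumptions):

    # one offloaded layer: works
    curl http://localhost:11434/api/chat -d '{
      "model": "llama3.2:3b",
      "messages": [{"role": "user", "content": "hello"}],
      "options": { "num_gpu": 1 }
    }'

    # offload all layers (same effect as OLLAMA_NUM_GPU=999): runner crashes
    curl http://localhost:11434/api/chat -d '{
      "model": "llama3.2:3b",
      "messages": [{"role": "user", "content": "hello"}],
      "options": { "num_gpu": 999 }
    }'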

I use Docker on Debian (kernel 6.6.44) with the following docker compose service:

  ipex-llm:
    image: intelanalytics/ipex-llm-inference-cpp-xpu:latest
    container_name: ollama
    restart: unless-stopped
    networks:
       - backend
    command: >
      /bin/bash -c "
        sycl-ls &&
        source ipex-llm-init --gpu --device Arc &&

        bash ./scripts/start-ollama.sh && # run the scripts
        kill $(pgrep -f ollama) && # kill background ollama
        /llm/ollama/ollama serve # run foreground ollama
      "
    devices:
      - /dev/dri
    volumes:
      - /dev/dri:/dev/dri
      - /mnt/fast_storage/docker/ollama:/root/.ollama
    environment:
      DEVICE: Arc
      NEOReadDebugKeys: 1
      OverrideGpuAddressSpace: 48
      ZES_ENABLE_SYSMAN: 1
      OLLAMA_DEBUG: 1
      #OLLAMA_INTEL_GPU: 1
      OLLAMA_NUM_PARALLEL: 1
      OLLAMA_HOST: 0.0.0.0
      OLLAMA_NUM_GPU: 999 # layers to offload -> this is the problem 
      SYCL_CACHE_PERSISTENT: 1
      SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS: 1
      ONEAPI_DEVICE_SELECTOR: level_zero=gpu:0 

Any ideas for further debugging? A few checks I can run are sketched below, followed by the full logs.
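These are the kinds of diagnostics I can gather from the host (a sketch; the container name ollama comes from the compose file above, the rest is standard tooling):

    # SYCL / Level Zero view of the GPU from inside the container
    docker exec -it ollama sycl-ls

    # kernel-side GPU driver messages around the time of the crash
    dmesg | grep -iE 'i915|xe|drm' | tail -n 50

    # live GPU engine activity while the model loads
    intel_gpu_top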

Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero=gpu:0.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A310 LP Graphics 1.6 [1.3.31294]
found oneapi in /opt/intel/oneapi/setvars.sh
 
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
 
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
+++++ Env Variables +++++
    ENABLE_IOMP     = 1
    ENABLE_GPU      = 1
    ENABLE_JEMALLOC = 0
    ENABLE_TCMALLOC = 0
    LIB_DIR    = /usr/local/lib
    BIN_DIR    = bin64
    LLM_DIR    = /usr/local/lib/python3.11/dist-packages/ipex_llm
    LD_PRELOAD             = 
    OMP_NUM_THREADS        = 
    MALLOC_CONF            = 
    USE_XETLA              = OFF
    ENABLE_SDP_FUSION      = 
    SYCL_CACHE_PERSISTENT  = 1
    BIGDL_LLM_XMX_DISABLED = 
    SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1
+++++++++++++++++++++++++
2024/12/09 00:57:46 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-12-09T00:57:46.256+08:00 level=INFO source=images.go:753 msg="total blobs: 42"
time=2024-12-09T00:57:46.257+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
time=2024-12-09T00:57:46.257+08:00 level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6-ipexllm-20241204)"
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-12-09T00:57:46.257+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2506652849/runners
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/libggml.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/libllama.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/libggml.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/libllama.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/libggml.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/libllama.so.gz
time=2024-12-09T00:57:46.258+08:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu/ollama_llama_server
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx/ollama_llama_server
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server
time=2024-12-09T00:57:46.395+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-12-09T00:57:46.395+08:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
[GIN] 2024/12/09 - 00:57:58 | 200 |    1.402222ms |      172.16.6.3 | GET      "/api/tags"
[GIN] 2024/12/09 - 00:57:58 | 200 |      64.402µs |      172.16.6.3 | GET      "/api/version"
[GIN] 2024/12/09 - 00:58:02 | 200 |    1.948256ms |      172.16.6.3 | GET      "/api/tags"
time=2024-12-09T00:58:02.875+08:00 level=INFO source=gpu.go:168 msg="looking for compatible GPUs"
time=2024-12-09T00:58:02.875+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-09T00:58:02.875+08:00 level=DEBUG source=gpu.go:79 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-12-09T00:58:02.875+08:00 level=DEBUG source=gpu.go:382 msg="Searching for GPU library" name=libcuda.so*
time=2024-12-09T00:58:02.876+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-09T00:58:02.876+08:00 level=DEBUG source=gpu.go:405 msg="gpu library search" globs="[libcuda.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcuda.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcuda.so* /opt/intel/oneapi/mpi/2021.11/lib/libcuda.so* /opt/intel/oneapi/mkl/2024.0/lib/libcuda.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcuda.so* /opt/intel/oneapi/ipp/2021.10/lib/libcuda.so* /opt/intel/oneapi/dpl/2022.3/lib/libcuda.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcuda.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcuda.so* /opt/intel/oneapi/dal/2024.0/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/lib/libcuda.so* /opt/intel/oneapi/ccl/2021.11/lib/libcuda.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcuda.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcuda.so* /opt/intel/oneapi/mpi/2021.11/lib/libcuda.so* /opt/intel/oneapi/mkl/2024.0/lib/libcuda.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcuda.so* /opt/intel/oneapi/ipp/2021.10/lib/libcuda.so* /opt/intel/oneapi/dpl/2022.3/lib/libcuda.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcuda.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcuda.so* /opt/intel/oneapi/dal/2024.0/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcuda.so* /opt/intel/oneapi/compiler/2024.0/lib/libcuda.so* /opt/intel/oneapi/ccl/2021.11/lib/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-12-09T00:58:02.880+08:00 level=DEBUG source=gpu.go:439 msg="discovered GPU libraries" paths=[]
time=2024-12-09T00:58:02.880+08:00 level=DEBUG source=gpu.go:382 msg="Searching for GPU library" name=libcudart.so*
time=2024-12-09T00:58:02.880+08:00 level=WARN source=gpu.go:560 msg="unable to locate gpu dependency libraries"
time=2024-12-09T00:58:02.880+08:00 level=DEBUG source=gpu.go:405 msg="gpu library search" globs="[libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so* /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so* /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so* /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so* /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so* /opt/intel/oneapi/dal/2024.0/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so* /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so* /opt/intel/oneapi/tbb/2021.11/lib/intel64/gcc4.8/libcudart.so* /opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/libcudart.so* /opt/intel/oneapi/mpi/2021.11/lib/libcudart.so* /opt/intel/oneapi/mkl/2024.0/lib/libcudart.so* /opt/intel/oneapi/ippcp/2021.9/lib/libcudart.so* /opt/intel/oneapi/ipp/2021.10/lib/libcudart.so* /opt/intel/oneapi/dpl/2022.3/lib/libcudart.so* /opt/intel/oneapi/dnnl/2024.0/lib/libcudart.so* /opt/intel/oneapi/debugger/2024.0/opt/debugger/lib/libcudart.so* /opt/intel/oneapi/dal/2024.0/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/opt/compiler/lib/libcudart.so* /opt/intel/oneapi/compiler/2024.0/lib/libcudart.so* /opt/intel/oneapi/ccl/2021.11/lib/libcudart.so* /tmp/ollama2506652849/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-12-09T00:58:02.882+08:00 level=DEBUG source=gpu.go:439 msg="discovered GPU libraries" paths=[]
time=2024-12-09T00:58:02.882+08:00 level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-12-09T00:58:02.882+08:00 level=INFO source=gpu.go:280 msg="no compatible GPUs were discovered"
time=2024-12-09T00:58:02.882+08:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x83a520 gpu_count=1
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=sched.go:211 msg="cpu mode with first model, loading"
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=server.go:101 msg="system memory" total="62.7 GiB" free="23.5 GiB" free_swap="0 B"
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu/ollama_llama_server
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx/ollama_llama_server
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server
time=2024-12-09T00:58:02.941+08:00 level=DEBUG source=memory.go:101 msg=evaluating library=cpu gpu_count=1 available="[23.5 GiB]"
time=2024-12-09T00:58:02.941+08:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[23.5 GiB]" memory.required.full="2.3 GiB" memory.required.partial="0 B" memory.required.kv="224.0 MiB" memory.required.allocations="[2.3 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="124.0 MiB" memory.graph.partial="570.7 MiB"
time=2024-12-09T00:58:02.942+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu/ollama_llama_server
time=2024-12-09T00:58:02.942+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx/ollama_llama_server
time=2024-12-09T00:58:02.942+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server
time=2024-12-09T00:58:02.943+08:00 level=DEBUG source=gpu.go:531 msg="no filter required for library cpu"
time=2024-12-09T00:58:02.943+08:00 level=INFO source=server.go:395 msg="starting llama server" cmd="/tmp/ollama2506652849/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --verbose --no-mmap --parallel 1 --port 41009"
time=2024-12-09T00:58:02.943+08:00 level=DEBUG source=server.go:412 msg=subprocess environment="[LD_LIBRARY_PATH=/tmp/ollama2506652849/runners/cpu_avx2:/opt/intel/oneapi/tbb/2021.11/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.11/lib:/opt/intel/oneapi/mkl/2024.0/lib:/opt/intel/oneapi/ippcp/2021.9/lib/:/opt/intel/oneapi/ipp/2021.10/lib:/opt/intel/oneapi/dpl/2022.3/lib:/opt/intel/oneapi/dnnl/2024.0/lib:/opt/intel/oneapi/debugger/2024.0/opt/debugger/lib:/opt/intel/oneapi/dal/2024.0/lib:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2024.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2024.0/lib:/opt/intel/oneapi/ccl/2021.11/lib/:/opt/intel/oneapi/tbb/2021.11/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.11/lib:/opt/intel/oneapi/mkl/2024.0/lib:/opt/intel/oneapi/ippcp/2021.9/lib/:/opt/intel/oneapi/ipp/2021.10/lib:/opt/intel/oneapi/dpl/2022.3/lib:/opt/intel/oneapi/dnnl/2024.0/lib:/opt/intel/oneapi/debugger/2024.0/opt/debugger/lib:/opt/intel/oneapi/dal/2024.0/lib:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/host/linux64/lib:/opt/intel/oneapi/compiler/2024.0/opt/compiler/lib:/opt/intel/oneapi/compiler/2024.0/lib:/opt/intel/oneapi/ccl/2021.11/lib/ PATH=/opt/intel/oneapi/vtune/2024.0/bin64:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/bin:/opt/intel/oneapi/mpi/2021.11/bin:/opt/intel/oneapi/mkl/2024.0/bin/:/opt/intel/oneapi/dpcpp-ct/2024.0/bin:/opt/intel/oneapi/dev-utilities/2024.0/bin:/opt/intel/oneapi/debugger/2024.0/opt/debugger/bin:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/bin:/opt/intel/oneapi/compiler/2024.0/bin:/opt/intel/oneapi/advisor/2024.0/bin64:/opt/intel/oneapi/vtune/2024.0/bin64:/opt/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/bin:/opt/intel/oneapi/mpi/2021.11/bin:/opt/intel/oneapi/mkl/2024.0/bin/:/opt/intel/oneapi/dpcpp-ct/2024.0/bin:/opt/intel/oneapi/dev-utilities/2024.0/bin:/opt/intel/oneapi/debugger/2024.0/opt/debugger/bin:/opt/intel/oneapi/compiler/2024.0/opt/oclfpga/bin:/opt/intel/oneapi/compiler/2024.0/bin:/opt/intel/oneapi/advisor/2024.0/bin64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin]"
time=2024-12-09T00:58:02.944+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2024-12-09T00:58:02.944+08:00 level=INFO source=server.go:595 msg="waiting for llama runner to start responding"
time=2024-12-09T00:58:02.944+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="f711d1d" tid="140603310369792" timestamp=1733677082
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140603310369792" timestamp=1733677082 total_threads=12
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="41009" tid="140603310369792" timestamp=1733677082
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-12-09T00:58:03.195+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  1918.36 MiB
llm_load_tensors:  SYCL_Host buffer size =   308.23 MiB
time=2024-12-09T00:58:03.646+08:00 level=INFO source=server.go:629 msg="waiting for server to become available" status="llm server not responding"
time=2024-12-09T00:58:04.348+08:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: bus error (core dumped)"
[GIN] 2024/12/09 - 00:58:04 | 500 |  1.520528721s |      172.16.6.3 | POST     "/api/chat"
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:459 msg="triggering expiration for failed load" model=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:376 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=server.go:1052 msg="stopping llama server"
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:381 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:385 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
time=2024-12-09T00:58:04.348+08:00 level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"

@sgwhat
Contributor

sgwhat commented Dec 9, 2024

Hi @pauleseifert. I think this is an OOM issue; you may try setting OLLAMA_PARALLEL=1 before you start ollama serve to reduce memory usage.
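For example (a sketch; adjust to however you launch the server, e.g. the compose command or environment block):

    export OLLAMA_PARALLEL=1
    /llm/ollama/ollama serve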

@pauleseifert
Author

Hi @sgwhat. I agree, that's what it looks like. However, OLLAMA_NUM_PARALLEL=1 is already set in my docker compose file (see the environment block above). Any other ideas?
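For completeness, a quick way to double-check that the variable actually reaches the container (a sketch; the container name ollama is from the compose file):

    docker exec ollama env | grep -i ollama
    # OLLAMA_NUM_PARALLEL=1 should appear here if the compose env is applied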

@sgwhat
Contributor

sgwhat commented Dec 10, 2024

  1. Sorry for the typo; it should be OLLAMA_PARALLEL=1 instead of OLLAMA_NUM_PARALLEL.
  2. Could you please check and provide your GPU memory usage when running Ollama?

@pauleseifert
Author

pauleseifert commented Dec 16, 2024

This doesn't help; the runner still crashes. intel_gpu_top showed normal behavior for the short moment the runner was visible. There are no other processes running, so all memory should be available.

@sgwhat
Contributor

sgwhat commented Dec 17, 2024

Can you provide the memory usage before and after running ollama run <model>? This can help us resolve the issue.
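For example, something along these lines immediately before and after the run (a sketch; xpu-smi is optional, free -h and intel_gpu_top work as fallbacks):

    free -h              # system RAM
    xpu-smi stats -d 0   # GPU memory usage, if xpu-smi is installed
    intel_gpu_top        # live GPU engine activity during model load
    ollama run <model>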
