
Bug: cannot load model if btrfs subvolume is in path #658

Open
mounta11n opened this issue Dec 17, 2024 · 0 comments

Comments


mounta11n commented Dec 17, 2024

Contact Details

[email protected]

What happened?

Whenever I try to load/run a model by giving an absolute path to the model – which is stored on a btrfs subvolume – llamafile cannot load the model and instead says that the path is a directory.

This only happens if the btrfs subvolume is part of the given path. For example, this won't work:

llamafile --verbose \
-m /run/media/yazan/NVME-2TB/@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf

leads to:

/run/media/yazan/NVME-2TB/@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf: failed to load model

It also won't work like this:

cd /run/media/yazan/NVME-2TB

llamafile --verbose -m ./@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf

again:

./@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf: failed to load model
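
My guess (not verified) is that it's the @ character in the subvolume name that trips up llamafile's path handling rather than btrfs itself: in the verbose output further down, the warning only mentions /run/media/yazan/NVME-2TB/, so everything from the @ onwards seems to get cut off. A quick way to check that guess would be a plain (non-subvolume) directory with an @ in its name, something like this (hypothetical paths, I haven't run this):

mkdir /tmp/@test

ln -s /run/media/yazan/NVME-2TB/@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf /tmp/@test/gemma-2-9b-Q8.0.gguf

llamafile --verbose -m /tmp/@test/gemma-2-9b-Q8.0.gguf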

But changing directory into the subvolume and giving the relative path from there works fine:

cd /run/media/yazan/NVME-2TB/@data-models

llamafile --verbose -m ./llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf

now I have:

note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
llama_model_loader: loaded meta data with 25 key-value pairs and 464 tensors from ./llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.

#...

INFO [ server_cli] HTTP server listening | hostname="127.0.0.1" port="8080" tid="139674262733904" timestamp=1734423119 url_prefix=""
software: llamafile 0.8.17
model: gemma-2-9b-Q8.0.gguf
mode: RAW TEXT COMPLETION (base model)
compute: Intel Core i9-9900KF CPU @ 3.60GHz (skylake)
server: http://127.0.0.1:8080/

#...

type text to be completed (or /help for help)
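
If the @ really is the culprit, then presumably any path that avoids spelling out the @ component should also work, e.g. through a symlink (again just a guess, not verified):

ln -s /run/media/yazan/NVME-2TB/@data-models ~/data-models

llamafile --verbose -m ~/data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf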

Version

llamafile v0.8.17

What operating system are you seeing the problem on?

Linux

Relevant log output

Here are some specs:

openSUSE Tumbleweed 20241211 x86_64
Linux 6.11.8-1-default
bash 5.2.37

And as mentioned, I use btrfs as my filesystem – both for my secondary data storage and for the OS.

$ sudo btrfs --version
btrfs-progs v6.10.1
-EXPERIMENTAL -INJECT -STATIC +LZO +ZSTD +UDEV +FSVERITY +ZONED CRYPTO=builtin

And here are the llamafile outputs:

llamafile --verbose -m /run/media/yazan/NVME-2TB/@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf

██╗     ██╗      █████╗ ███╗   ███╗ █████╗ ███████╗██╗██╗     ███████╗
██║     ██║     ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║     ██╔════╝
██║     ██║     ███████║██╔████╔██║███████║█████╗  ██║██║     █████╗
██║     ██║     ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝  ██║██║     ██╔══╝
███████╗███████╗██║  ██║██║ ╚═╝ ██║██║  ██║██║     ██║███████╗███████╗
╚══════╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
/run/media/yazan/NVME-2TB/: warning: failed to read last 64kb of file: Is a directory
llama_model_load: error loading model: failed to open /run/media/yazan/NVME-2TB/@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf: Is a directory
llama_load_model_from_file: failed to load model
/run/media/yazan/NVME-2TB/@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf: failed to load model

change to NVME-2TB:

cd /run/media/yazan/NVME-2TB && llamafile --verbose -m ./@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf

██╗     ██╗      █████╗ ███╗   ███╗ █████╗ ███████╗██╗██╗     ███████╗
██║     ██║     ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║     ██╔════╝
██║     ██║     ███████║██╔████╔██║███████║█████╗  ██║██║     █████╗
██║     ██║     ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝  ██║██║     ██╔══╝
███████╗███████╗██║  ██║██║ ╚═╝ ██║██║  ██║██║     ██║███████╗███████╗
╚══════╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
./: warning: failed to read last 64kb of file: Is a directory
llama_model_load: error loading model: failed to open ./@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf: Is a directory
llama_load_model_from_file: failed to load model
./@data-models/llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf: failed to load model

change to subvolume @data-models:

cd /run/media/yazan/NVME-2TB/@data-models && llamafile --verbose -m ./llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf

██╗     ██╗      █████╗ ███╗   ███╗ █████╗ ███████╗██╗██╗     ███████╗
██║     ██║     ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║     ██╔════╝
██║     ██║     ███████║██╔████╔██║███████║█████╗  ██║██║     █████╗
██║     ██║     ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝  ██║██║     ██╔══╝
███████╗███████╗██║  ██║██║ ╚═╝ ██║██║  ██║██║     ██║███████╗███████╗
╚══════╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
llama_model_loader: loaded meta data with 25 key-value pairs and 464 tensors from ./llms/google/gemma-2-9b/gemma-2-9b-Q8.0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q8_0:  295 tensors
llm_load_vocab: special tokens cache size = 6
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 9.15 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors:        CPU buffer size =  9366.12 MiB
....................................................................................
INFO [              server_cli] build info | build=1500 commit="a30b324" tid="139674262733904" timestamp=1734423119
INFO [              server_cli] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139674262733904" timestamp=1734423119 total_threads=16
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  2688.00 MiB
llama_new_context_with_model: KV self size  = 2688.00 MiB, K (f16): 1344.00 MiB, V (f16): 1344.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:        CPU compute buffer size =   507.00 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 1
INFO [              initialize] initializing slots | n_slots=1 tid="139674262733904" timestamp=1734423119
INFO [              initialize] new slot | n_ctx_slot=8192 slot_id=0 tid="139674262733904" timestamp=1734423119
INFO [              server_cli] model loaded | tid="139674262733904" timestamp=1734423119

llama server listening at http://127.0.0.1:8080

INFO [              server_cli] HTTP server listening | hostname="127.0.0.1" port="8080" tid="139674262733904" timestamp=1734423119 url_prefix=""
software: llamafile 0.8.17
model:    gemma-2-9b-Q8.0.gguf
mode:     RAW TEXT COMPLETION (base model)
compute:  Intel Core i9-9900KF CPU @ 3.60GHz (skylake)
server:   http://127.0.0.1:8080/

llama_new_context_with_model: n_ctx      = 8192
VERB [              start_loop] new task may arrive | tid="139674262733904" timestamp=1734423119
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
VERB [              start_loop] callback_all_task_finished | tid="139674262733904" timestamp=1734423119
llama_new_context_with_model: freq_scale = 1
INFO [            update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="139674262733904" timestamp=1734423119
VERB [              start_loop] wait for new task | tid="139674262733904" timestamp=1734423120
llama_kv_cache_init:        CPU KV buffer size =  2688.00 MiB
llama_new_context_with_model: KV self size  = 2688.00 MiB, K (f16): 1344.00 MiB, V (f16): 1344.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:        CPU compute buffer size =   253.50 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 1

end
edit: typos
