b3431 #252

Nexesenex · 2024-07-21T10:09:36Z

No description provided.

* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build

* convert_hf : fix Gemma v1 conversion
* convert_hf : allow renaming tokens, but with a warning
* convert_hf : fix Gemma v1 not setting BOS and EOS tokens

* gguf-py : fix some metadata name extraction edge cases
* convert_lora : use the lora dir for the model card path
* gguf-py : more metadata edge cases fixes
  Multiple finetune versions are now joined together, and the removal of the basename annotation on trailing versions is more robust.
* gguf-py : add more name metadata extraction tests
* convert_lora : fix default filename
  The default filename was previously hardcoded.
* convert_hf : Model.fname_out can no longer be None
* gguf-py : do not use title case for naming convention
  Some models use acronyms in lowercase, which can't be title-cased like other words, so it's best to simply use the same case as in the original model name. Note that the size label still has an uppercased suffix to make it distinguishable from the context size of a finetune.
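The naming rule in the last item can be sketched roughly as follows. This is a minimal illustration, not the gguf-py implementation: `build_name`, its parameters, and the basename-size-finetune ordering are all assumptions made for the example. It keeps the original case of every part and uppercases only the trailing letter of the size label.

```python
def build_name(basename: str, size_label: str, finetune: str) -> str:
    """Join name parts without title-casing them (hypothetical helper).

    Only the size label's trailing suffix letter is uppercased, so that
    "7B" (parameter count) stays visually distinct from a lowercase
    context size such as "32k" inside the finetune part.
    """
    if size_label and size_label[-1].isalpha():
        size_label = size_label[:-1] + size_label[-1].upper()
    # Basename and finetune keep whatever case the original model name used.
    return "-".join(part for part in (basename, size_label, finetune) if part)


print(build_name("mistral", "7b", "instruct-32k"))
```

With these inputs the lowercase `32k` in the finetune part is left alone, while the size label becomes `7B`.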

Changes:
- Move each example into its own function. This makes the code much easier to read and understand.
- Make it easy to run only one test by commenting out function calls in main().
- Make the output easy to parse by indenting the output for each example.
- Add a shebang and the +x bit to make it clear the file is an executable.
- Make the host configurable via --host, with a default of 127.0.0.1:8080.
- Make the code look in the tools list to call the registered tool, instead of hardcoding the returned values. This makes the code more copy-pastable.
- Add error checking, so that the program exits with status 1 if the LLM didn't return the expected values. This is very useful for checking correctness.

Testing:
- Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M, and with Meta-Llama-3-8B-Instruct in F16 and Q5_K_M.
- I did not observe a single failure with Mistral-7B-Instruct-v0.3.
- Llama-3 failed about a third of the time in example_concurrent: it returned only one call instead of 3, even in F16.

Potential follow ups:
- Do not fix the prompt encoding yet. Surprisingly, it mostly works even if the prompt encoding is not optimized for the model.
- Add chained answer and response.

Test only change.
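The structure described under "Changes" can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual example script: `get_weather`, `TOOLS`, `call_tool`, and `example_single` are hypothetical names, and the model's tool call is a canned stand-in rather than a real response from the server at `--host`.

```python
import argparse


def get_weather(city: str) -> str:
    """Hypothetical tool that returns canned data for illustration."""
    return f"sunny in {city}"


# Registered tools, keyed by the name the model is expected to emit.
TOOLS = {"get_weather": get_weather}


def call_tool(name: str, arguments: dict) -> str:
    """Look the tool up in the registry instead of hardcoding its result."""
    if name not in TOOLS:
        raise KeyError(f"model requested unregistered tool: {name}")
    return TOOLS[name](**arguments)


def example_single(host: str) -> bool:
    """One example per function; returns True if the result is as expected."""
    # Assumption: a real example would query the server at `host`.
    # Here the tool call is canned so the sketch runs offline.
    tool_call = {"name": "get_weather", "arguments": {"city": "Paris"}}
    result = call_tool(tool_call["name"], tool_call["arguments"])
    print(f"  {tool_call['name']} -> {result}")  # indented per-example output
    return result == "sunny in Paris"


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", default="127.0.0.1:8080")
    args = parser.parse_args(argv)

    ok = True
    # Comment out individual calls here to run only one example.
    print("example_single:")
    ok &= example_single(args.host)
    return 0 if ok else 1  # non-zero exit status signals a failed check
```

Calling `sys.exit(main())` under an `if __name__ == "__main__":` guard turns this into a script whose exit status reports whether every example returned the expected values.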

JohannesGaessler and others added 4 commits July 20, 2024 22:25

CUDA: MMQ code deduplication + iquant support (#8495)

69c487f

* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build

convert_hf : fix Gemma v1 conversion (#8597)

c69c630

* convert_hf : fix Gemma v1 conversion
* convert_hf : allow renaming tokens, but with a warning
* convert_hf : fix Gemma v1 not setting BOS and EOS tokens

Nexesenex merged commit 3a21988 into Nexesenex:spacestream Jul 21, 2024
31 of 36 checks passed

github-actions bot added the Nvidia GPU, examples, python, and devops labels Jul 21, 2024
