Cortex version
cortex-nightly

Describe the Bug
Report from @gabrielle-ong:

I think it's a separate cause - Insufficient Memory. In that case, can we gracefully handle it?
Cortex.log:
```
20240930 11:02:44.361397 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
```
Fuller log for the above, last few lines:

```
20240930 11:02:44.357459 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.357479 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.358381 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.358398 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.359336 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.359354 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.360442 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.360466 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.361376 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.361397 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.362297 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.362316 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.362803 UTC 8296618 INFO slot released: id_slot: 0, id_task: 3, n_ctx: 131072, n_past: 514, n_system_tokens: 0, n_cache_tokens: 514, truncated: 0 - llama_server_context.cc:1308
20240930 11:02:44.362834 UTC 8296619 INFO Request 7: End of result - llama_engine.cc:785
20240930 11:02:44.362865 UTC 8296619 INFO Request 7: Task completed, release it - llama_engine.cc:818
20240930 11:02:44.362870 UTC 8296619 INFO Request 7: Inference completed - llama_engine.cc:832
20240930 11:03:01.100697 UTC 8296353 INFO Model status responded - llama_engine.cc:337
20240930 11:03:01.103107 UTC 8296355 INFO Model status responded - llama_engine.cc:337
20240930 11:03:04.615093 UTC 8296356 INFO Request 8, model Llama-3.2-3B-Instruct: Generating response for inference request - llama_engine.cc:591
20240930 11:03:04.616089 UTC 8296356 INFO Request 8: Stop words:[
"<|eot_id|>"
] - llama_engine.cc:608
20240930 11:03:04.616203 UTC 8296356 INFO Request 8: Streamed, waiting for respone - llama_engine.cc:748
20240930 11:03:04.616312 UTC 8367882 INFO slot released: id_slot: 0, id_task: 1, n_ctx: 4096, n_past: 50, n_system_tokens: 0, n_cache_tokens: 50, truncated: 0 - llama_server_context.cc:1308
20240930 11:03:04.617392 UTC 8367882 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 3, p0: 6 - llama_server_context.cc:1549
20240930 11:03:13.363045 UTC 8296356 FATAL Socket is not connected (errno=57) sockets::shutdownWrite - Socket.cc:110
20240930 11:03:30.904436 UTC 8367882 INFO slot released: id_slot: 0, id_task: 3, n_ctx: 4096, n_past: 514, n_system_tokens: 0, n_cache_tokens: 514, truncated: 0 - llama_server_context.cc:1308
20240930 11:03:30.904556 UTC 8367883 INFO Request 8: End of result - llama_engine.cc:785
20240930 11:03:30.905464 UTC 8367883 INFO Request 8: Task completed, release it - llama_engine.cc:818
20240930 11:03:30.905533 UTC 8367883 INFO Request 8: Inference completed - llama_engine.cc:832
```
Problem
The root cause of the Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) issue is that the machine doesn't have enough memory to run the model. The default context length of llama3.1 is 131072 tokens, which requires about 16 GB of RAM/VRAM just for the KV cache. This is a llama.cpp bug: with the CUDA or CPU backends, llama.cpp throws an error when the machine doesn't have enough resources to load the model, but with the Metal backend it throws nothing and only logs the failure silently.
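For intuition, here is a back-of-envelope KV-cache estimate. The shape parameters are assumptions matching the Llama 3.1 8B architecture (32 layers, 8 KV heads with GQA, head dim 128, fp16 cache), not values read from the log:

```python
# Rough KV-cache size: 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes/elem.
# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dim 128, fp16 (2 bytes).
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (131072, 8192):
    print(f"n_ctx={n_ctx:>6}: {kv_cache_bytes(n_ctx) / 2**30:.1f} GiB")
# n_ctx=131072: 16.0 GiB  -> the ~16 GB figure above
# n_ctx=  8192:  1.0 GiB  -> the ~1 GB figure above
```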
Solution
This issue is related to #1108: we should recommend a proper context length based on the user's hardware. Maybe we should create another ticket to track that; see the sketch below. The current solution is to set the max context length of every model to 8192 (roughly an average conversation length; llama3 also has an 8192 max context length), which only requires about 1 GB for the KV cache.
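A minimal sketch of the "recommend a context length based on hardware" idea. This is a hypothetical helper, not Cortex's actual API; it assumes psutil for querying free memory and the same Llama-3.1-8B-like shape as above:

```python
import psutil  # assumed dependency for reading available system memory

def recommended_n_ctx(n_layers=32, n_kv_heads=8, head_dim=128,
                      bytes_per_elem=2, budget_fraction=0.5, hard_cap=8192):
    """Pick the largest power-of-two context whose KV cache fits a memory budget."""
    budget = psutil.virtual_memory().available * budget_fraction
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    n_ctx = 512
    # Double the context while the next step still fits the budget and the cap.
    while n_ctx * 2 * bytes_per_token <= budget and n_ctx * 2 <= hard_cap:
        n_ctx *= 2
    return n_ctx

print(recommended_n_ctx())  # e.g. 8192 on a machine with plenty of free RAM
```

The hard cap of 8192 mirrors the current default above; a follow-up ticket could replace it with a per-model limit.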
Steps to Reproduce
No response
Screenshots / Logs
No response
What is your OS?
MacOS
Windows
Linux
What engine are you running?
cortex.llamacpp (default)
cortex.tensorrt-llm (Nvidia GPUs)
cortex.onnx (NPUs, DirectML)