Cortex version
cortex-nightly

Describe the Bug
Report from @gabrielle-ong:

I think it's a separate cause - Insufficient Memory. In that case, can we gracefully handle it?
Cortex.log:
```
20240930 11:02:44.361397 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
```
Fuller log for the above, last few lines:

```
20240930 11:02:44.357459 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.357479 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.358381 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.358398 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.359336 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.359354 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.360442 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.360466 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.361376 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.361397 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.362297 UTC 8296618 INFO ggml_metal_graph_compute: command buffer 0 failed with status 5 - llama_engine.cc:395
20240930 11:02:44.362316 UTC 8296618 INFO error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) - llama_engine.cc:395
20240930 11:02:44.362803 UTC 8296618 INFO slot released: id_slot: 0, id_task: 3, n_ctx: 131072, n_past: 514, n_system_tokens: 0, n_cache_tokens: 514, truncated: 0 - llama_server_context.cc:1308
20240930 11:02:44.362834 UTC 8296619 INFO Request 7: End of result - llama_engine.cc:785
20240930 11:02:44.362865 UTC 8296619 INFO Request 7: Task completed, release it - llama_engine.cc:818
20240930 11:02:44.362870 UTC 8296619 INFO Request 7: Inference completed - llama_engine.cc:832
20240930 11:03:01.100697 UTC 8296353 INFO Model status responded - llama_engine.cc:337
20240930 11:03:01.103107 UTC 8296355 INFO Model status responded - llama_engine.cc:337
20240930 11:03:04.615093 UTC 8296356 INFO Request 8, model Llama-3.2-3B-Instruct: Generating response for inference request - llama_engine.cc:591
20240930 11:03:04.616089 UTC 8296356 INFO Request 8: Stop words:[
"<|eot_id|>"
] - llama_engine.cc:608
20240930 11:03:04.616203 UTC 8296356 INFO Request 8: Streamed, waiting for respone - llama_engine.cc:748
20240930 11:03:04.616312 UTC 8367882 INFO slot released: id_slot: 0, id_task: 1, n_ctx: 4096, n_past: 50, n_system_tokens: 0, n_cache_tokens: 50, truncated: 0 - llama_server_context.cc:1308
20240930 11:03:04.617392 UTC 8367882 INFO kv cache rm [p0, end) - id_slot: 0, task_id: 3, p0: 6 - llama_server_context.cc:1549
20240930 11:03:13.363045 UTC 8296356 FATAL Socket is not connected (errno=57) sockets::shutdownWrite - Socket.cc:110
20240930 11:03:30.904436 UTC 8367882 INFO slot released: id_slot: 0, id_task: 3, n_ctx: 4096, n_past: 514, n_system_tokens: 0, n_cache_tokens: 514, truncated: 0 - llama_server_context.cc:1308
20240930 11:03:30.904556 UTC 8367883 INFO Request 8: End of result - llama_engine.cc:785
20240930 11:03:30.905464 UTC 8367883 INFO Request 8: Task completed, release it - llama_engine.cc:818
20240930 11:03:30.905533 UTC 8367883 INFO Request 8: Inference completed - llama_engine.cc:832
```
Problem
The root cause of the Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) issue is that the machine doesn't have enough memory to run the model. The default context length of llama3.1 is 131072 tokens, which requires about 16 GB of RAM/VRAM just for the KV cache. This is a llama.cpp bug: with the CUDA or CPU backends, llama.cpp throws an error when the machine doesn't have enough resources to load the model, but with the Metal backend it throws nothing and only logs the failure silently.
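For intuition, here is a back-of-envelope KV-cache estimate. The shape parameters are assumptions matching the Llama 3.1 8B architecture (32 layers, 8 KV heads with GQA, head dim 128, fp16 cache), not values read from the log:

```python
# Rough KV-cache size: 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes/elem.
# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dim 128, fp16 (2 bytes).
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (131072, 8192):
    print(f"n_ctx={n_ctx:>6}: {kv_cache_bytes(n_ctx) / 2**30:.1f} GiB")
# n_ctx=131072: 16.0 GiB  -> the ~16 GB figure above
# n_ctx=  8192:  1.0 GiB  -> the ~1 GB figure above
```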
Solution
This issue is related to #1108: we should recommend a proper context length based on the user's hardware. Maybe we should create another ticket to track that; see the sketch below. The current solution is to set the max context length of every model to 8192 (roughly an average conversation length; llama3 also has an 8192 max context length), which only requires about 1 GB for the KV cache.
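A minimal sketch of the "recommend a context length based on hardware" idea. This is a hypothetical helper, not Cortex's actual API; it assumes psutil for querying free memory and the same Llama-3.1-8B-like shape as above:

```python
import psutil  # assumed dependency for reading available system memory

def recommended_n_ctx(n_layers=32, n_kv_heads=8, head_dim=128,
                      bytes_per_elem=2, budget_fraction=0.5, hard_cap=8192):
    """Pick the largest power-of-two context whose KV cache fits a memory budget."""
    budget = psutil.virtual_memory().available * budget_fraction
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    n_ctx = 512
    # Double the context while the next step still fits the budget and the cap.
    while n_ctx * 2 * bytes_per_token <= budget and n_ctx * 2 <= hard_cap:
        n_ctx *= 2
    return n_ctx

print(recommended_n_ctx())  # e.g. 8192 on a machine with plenty of free RAM
```

The hard cap of 8192 mirrors the current default above; a follow-up ticket could replace it with a per-model limit.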
Steps to Reproduce
No response
Screenshots / Logs
No response
What is your OS?
MacOS
Windows
Linux
What engine are you running?
cortex.llamacpp (default)
cortex.tensorrt-llm (Nvidia GPUs)
cortex.onnx (NPUs, DirectML)