
bug: cortex-nightly generates garbage output on macOS with long-context model #1367

Closed
2 of 6 tasks
nguyenhoangthuan99 opened this issue Oct 1, 2024 · 1 comment · Fixed by #1370
Labels
type: bug Something isn't working

Comments

@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Oct 1, 2024

Cortex version

cortex-nightly

Describe the Bug

Report from @gabrielle-ong

Image

I think it's a separate cause - Insufficient Memory. In that case, can we handle it gracefully?

Cortex.log:

```
 - llama_engine.cc:395
20240930 11:02:44.361397 UTC 8296618 INFO  error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
```

Fuller log for the above (last few lines):

```
20240930 11:02:44.357459 UTC 8296618 INFO  ggml_metal_graph_compute: command buffer 0 failed with status 5
 - llama_engine.cc:395
20240930 11:02:44.357479 UTC 8296618 INFO  error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
 - llama_engine.cc:395
20240930 11:02:44.358381 UTC 8296618 INFO  ggml_metal_graph_compute: command buffer 0 failed with status 5
 - llama_engine.cc:395
20240930 11:02:44.358398 UTC 8296618 INFO  error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
 - llama_engine.cc:395
20240930 11:02:44.359336 UTC 8296618 INFO  ggml_metal_graph_compute: command buffer 0 failed with status 5
 - llama_engine.cc:395
20240930 11:02:44.359354 UTC 8296618 INFO  error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
 - llama_engine.cc:395
20240930 11:02:44.360442 UTC 8296618 INFO  ggml_metal_graph_compute: command buffer 0 failed with status 5
 - llama_engine.cc:395
20240930 11:02:44.360466 UTC 8296618 INFO  error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
 - llama_engine.cc:395
20240930 11:02:44.361376 UTC 8296618 INFO  ggml_metal_graph_compute: command buffer 0 failed with status 5
 - llama_engine.cc:395
20240930 11:02:44.361397 UTC 8296618 INFO  error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
 - llama_engine.cc:395
20240930 11:02:44.362297 UTC 8296618 INFO  ggml_metal_graph_compute: command buffer 0 failed with status 5
 - llama_engine.cc:395
20240930 11:02:44.362316 UTC 8296618 INFO  error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
 - llama_engine.cc:395
20240930 11:02:44.362803 UTC 8296618 INFO  slot released: id_slot: 0, id_task: 3, n_ctx: 131072, n_past: 514, n_system_tokens: 0, n_cache_tokens: 514, truncated: 0 - llama_server_context.cc:1308
20240930 11:02:44.362834 UTC 8296619 INFO  Request 7: End of result - llama_engine.cc:785
20240930 11:02:44.362865 UTC 8296619 INFO  Request 7: Task completed, release it - llama_engine.cc:818
20240930 11:02:44.362870 UTC 8296619 INFO  Request 7: Inference completed - llama_engine.cc:832
20240930 11:03:01.100697 UTC 8296353 INFO  Model status responded - llama_engine.cc:337
20240930 11:03:01.103107 UTC 8296355 INFO  Model status responded - llama_engine.cc:337
20240930 11:03:04.615093 UTC 8296356 INFO  Request 8, model Llama-3.2-3B-Instruct: Generating response for inference request - llama_engine.cc:591
20240930 11:03:04.616089 UTC 8296356 INFO  Request 8: Stop words:[
	"<|eot_id|>"
]
 - llama_engine.cc:608
20240930 11:03:04.616203 UTC 8296356 INFO  Request 8: Streamed, waiting for respone - llama_engine.cc:748
20240930 11:03:04.616312 UTC 8367882 INFO  slot released: id_slot: 0, id_task: 1, n_ctx: 4096, n_past: 50, n_system_tokens: 0, n_cache_tokens: 50, truncated: 0 - llama_server_context.cc:1308
20240930 11:03:04.617392 UTC 8367882 INFO  kv cache rm [p0, end) -  id_slot: 0, task_id: 3, p0: 6 - llama_server_context.cc:1549
20240930 11:03:13.363045 UTC 8296356 FATAL Socket is not connected (errno=57) sockets::shutdownWrite - Socket.cc:110
20240930 11:03:30.904436 UTC 8367882 INFO  slot released: id_slot: 0, id_task: 3, n_ctx: 4096, n_past: 514, n_system_tokens: 0, n_cache_tokens: 514, truncated: 0 - llama_server_context.cc:1308
20240930 11:03:30.904556 UTC 8367883 INFO  Request 8: End of result - llama_engine.cc:785
20240930 11:03:30.905464 UTC 8367883 INFO  Request 8: Task completed, release it - llama_engine.cc:818
20240930 11:03:30.905533 UTC 8367883 INFO  Request 8: Inference completed - llama_engine.cc:832
```

Problem
The root cause of the Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) issue is that the machine doesn't have enough memory to run the model. The default context length of Llama 3.1 is 131072, which requires about 16 GB of RAM/VRAM just for the KV cache. This is a bug in llama.cpp: with the CUDA or CPU backend, it throws an error when the machine doesn't have enough resources to load the model, but with the Metal backend it doesn't throw anything and only logs silently.
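
For reference, a rough back-of-the-envelope check of the KV-cache size (a sketch only; it assumes a Llama 3.1 8B-style configuration with 32 layers, 8 KV heads, head dim 128 and an f16 cache - these numbers are not taken from the log above):

```python
# Rough KV-cache size estimate: 2 tensors (K and V) per layer,
# each holding n_ctx tokens of n_kv_heads * head_dim values in f16 (2 bytes).
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(131072) / 2**30)  # ~16 GiB at the default 131072 context
print(kv_cache_bytes(8192) / 2**30)    # ~1 GiB at an 8192 context
```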

Solution
This issue is related to #1108: we should recommend a proper context length based on the user's hardware. Maybe we should create another ticket to track that. The current solution is to set the max context length of every model to 8192 (roughly an average conversation length; Llama 3 also has an 8192 max context length), which only requires about 1 GB for the KV cache.
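
As a rough illustration of what a hardware-based recommendation could look like (a hypothetical helper, not existing cortex code; model parameters are the same assumptions as in the sketch above):

```python
def recommend_n_ctx(free_bytes, n_layers=32, n_kv_heads=8, head_dim=128,
                    bytes_per_elem=2, cap=8192):
    """Largest power-of-two context (up to `cap`) whose KV cache fits in `free_bytes`."""
    # Bytes of KV cache needed per cached token (K and V, all layers, f16).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    n_ctx = cap
    while n_ctx > 512 and n_ctx * per_token > free_bytes:
        n_ctx //= 2
    return n_ctx

# e.g. with 2 GiB available for the KV cache, this suggests 8192 (which needs ~1 GiB)
print(recommend_n_ctx(2 * 2**30))
```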

Steps to Reproduce

No response

Screenshots / Logs

No response

What is your OS?

  • MacOS
  • Windows
  • Linux

What engine are you running?

  • cortex.llamacpp (default)
  • cortex.tensorrt-llm (Nvidia GPUs)
  • cortex.onnx (NPUs, DirectML)
@nguyenhoangthuan99 nguyenhoangthuan99 added the type: bug Something isn't working label Oct 1, 2024
@nguyenhoangthuan99 nguyenhoangthuan99 self-assigned this Oct 1, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Oct 1, 2024
@nguyenhoangthuan99 nguyenhoangthuan99 moved this from Investigating to In Progress in Menlo Oct 1, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Review + QA in Menlo Oct 1, 2024
@gabrielle-ong
Contributor

Nicely solved, @nguyenhoangthuan99! The above model now runs on Mac without the insufficient memory error.

@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Menlo Oct 1, 2024
@gabrielle-ong gabrielle-ong added this to the v1.0.0 milestone Oct 3, 2024