fix: split tensor uma estimate result
Signed-off-by: thxCode <[email protected]>
thxCode committed Aug 23, 2024
1 parent 405be43 commit ff62abf
Showing 2 changed files with 7 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -48,7 +48,7 @@ GGUF Parser helps in reviewing and estimating the usage of a GGUF format model w

- Since v0.7.2, GGUF Parser supports retrieving the model's metadata via a split file,
which is suffixed with something like `-00001-of-00009.gguf`.
- The table result `UMA` indicates the memory usage of Apple MacOS only.
- The table result `UMA` indicates the memory usage of Apple macOS only.
- Since v0.7.0, GGUF Parser is going to support estimating the usage of multiple GPUs.
+ The table result `RAM` means the system memory usage when
running [LLaMA.Cpp](https://github.com/ggerganov/llama.cpp) or a LLaMA.Cpp-like application.
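The split-file naming scheme mentioned above can be recognized with a small helper. The following is a minimal sketch, not GGUF Parser's actual implementation; the `splitInfo` function and its regexp are assumptions of this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// splitSuffix matches the split-model naming scheme noted in the README,
// e.g. "-00001-of-00009.gguf". This pattern is an illustrative assumption,
// not the parser's real detection logic.
var splitSuffix = regexp.MustCompile(`-(\d{5})-of-(\d{5})\.gguf$`)

// splitInfo reports whether name looks like one shard of a split model,
// returning the shard index and the total shard count as written in the name.
func splitInfo(name string) (index, total string, ok bool) {
	m := splitSuffix.FindStringSubmatch(name)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	idx, total, ok := splitInfo("model-00001-of-00009.gguf")
	fmt.Println(idx, total, ok) // 00001 00009 true
}
```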
6 changes: 6 additions & 0 deletions file_estimate.go
@@ -641,6 +641,12 @@ func (e LLaMACppUsageEstimate) SummarizeMemory(mmap bool, nonUMARamFootprint, no
ems.VRAMs[i].UMA = fp + wg + kv + /* cp */ 0
if !e.NoMMap && mmap {
ems.VRAMs[i].UMA -= wg
// NB(thxCode): the weight is added back for the following reasons:
// - UMA is treated as a single device.
// - the RPC server loads all weights and computation buffers itself.
if i > 0 {
ems.VRAMs[i].UMA += wg + cp
}
}

// NonUMA.
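The adjustment in this hunk can be sketched in isolation: with mmap enabled, the primary device's weights are backed by the file mapping and excluded from the UMA estimate, but devices past index 0 (RPC servers) must load their weights and computation buffers themselves, so those are added back. The `device` struct and `estimateUMA` function below are illustrative assumptions, not GGUF Parser's real types.

```go
package main

import "fmt"

// device holds the per-device byte counts used in this sketch
// (hypothetical stand-ins for the estimator's internal fields).
type device struct {
	footprint uint64 // fixed runtime footprint
	weight    uint64 // model weight bytes assigned to this device
	kvCache   uint64 // KV cache bytes
	compute   uint64 // computation buffer bytes
}

// estimateUMA mirrors the logic of the diff above: start from
// footprint + weight + KV cache (compute buffer omitted by default),
// subtract mmapped weights, then add weight and compute back for
// non-primary (RPC) devices, which load everything themselves.
func estimateUMA(devs []device, mmap bool) []uint64 {
	out := make([]uint64, len(devs))
	for i, d := range devs {
		uma := d.footprint + d.weight + d.kvCache
		if mmap {
			uma -= d.weight
			if i > 0 {
				uma += d.weight + d.compute
			}
		}
		out[i] = uma
	}
	return out
}

func main() {
	devs := []device{
		{footprint: 128, weight: 4096, kvCache: 512, compute: 256}, // primary device
		{footprint: 64, weight: 2048, kvCache: 256, compute: 128},  // RPC device
	}
	fmt.Println(estimateUMA(devs, true))  // [640 2496]
	fmt.Println(estimateUMA(devs, false)) // [4736 2368]
}
```

Note how only the first device benefits from the mmap exclusion, which is exactly what the `i > 0` guard in the commit enforces.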
