

Reset schedule earlier to allow overlap with ggml graph computation on device #6933

Merged · 3 commits · Apr 26, 2024

Conversation

@agray3 (Contributor) commented Apr 26, 2024

Previously, significant CPU memset calls between each token generation were on the critical path. This change performs them earlier, while the CPU is waiting for the previous token to be generated on the device.

Refs #6763

@slaren (Collaborator) commented Apr 26, 2024

The change looks good. I was aware that these memsets are very expensive, and I am working on another change to reduce the amount of memory that needs to be cleared. It may make more sense to make the reset call at the end of llama_decode_internal, however.

@sorasoras commented
Is this the reason for the performance difference between a PCIe x2 P40 and a PCIe x8 P40?

@agray3 (Contributor, Author) commented Apr 26, 2024

> The change looks good. I was aware that these memsets are very expensive, and I am working on another change to reduce the amount of memory that needs to be cleared. It may make more sense to make the reset call at the end of llama_decode_internal, however.

Thanks, I have now moved the reset call as suggested.

@agray3 agray3 marked this pull request as ready for review April 26, 2024 17:26
@agray3 (Contributor, Author) commented Apr 26, 2024

> Is this the reason for the performance difference between a PCIe x2 P40 and a PCIe x8 P40?

This change doesn't affect any PCIe data transfers, only CPU activity. However, if the two systems have different CPUs or CPU memory configurations, that could contribute to any difference.

Review comment on ggml-backend.c (resolved)
@slaren slaren merged commit 928e0b7 into ggerganov:master Apr 26, 2024
23 of 26 checks passed
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 423 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=11154.26ms p(95)=28420.49ms fails=, finish reason: stop=369 truncated=54
  • Prompt processing (pp): avg=123.47tk/s p(95)=544.09tk/s
  • Token generation (tg): avg=23.7tk/s p(95)=36.58tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=ag_early_sched_reset commit=728562bc129ff0ba59ee9e43b4a13288737d76a6

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing over the 10-minute run on Standard_NC4as_T4_v3.]

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
…n device (ggerganov#6933)

* Reset schedule earlier to allow overlap with graph computation on device
3 participants