

Reset schedule earlier to allow overlap with ggml graph computation on device #6933

Merged · 3 commits · Apr 26, 2024

Conversation

@agray3 (Contributor) commented Apr 26, 2024

Previously, significant CPU memset calls between each token generation were on the critical path. This change performs them earlier, while the CPU is waiting for the previous token to be generated on the device.

Refs #6763

@slaren (Collaborator) commented Apr 26, 2024

The change looks good. I was aware that these memsets are very expensive, and I am working on another change to reduce the amount of memory that needs to be cleared. It may make more sense to make the reset call at the end of llama_decode_internal, however.

@sorasoras commented
Is this the reason for the performance difference between a PCIe x2 P40 and a PCIe x8 P40?

@agray3 (Contributor, Author) commented Apr 26, 2024

> The change looks good. I was aware that these memsets are very expensive, and I am working on another change to reduce the amount of memory that needs to be cleared. It may make more sense to make the reset call at the end of llama_decode_internal, however.

Thanks, I have now moved the reset call as suggested.

@agray3 agray3 marked this pull request as ready for review April 26, 2024 17:26
@agray3 (Contributor, Author) commented Apr 26, 2024

> Is this the reason for the performance difference between a PCIe x2 P40 and a PCIe x8 P40?

This change doesn't affect any PCIe data transfers, only CPU activity. However, if the two systems have different CPUs or CPU memory configurations, that could contribute to any difference.

Review comment on ggml-backend.c (resolved)
@slaren slaren merged commit 928e0b7 into ggerganov:master Apr 26, 2024
23 of 26 checks passed
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 423 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=11154.26ms p(95)=28420.49ms fails=, finish reason: stop=369 truncated=54
  • Prompt processing (pp): avg=123.47tk/s p(95)=544.09tk/s
  • Token generation (tg): avg=23.7tk/s p(95)=36.58tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=ag_early_sched_reset commit=728562bc129ff0ba59ee9e43b4a13288737d76a6

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing over the 10-minute run on Standard_NC4as_T4_v3.]

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
…n device (ggerganov#6933)

* Reset schedule earlier to allow overlap with graph computation on device
3 participants