Slowdown on processing prompt upgrading from 1.78 to 1.79 #1266

Open
Tedris opened this issue Dec 14, 2024 · 2 comments

@Tedris

Tedris commented Dec 14, 2024

I ran a couple of benchmarks after noticing a slowdown between the two versions:

Running GeForce 4070 Super 12 GB with 32 GB RAM

1.78

Processing Prompt [BLAS] (16284 / 16284 tokens)
Generating (100 / 100 tokens)
CtxLimit:16384/16384, Amt:100/100, Init:0.07s, Process:21.22s (1.3ms/T = 767.28T/s), Generate:23.24s (232.4ms/T = 4.30T/s), Total:44.46s (2.25T/s)
Benchmark Completed - v1.78 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cublas_Args=['lowvram', '0', 'mmq'] Tensor_Split=None BlasThreads=16 BlasBatchSize=512 FlashAttention=True KvCache=0
Timestamp: 2024-12-14 20:41:03.368064+00:00
Backend: koboldcpp_cublas.dll
Layers: 59
Model: Cydonia-22B-v1.3.i1-IQ4_XS
MaxCtx: 16384
GenAmount: 100
-----
ProcessingTime: 21.223s
ProcessingSpeed: 767.28T/s
GenerationTime: 23.240s
GenerationSpeed: 4.30T/s
TotalTime: 44.463s
Output:  1 1 1 1
-----

1.79.1

Processing Prompt [BLAS] (16284 / 16284 tokens)
Generating (100 / 100 tokens)
[11:31:34] CtxLimit:16384/16384, Amt:100/100, Init:0.06s, Process:27.73s (1.7ms/T = 587.23T/s), Generate:38.02s (380.2ms/T = 2.63T/s), Total:65.75s (1.52T/s)
Benchmark Completed - v1.79.1 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cublas_Args=['lowvram', '0', 'mmq'] Tensor_Split=None BlasThreads=16 BlasBatchSize=512 FlashAttention=True KvCache=0
Timestamp: 2024-12-14 16:31:34.073011+00:00
Backend: koboldcpp_cublas.dll
Layers: 59
Model: Cydonia-22B-v1.3.i1-IQ4_XS
MaxCtx: 16384
GenAmount: 100
-----
ProcessingTime: 27.730s
ProcessingSpeed: 587.23T/s
GenerationTime: 38.024s
GenerationSpeed: 2.63T/s
TotalTime: 65.754s
Output:  1 1 1 1
-----

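Note the roughly 21-second increase in total time (44.46s to 65.75s) going from 1.78 to 1.79.1, with identical flags, model, and context size. Prompt processing dropped from 767.28 T/s to 587.23 T/s, and generation from 4.30 T/s to 2.63 T/s.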
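
For anyone who wants to compare their own runs the same way, here is a minimal sketch in Python that turns the two result blocks into relative changes. The numbers are copied from the benchmark output above; the names `pct_change` and `compare` are just illustrative helpers, not part of KoboldCpp.

```python
# Figures copied from the two "Benchmark Completed" result blocks above.
v178 = {"ProcessingSpeed": 767.28, "GenerationSpeed": 4.30, "TotalTime": 44.463}
v179 = {"ProcessingSpeed": 587.23, "GenerationSpeed": 2.63, "TotalTime": 65.754}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(old: dict, new: dict) -> dict:
    """Percent change for every metric present in `old`, rounded to 0.1."""
    return {key: round(pct_change(old[key], new[key]), 1) for key in old}


if __name__ == "__main__":
    for metric, change in compare(v178, v179).items():
        print(f"{metric}: {change:+.1f}%")
```

On these figures that works out to roughly 23% slower prompt processing, 39% slower generation, and 48% more total time.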

@3750gustavo

I tested here and also found a slowdown on 1.79, but in my case it was minuscule: 1.79 T/s on v1.78 versus 1.76 T/s on v1.79.
IMG_0624

IMG_0618

My system is kinda similar, just with less VRAM: a 3070 Ti with 8 GB VRAM and 32 GB RAM.

@LostRuins
Owner

How about v1.80?
