A question about the n_batch parameter. Correct me if I'm wrong, but:
The llama-cpp-python library is primarily designed for inference and does not support batched inference, meaning it processes one input sequence at a time to generate a single corresponding output.
In the transformer architecture, the attention mechanism requires access to the entire input context to calculate attention scores and generate meaningful outputs. With a causal model, each token attends to all tokens that precede it in the sequence.
Given these points, I'm curious what n_batch is actually doing. I've seen a lot of discussion about it affecting inference speed, but I don't see how that would work given the above: if the library processes one input sequence at a time and needs the full context for each generated token, what role does n_batch play in the inference process?
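For reference, here is roughly how I'm setting the parameter, as a minimal sketch (the model path and the specific values are just placeholders):

```python
from llama_cpp import Llama

# Load the model; n_batch is the parameter I'm asking about.
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,   # context window size
    n_batch=512,  # what exactly does this control?
)

# Single prompt in, single completion out -- no batching of multiple requests.
output = llm("Explain what n_batch does.", max_tokens=64)
print(output["choices"][0]["text"])
```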