Hey @dankolesnikov and @rlancemartin, sorry for the delay! @dankolesnikov, I was thinking the same thing; however, I just checked and we have the model set to always on.
@rlancemartin, have you noticed any patterns that might be consistent with the delay being caused by cold starts? E.g., any sense of how long you need to wait for a request to be an "initial request" instead of a "subsequent request"?
Also, if you could share the model version ID and the prediction ID for a slow response, I'll try to identify a root cause.
We are using the Replicate integration with LangChain:
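For context, the calls go through LangChain's Replicate LLM wrapper, roughly like the sketch below (the model/version string is a placeholder, not our exact ID):

```python
from langchain.llms import Replicate

# Sketch of the LangChain -> Replicate call path; the model/version string
# below is a placeholder, not the exact ID we benchmark with.
llm = Replicate(
    model="replicate/vicuna-13b:<version-id>",
    input={"temperature": 0.75, "max_length": 500},
)

print(llm("What is the capital of France?"))
```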
We are benchmarking latency for question-answering using the LangChain auto-evaluator app:
https://autoevaluator.langchain.com/playground
I run several inference calls and measure the latency of each:
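Roughly, the measurement looks like this (a simplified sketch, not the auto-evaluator's internal code; the model/version string is a placeholder):

```python
import time

from langchain.llms import Replicate

# Simplified timing harness: repeat the same question and log the
# wall-clock latency of each call. Placeholder model/version string.
llm = Replicate(model="replicate/vicuna-13b:<version-id>")

for i in range(5):
    start = time.time()
    llm("What is the capital of France?")
    print(f"call {i}: {time.time() - start:.1f} sec")
```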
We see very high inference latency (e.g., 195 sec) for the initial call. But subsequent calls are much faster (< 10 sec).
This is consistent across runs.
For example, another run today:
With additional logging, I confirmed that latency is indeed from calling the endpoint.
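Concretely, the extra logging timed the endpoint call itself, along the lines of this sketch (using the raw Replicate client directly, so LangChain overhead is excluded; placeholder version string):

```python
import time

import replicate

# Time the raw Replicate endpoint call on its own to confirm the delay
# is on the endpoint side. Placeholder model/version string.
start = time.time()
output = "".join(
    replicate.run(
        "replicate/vicuna-13b:<version-id>",
        input={"prompt": "What is the capital of France?"},
    )
)
print(f"endpoint call: {time.time() - start:.1f} sec")
```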
Why is this?
It hurts the latency assessment of Vicuna-13b relative to other models: