
High latency on the first inference call #7

Open

rlancemartin opened this issue May 29, 2023 · 3 comments

rlancemartin commented May 29, 2023

We are using the Replicate integration with LangChain:

from langchain.llms import Replicate

llm = Replicate(
    model="replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"temperature": 0.75, "max_length": 3000, "top_p": 0.25},
)

We are benchmarking latency for question-answering using LangChain auto-evaluator app:
https://autoevaluator.langchain.com/playground

I run several inference calls and measure the latency of each (a rough timing sketch follows the list below):

  • Call 1: 195.8 sec
  • Call 2: 7.7 sec
  • Call 3: 11.7 sec
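
For reference, these numbers can be reproduced with a loop like the one below (a minimal sketch; the actual measurements come from the auto-evaluator app, and the question is illustrative):

import time

# `llm` is the Replicate wrapper constructed above.
for i in range(3):
    start = time.time()
    llm("What are the main components of LangChain?")
    print(f"Call {i + 1}: {time.time() - start:.1f} sec")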

We see very high inference latency (e.g., ~195 sec) on the initial call, but subsequent calls are much faster (roughly 10 sec or less).

This is consistent across runs.

For example, another run today:

  • Call 1: 241.556 sec
  • Call 2: 5.951 sec
  • Call 3: 11.295 sec

With additional logging, I confirmed that the latency comes from the endpoint call itself.

Why is this?

It hurts the latency assessment of Vicuna-13b relative to other models:

[image: latency comparison of Vicuna-13b against other models]

@rlancemartin (Author)

@joehoover, any ideas on what may be happening?

@dankolesnikov

Wow, 195.8 sec is massive! @joehoover, could this be a cold start problem?
cc @bfirsh @mattt for visibility.

joehoover commented Jun 2, 2023

Hey @dankolesnikov and @rlancemartin, sorry for the delay! @dankolesnikov, I was thinking the same thing; however, I just checked and we have the model set to always on.

@rlancemartin, have you noticed any patterns that might be consistent with the delay being caused by cold starts? E.g., any sense of how long you need to wait for a request to be an "initial request" instead of a "subsequent request"?

Also, if you could share the model version ID and the prediction ID for a slow response, I'll try to identify a root cause.
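
For reference, one way to capture a prediction ID is to call the model directly with the replicate Python client (this needs REPLICATE_API_TOKEN set; the prompt below is just an example):

import replicate

# Target the same vicuna-13b version used in the LangChain wrapper above.
model = replicate.models.get("replicate/vicuna-13b")
version = model.versions.get("e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e")

prediction = replicate.predictions.create(
    version=version,
    input={"prompt": "What is LangChain?", "temperature": 0.75, "max_length": 3000, "top_p": 0.25},
)
prediction.wait()  # block until the prediction finishes

# This ID is what we need in order to trace a slow request on our side.
print(prediction.id, prediction.status)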
