
High latency on the first inference call #7

Open

rlancemartin opened this issue May 29, 2023 · 3 comments

rlancemartin commented May 29, 2023

We are using the Replicate integration with LangChain:

from langchain.llms import Replicate

llm = Replicate(
    model="replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"temperature": 0.75, "max_length": 3000, "top_p": 0.25},
)

We are benchmarking latency for question-answering using LangChain auto-evaluator app:
https://autoevaluator.langchain.com/playground

I run several inference calls and measure the latency of each (a rough timing sketch follows the list below):

  • Call 1: 195.8 sec
  • Call 2: 7.7 sec
  • Call 3: 11.7 sec
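
For reference, these numbers can be reproduced with a loop like the one below (a minimal sketch; the actual measurements come from the auto-evaluator app, and the question is illustrative):

import time

# `llm` is the Replicate wrapper constructed above.
for i in range(3):
    start = time.time()
    llm("What are the main components of LangChain?")
    print(f"Call {i + 1}: {time.time() - start:.1f} sec")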

We see very high inference latency (e.g., ~195 sec) on the initial call, but subsequent calls are much faster (roughly 10 sec or less).

This is consistent across runs.

For example, another run today:

  • Call 1: 241.556 sec
  • Call 2: 5.951 sec
  • Call 3: 11.295 sec

With additional logging, I confirmed that the latency comes from the endpoint call itself.

Why is this?

It hurts the latency assessment of Vicuna-13b relative to other models:

[image: latency comparison of Vicuna-13b against other models]

@rlancemartin (Author)

@joehoover, any ideas on what may be happening?

@dankolesnikov

Wow, 195.8 sec is massive! @joehoover, could this be a cold start problem?
cc @bfirsh @mattt for visibility.

joehoover commented Jun 2, 2023

Hey @dankolesnikov and @rlancemartin, sorry for the delay! @dankolesnikov, I was thinking the same thing; however, I just checked and we have the model set to always on.

@rlancemartin, have you noticed any patterns that might be consistent with the delay being caused by cold starts? E.g., any sense of how long you need to wait for a request to be an "initial request" instead of a "subsequent request"?

Also, if you could share the model version ID and the prediction ID for a slow response, I'll try to identify a root cause.
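
For reference, one way to capture a prediction ID is to call the model directly with the replicate Python client (this needs REPLICATE_API_TOKEN set; the prompt below is just an example):

import replicate

# Target the same vicuna-13b version used in the LangChain wrapper above.
model = replicate.models.get("replicate/vicuna-13b")
version = model.versions.get("e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e")

prediction = replicate.predictions.create(
    version=version,
    input={"prompt": "What is LangChain?", "temperature": 0.75, "max_length": 3000, "top_p": 0.25},
)
prediction.wait()  # block until the prediction finishes

# This ID is what we need in order to trace a slow request on our side.
print(prediction.id, prediction.status)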
