
[Bug]: Rate Limit Errors when using with PaperQA #7358

Closed
gurugecl opened this issue Dec 22, 2024 · 19 comments
Labels: bug · mlops · user request

@gurugecl commented Dec 22, 2024

What happened?

Starting recently, I keep getting rate limit errors when using models like Gemini 2.0 Flash, even though I should be below the rate limit based on the number of requests I'm initiating. Previously this worked fine. I am using LiteLLM via PaperQA. There also seems to be an async issue, but that was not previously causing a rate limit error, so I'm not sure whether it's related. I have tried a number of ways to avoid hitting the rate limit, but so far none have worked, so any assistance with this would be greatly appreciated.

https://github.com/Future-House/paper-qa

I've also seen the following message when using gpt-4o, but in that case it still works without issue:

AFC is enabled with max remote calls: 10.

Below is how I am setting up the Settings object, which then throws the rate limit error:

import os

from paperqa import Docs, Settings

settings = Settings(
    llm="gemini/gemini-2.0-flash-exp",
    summary_llm="gemini/gemini-2.0-flash-exp",
    llm_config={
        "model_list": [{
            "model_name": "gemini/gemini-2.0-flash-exp",
            "litellm_params": {
                "model": "gemini/gemini-2.0-flash-exp",
                "api_key": os.environ.get('GEMINI_API_KEY'),
            },
        }]
    },
    summary_llm_config={
        "model_list": [{
            "model_name": "gemini/gemini-2.0-flash-exp",
            "litellm_params": {
                "model": "gemini/gemini-2.0-flash-exp",
                "api_key": os.environ.get('GEMINI_API_KEY'),
            },
        }]
    },
)

max_choices = len(list(docs.docnames))
settings.answer.answer_max_sources = max_choices
settings.answer.evidence_k = relevancy * max_choices

model_response = docs.query(model_input, settings=settings)

Relevant log output

litellm.acompletion(model=gemini/gemini-2.0-flash-exp) Exception litellm.APIConnectionError: <asyncio.locks.Event object at 0x3855b0950 [unset]> is bound to a different event loop
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/litellm/main.py", line 421, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/litellm/llms/vertex_ai_and_google_ai_studio/gemini/vertex_and_google_ai_studio_gemini.py", line 1206, in async_completion
    response = await client.post(api_base, headers=headers, json=request_body)  # type: ignore
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/litellm/llms/custom_httpx/http_handler.py", line 138, in post
    raise e
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/litellm/llms/custom_httpx/http_handler.py", line 100, in post
    response = await self.client.send(req, stream=stream)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpx/_client.py", line 1661, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpx/_client.py", line 1689, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpx/_client.py", line 1763, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpx/_transports/default.py", line 373, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
    raise exc from None
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_async/http11.py", line 143, in handle_async_request
    raise exc
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_async/http11.py", line 113, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_async/http11.py", line 186, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_async/http11.py", line 224, in _receive_event
    data = await self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/httpcore/_backends/anyio.py", line 35, in read
    return await self._stream.receive(max_bytes=max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/anyio/streams/tls.py", line 205, in receive
    data = await self._call_sslobject_method(self._ssl_object.read, max_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/anyio/streams/tls.py", line 147, in _call_sslobject_method
    data = await self.transport_stream.receive()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 1142, in receive
    await self._protocol.read_event.wait()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/asyncio/locks.py", line 210, in wait
    fut = self._get_loop().create_future()
          ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/django/lib/python3.11/asyncio/mixins.py", line 20, in _get_loop
    raise RuntimeError(f'{self!r} is bound to a different event loop')
RuntimeError: <asyncio.locks.Event object at 0x3855b0950 [unset]> is bound to a different event loop

03:21:53 - LiteLLM:INFO: utils.py:2977 - 
LiteLLM completion() model= gemini-2.0-flash-exp; provider = gemini

LiteLLM completion() model= gemini-2.0-flash-exp; provider = gemini
HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-exp:generateContent?key=xxxxxx "HTTP/1.1 429 Too Many Requests"
03:21:53 - LiteLLM Router:INFO: router.py:849 - litellm.acompletion(model=gemini/gemini-2.0-flash-exp) Exception litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}

litellm.acompletion(model=gemini/gemini-2.0-flash-exp) Exception litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}

03:21:53 - LiteLLM:INFO: utils.py:2977 - 
LiteLLM completion() model= gemini-2.0-flash-exp; provider = gemini
HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-exp:generateContent?key=xxxxxxxx "HTTP/1.1 429 Too Many Requests"

LiteLLM completion() model= gemini-2.0-flash-exp; provider = gemini
HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-exp:generateContent?key=xxxxxx "HTTP/1.1 429 Too Many Requests"
03:21:53 - LiteLLM Router:INFO: router.py:849 - litellm.acompletion(model=gemini/gemini-2.0-flash-exp) Exception litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}

litellm.acompletion(model=gemini/gemini-2.0-flash-exp) Exception litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}

03:21:53 - LiteLLM Router:INFO: router.py:849 - litellm.acompletion(model=gemini/gemini-2.0-flash-exp) Exception litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"
  }
}

Are you a ML Ops Team?

Yes

What LiteLLM version are you on ?

1.45.0

Twitter / LinkedIn details

No response

gurugecl added the bug label on Dec 22, 2024
gurugecl changed the title from "[Bug]: Rate Limit Errors with models other than OAI" to "[Bug]: Rate Limit Errors when using with PaperQA" on Dec 22, 2024
@krrishdholakia (Contributor)

Hi @gurugecl,

"error": {
"code": 429,
"message": "Resource has been exhausted (e.g. check quota).",
"status": "RESOURCE_EXHAUSTED"
}

Your error is coming from Google saying your resource is exhausted. This doesn't even look like it's coming from the router, just from the backend LLM API. Closing for now as I don't see a LiteLLM-specific issue. cc @jamesbraza: let me know if I'm missing something here.

krrishdholakia closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 27, 2024
@gurugecl (Author) commented Jan 3, 2025

Yes, but the issue seems to be that the error occurs on the very first request to Gemini, and no amount of exponential backoff or waiting helps, which I believe indicates that something behind the scenes, either in PaperQA or LiteLLM, is exhausting the resource that quickly. #7188 seems to indicate that Gemini 2.0 Flash is supported, and I even tried setting rate limits in the llm_config as below, per the PaperQA docs, but that did not seem to help:

"rate_limit": {
                "gemini/gemini-2.0-flash-exp": "10 per 1 minute"  # Gemini's actual rate limit
            }
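
For completeness, here is roughly how I nested that key (a sketch; the placement of rate_limit alongside model_list follows my reading of the PaperQA docs, so treat it as an assumption rather than a verified fix):

import os
from paperqa import Settings

gemini = "gemini/gemini-2.0-flash-exp"

llm_config = {
    "model_list": [{
        "model_name": gemini,
        "litellm_params": {
            "model": gemini,
            "api_key": os.environ.get("GEMINI_API_KEY"),
        },
    }],
    # Client-side throttle intended to stay under Gemini's free-tier request limit.
    "rate_limit": {gemini: "10 per 1 minute"},
}

settings = Settings(
    llm=gemini,
    summary_llm=gemini,
    llm_config=llm_config,
    summary_llm_config=llm_config,
)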

However, if this is not a LiteLLM issue, I can reach out to PaperQA for further assistance.

@krrishdholakia (Contributor)

@gurugecl, you can validate this with a simple LiteLLM test:

from litellm import completion
import os

os.environ['GEMINI_API_KEY'] = ""  # set your Gemini API key here
response = completion(
    model="gemini/gemini-2.0-flash-exp", 
    messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}]
)
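
Since the traceback above goes through litellm.acompletion (the async path the router uses), the async variant is worth sanity-checking too. A sketch; the prompt is arbitrary and the key placeholder still needs to be filled in:

import asyncio
import os

from litellm import acompletion

os.environ["GEMINI_API_KEY"] = ""  # set your Gemini API key here

async def main():
    # Same check as above, but via the async entry point that shows up in the traceback.
    response = await acompletion(
        model="gemini/gemini-2.0-flash-exp",
        messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())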

cc @jamesbraza: let me know if I'm missing something here.

@gurugecl (Author) commented Jan 4, 2025

Yes, that works fine, as does using Gemini through the API directly. However, the problem arises when I'm using Gemini with PaperQA, which in turn uses LiteLLM, I believe, as I describe in the issue I just opened with them (Future-House/paper-qa#786). I'm using their Docs object because I need to add papers manually, but most of their documentation covers the ask function for querying papers rather than the Docs class, so I'm not sure whether the rate limit can be set the same way. Setting max_concurrent_requests as they suggested hasn't resolved the issue yet either.

@krrishdholakia (Contributor)

@gurugecl, does it work with them when you use the OpenAI API?

@krrishdholakia (Contributor) commented Jan 4, 2025

If so, one way around this is to spin up the CLI proxy:

export GEMINI_API_KEY=<your-key>
litellm --model gemini/gemini-2.0-flash-exp

# RUNNING on http://0.0.0.0:4000

and then point paperqa's OpenAI API base at the proxy:

export OPENAI_API_BASE="http://0.0.0.0:4000"

More docs: https://docs.litellm.ai/docs/#quick-start-proxy---cli
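
A minimal client-side sketch of that setup (assumes the proxy above is running on http://0.0.0.0:4000; the dummy key and the example PDF path are placeholders, and paper-qa's defaults picking up OPENAI_API_BASE is an assumption rather than something verified here):

import os

os.environ["OPENAI_API_BASE"] = "http://0.0.0.0:4000"
os.environ["OPENAI_API_KEY"] = "sk-anything"  # placeholder; adjust if your proxy has auth configured

from paperqa import Docs

docs = Docs()
docs.add("example_paper.pdf")  # hypothetical local PDF
answer = docs.query("What is the main finding?")
print(answer)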

@gurugecl (Author) commented Jan 4, 2025

Yes, it works fine with the OpenAI API, which I believe is used by default. Unfortunately, I couldn't quite get the proxy to work due to additional errors, but I was able to get the implementation working without it by modifying the setup of the Docs object. I'm not sure yet why the original approach worked initially, but thanks a lot for your assistance with this. Really appreciate it!

@krrishdholakia (Contributor)

Nice. Curious, what were the proxy errors?

@krrishdholakia (Contributor)

Bump on this, @gurugecl?

It would be helpful so I know if there's a key issue we might've missed.

@gurugecl (Author) commented Jan 4, 2025

So I'm not sure how much of an issue this is on your end, but after setting everything up as we discussed, I received the following error when trying to add new papers to the PaperQA Docs object:

LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

Failed to load ddu148.pdf: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': {'message': "litellm.UnsupportedParamsError: Setting {'encoding_format': 'base64'} is not supported by gemini. To drop it from the call, set `litellm.drop_params = True`.", 'type': 'None', 'param': None, 'code': '400'}} LiteLLM Retried: 3 times

Loaded 0 documents

Minimal script:

from paperqa import Docs, Settings
import os
from pathlib import Path
import litellm

# Set LiteLLM verbose logging
litellm.set_verbose = True

# Set up environment
os.environ["OPENAI_API_BASE"] = "http://0.0.0.0:4000"

# Initialize Docs
docs = Docs()

# Load PDFs from a directory
pdf_dir = Path("test_pdfs") 
for pdf_file in pdf_dir.glob("*.pdf"):
    print(f"Loading {pdf_file.name}")
    try:
        docs.add(str(pdf_file))
    except Exception as e:
        print(f"Failed to load {pdf_file.name}: {str(e)}")

So I modified the litellm command to run like this:

litellm --model gemini/gemini-2.0-flash-exp --drop_params

However, at that point it just hung when trying to add papers, and that's when I was able to resolve the issue without using the proxy, as I mentioned above, so I didn't attempt to debug the proxy issue beyond that point.

@jamesbraza (Contributor)

It looks like you've misconfigured Settings a bit based on the Pydantic errors, but the more interesting one to me is this one: Failed to load ddu148.pdf: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': {'message': "litellm.UnsupportedParamsError: Setting {'encoding_format': 'base64'} is not supported by gemini.

I am interested in where the base64 part comes from; it's not within paper-qa.

@krrishdholakia (Contributor)

I believe that value is set when calling OpenAI embeddings (on their SDK, I believe, although I can check that).

@krrishdholakia (Contributor)

Nope, their default is float:

(screenshot from the original comment omitted)

--
How do you handle embedding calls? @jamesbraza

@gurugecl (Author) commented Jan 5, 2025

@jamesbraza Yup, sorry, the Settings errors are what I fixed to get it working without the proxy; they're not related to the proxy issue. Yeah, it's the other, base64 error that comes up when using the proxy and adding new papers to the Docs object.

@jamesbraza (Contributor)

-- how do you handle embedding calls? @jamesbraza

We call litellm.aembedding: https://github.com/Future-House/llm-client/blob/v0.0.7/llmclient/embeddings.py#L149-L154


And from Setting {'encoding_format': 'base64'} is not supported by gemini, it seems base64 is not supported by Gemini. Maybe base64 shouldn't be the default for all model providers?
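
One way to check whether the default is the culprit would be to pin the format explicitly on the embedding call. A sketch only: the model name, proxy address, and dummy key are placeholders, and whether LiteLLM forwards encoding_format unchanged is an assumption to verify:

import asyncio

import litellm

async def main():
    # Pin the encoding so the OpenAI SDK's base64 default never reaches a provider
    # that rejects it (Gemini, per the error above).
    response = await litellm.aembedding(
        model="text-embedding-3-small",   # example model routed through the proxy
        input=["paper-qa embedding smoke test"],
        encoding_format="float",
        api_base="http://0.0.0.0:4000",   # the CLI proxy from earlier in the thread
        api_key="sk-anything",            # placeholder if the proxy has no auth configured
    )
    print(len(response.data[0]["embedding"]))

asyncio.run(main())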

@krrishdholakia (Contributor) commented Jan 5, 2025

Maybe base64 shouldn't be the default for all model providers?

Where do you see this value being set? @jamesbraza

In general, yes; I don't think it needs to be unless there's a specific need for it.

@jamesbraza (Contributor)

It looks "base64" gets set within OpenAI's client library during httpx request creation: https://github.com/openai/openai-python/blob/v1.58.1/src/openai/resources/embeddings.py#L217

I wonder if something in @gurugecl's code or LiteLLM's embedding logic is incorrectly causing an OpenAI control flow to be used for a Gemini model. That could explain why {'encoding_format': 'base64'} is being passed to Gemini.

@krrishdholakia (Contributor)

@jamesbraza It was when he tried to call Gemini via the CLI proxy using the OpenAI route in the paperqa code.

@gurugecl (Author) commented Jan 5, 2025

It looks "base64" gets set within OpenAI's client library during httpx request creation: https://github.com/openai/openai-python/blob/v1.58.1/src/openai/resources/embeddings.py#L217

I wonder if something in @gurugecl 's code or LiteLLM embedding is incorrectly causing an OpenAI control flow to be used for a Gemini model. It could explain why {'encoding_format': 'base64'} is being passed to Gemini

Yeah, the code that throws the error consists of just the minimal example below:

#7358 (comment)

plus running the proxy commands I was given in a separate terminal:

#7358 (comment)
