
feat: retry mechanism #240

Open
wants to merge 2 commits into development
Conversation

whilefoo
Member

Resolves #236

When I was testing this I realized that the OpenAI client already retries requests, so technically we don't need this, but this implementation is more customizable.
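For context, the built-in behaviour referred to here is the SDK's `maxRetries` option; a minimal sketch, with illustrative model name and retry counts rather than this plugin's actual config:

```typescript
import OpenAI from "openai";

// The SDK retries connection errors, 408/409/429 and 5xx responses with exponential backoff.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  maxRetries: 3,
});

// A per-request override is also possible:
const completion = await client.chat.completions.create(
  { model: "gpt-4o", messages: [{ role: "user", content: "ping" }] },
  { maxRetries: 5 }
);
```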

@whilefoo
Member Author

@gentlementlegen

@gentlementlegen
Member

When you say "already retries requests", in which cases? Network failures? The idea was to cover problems like the token size being too big, or the response from the LLM being truncated (sometimes the JSON is malformed, for example). Will test in a bit.

@whilefoo
Member Author

> When you say "already retries requests", in which cases? Network failures? The idea was to cover problems like the token size being too big, or the response from the LLM being truncated (sometimes the JSON is malformed, for example). Will test in a bit.

Yes, network failures. I thought the issue was only about network failures too.

@gentlementlegen
Member

gentlementlegen commented Jan 15, 2025

While some errors are not recoverable, ones like sending more tokens than can be handled can be avoided by splitting the prompt and retrying with smaller prompts until they fit the given model.

Very sorry if that wasn't clear enough, I will update the spec.

Added the following:

> covering cases like the token amount sent being too large, responses having truncated JSON content, network failures
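A rough sketch of the split-and-retry idea described above; the helper and the error detection are assumptions for illustration, not this plugin's actual code:

```typescript
import OpenAI from "openai";

// Hypothetical: evaluate a prompt, and if the model rejects it for being too large,
// split it in half and evaluate the parts independently until they fit.
async function evaluateWithSplitting(client: OpenAI, prompt: string): Promise<string[]> {
  try {
    const res = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    });
    return [res.choices[0].message.content ?? ""];
  } catch (err) {
    // Assumption: a context-length failure surfaces as an APIError mentioning the context length.
    const tooLong =
      err instanceof OpenAI.APIError && /context length|too many tokens/i.test(err.message);
    if (!tooLong || prompt.length < 2) throw err;
    const mid = Math.floor(prompt.length / 2);
    return [
      ...(await evaluateWithSplitting(client, prompt.slice(0, mid))),
      ...(await evaluateWithSplitting(client, prompt.slice(mid))),
    ];
  }
}
```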

@whilefoo
Member Author

Shouldn't the prompt be split into appropriately sized chunks according to the model's token limit before sending it to the LLM? I think brute-forcing is not a good solution.

@gentlementlegen
Member

@whilefoo how do you determine the size? There is no API for this (afaik), and for the same model you can have different limits according to the plan (or tier) you are using. Not sure if there is a better way to handle it.

@whilefoo
Member Author

> @whilefoo how do you determine the size? There is no API for this (afaik), and for the same model you can have different limits according to the plan (or tier) you are using. Not sure if there is a better way to handle it.

You can count tokens using tiktoken, and the model token limit is the same for all tiers; the only difference between tiers is the rate limit (tokens per minute), so you just have to wait until the rate limit clears.
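For illustration, counting tokens with js-tiktoken before calling the API; a minimal sketch assuming the installed release already knows about gpt-4o:

```typescript
import { encodingForModel } from "js-tiktoken";

// Assumption: this js-tiktoken release maps "gpt-4o" to its o200k_base encoding.
const encoder = encodingForModel("gpt-4o");

export function countTokens(text: string): number {
  return encoder.encode(text).length;
}

// Compare the count against the model's context window (and the tier's TPM) before sending.
```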

@gentlementlegen
Member

gentlementlegen commented Jan 16, 2025

Maybe I don't understand, but js-tiktoken can count tokens; it cannot tell you that the model gpt-4o is limited to 128000, can it? I know that at first we were limited to half of this and asked OpenAI to increase the token limit, which is now upgraded to 128000; that's why I mention this.

@whilefoo
Member Author

I didn't know that OpenAI also limits the context window, as I couldn't find any information about that on the internet.

My understanding is that the 128k context window is the same for everyone, but if you're on tier 1 you only have 30k TPM, so you can't use the whole context window. If my understanding is correct, then even if you split prompts into 30k-token chunks, you will still get rate limited if you send those requests within the same minute?

@gentlementlegen
Member

I believe you are correct. Would this be handled by the built-in retry, or should we add a delay between API calls?

I think the biggest issue the retry can cover is malformed JSON; there have been plenty of times we had to restart the whole comments evaluation because one output was incorrect and broke the whole run.
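A minimal sketch of that retry for the malformed-JSON case; the `askModel` wrapper is hypothetical:

```typescript
// Hypothetical: `askModel` wraps the chat-completion call and returns the raw string output.
async function getJsonWithRetry<T>(askModel: () => Promise<string>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await askModel();
    try {
      // Truncated or malformed JSON throws here and triggers another attempt
      // instead of aborting the whole comments evaluation.
      return JSON.parse(raw) as T;
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```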

@whilefoo
Member Author

whilefoo commented Jan 16, 2025

> I believe you are correct. Would this be handled by the built-in retry, or should we add a delay between API calls?

I think they don't add any delays because their code is auto-generated from the OpenAPI spec.
We could make a dummy request and read the rate limits from the response headers.
Basically, we know the model's context window either by hard-coding it for the most used models or from the config. We can technically get the rate limits, so together with the context window we can calculate how many chunks we need in advance and then just retry until it succeeds.
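As a sketch of that dummy-request idea: OpenAI returns its documented `x-ratelimit-*` headers on every response, which openai-node v4 can expose via `withResponse()`; the probe prompt below is purely illustrative:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// A tiny probe request: we only care about the response headers, not the completion.
const { response } = await client.chat.completions
  .create({ model: "gpt-4o", messages: [{ role: "user", content: "ping" }], max_tokens: 1 })
  .withResponse();

const tokensPerMinute = Number(response.headers.get("x-ratelimit-limit-tokens"));
const remainingTokens = Number(response.headers.get("x-ratelimit-remaining-tokens"));

// With the context window (hard-coded or from config) plus TPM, the number of chunks
// can be computed in advance, then each chunk retried until it succeeds.
console.log({ tokensPerMinute, remainingTokens });
```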

> I think the biggest issue the retry can cover is malformed JSON; there have been plenty of times we had to restart the whole comments evaluation because one output was incorrect and broke the whole run.

I think this can be solved by using structured outputs, but that is OpenAI-specific, so if we plan to use OpenRouter then it still makes sense to implement the retry.
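For comparison, a sketch of the structured-outputs route using openai-node's zod helper (the beta parse API at the time of writing); the schema is illustrative, not the plugin's real output shape:

```typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

// Illustrative schema; the real one would mirror the comment-relevance output.
const RelevanceSchema = z.object({
  scores: z.array(z.object({ commentId: z.string(), relevance: z.number() })),
});

const client = new OpenAI();
const completion = await client.beta.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [{ role: "user", content: "Score the relevance of these comments ..." }],
  response_format: zodResponseFormat(RelevanceSchema, "relevance"),
});

// `parsed` is schema-validated, so truncated or malformed JSON is rejected by the SDK
// rather than silently breaking the evaluation. OpenRouter offers no such guarantee,
// which is why the generic retry still makes sense.
const result = completion.choices[0].message.parsed;
console.log(result?.scores);
```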


@whilefoo, this task has been idle for a while. Please provide an update.

@whilefoo
Member Author

Now that #225 is merged, I will continue with this.


@whilefoo, this task has been idle for a while. Please provide an update.

Successfully merging this pull request may close these issues.

Implement retry mechanism for LLM failures