Don't retry on connection error #80
Conversation
I think one of the reasons for being liberal with the exception handling at this level is that it's not uncommon in MT-Bench runs to see 400s returned randomly in a sea of 200s. That said, the errors you're handling here are not 400s, so this fix seems helpful.
        return response.choices[0].message.content
    except openai.APITimeoutError as ex:
        logger.debug(ex)
    except openai.APIConnectionError as ex:
One error condition we should be able to create is overloading llama-cpp cpu serving with max-workers. That tends to create a service unavailable condition. Would be good to know where that hits in here.
Also, when calling against the openai api, do we know for sure this isn't a condition we hit during normal use? They were retrying up to 16 times previously but I don't know the common errors returned.
I honestly don't know how often 40x errors come back from the API. I will rely on your judgement. (Which, I think, suggests it's a common scenario.)
My primary goal is to make the code that calls this function fail (to report back to CI that it's broken). Without that, e2e runs claim success when in reality evaluation crashed. I think CI status should be indicative of a failure - one should not have to drill down into logs to double-check that a job actually worked.
But: I think this can be achieved while leaving the retry logic for `APIConnectionError` in place. Just need to `raise` instead of `return output`. This will allow us to retry for some time but still fail if an error happened.
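Roughly what I mean, as a sketch (the function body is simplified and `API_MAX_RETRY` stands in for the existing retries constant; this is not the exact diff):

```python
import logging

import openai

logger = logging.getLogger(__name__)
API_MAX_RETRY = 4  # existing retries constant; value shown for illustration


def chat_completion(client, **kwargs):
    """Retry connection errors, but surface the failure on the last attempt."""
    for attempt in range(API_MAX_RETRY):
        try:
            response = client.chat.completions.create(**kwargs)
            return response.choices[0].message.content
        except openai.APIConnectionError as ex:
            logger.debug(ex)
            if attempt == API_MAX_RETRY - 1:
                # raise instead of returning "$ERROR$", so the caller
                # (and ultimately CI) sees that evaluation actually broke
                raise
```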
What do you think @danmcp ?
I am a +1 to failing fast for any conditions which are known to be invalid. No model being served is one of those. As long as we can differentiate that case, let's do it. The two conditions we know of that we need to make sure still work are:
- Getting 400s because of max context length issues. You would get this using granite as the judge model for example.
- llama-cpp cpu serving being overloaded and returning service unavailable.
Neither of those conditions should fail fast and we just need to make sure they don't hit this code block. I would assume the 400 error does not. But I don't know about the service unavailable.
For the context length issues, additional improvements might be:
- Don't retry when you hit the issue the first time, but record the error and move on
- Default to failing fast with this condition and allow for an environment variable to override (see the sketch after this comment)
The case I am not comfortable with yet is failing after the retries run out vs recording the error in the error rate. There is more discussion on this in response to other comments.
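As a sketch of the environment-variable idea (the variable name `EVAL_FAIL_FAST` and the helper below are hypothetical, not something that exists in the repo):

```python
import os

import openai

# Hypothetical toggle: fail fast by default, but let test setups with a
# small-context judge model opt back into "record the error and move on".
FAIL_FAST = os.environ.get("EVAL_FAIL_FAST", "1") == "1"


def handle_context_length_error(ex: openai.BadRequestError) -> str:
    """Handle a 400 (e.g. context length exceeded) without retrying."""
    if FAIL_FAST:
        raise ex
    # Recorded by the caller and folded into the error rate.
    return "$ERROR$"
```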
The 400s are mostly from context length issues. Those shouldn't be an issue with an appropriate judge model. We actually reduced the hardcoded retries constant from 16 to 4 as we don't experience as many failures with a local model vs the public api. We do need a way to tolerate those errors for the purposes of testing still. If we move completely to a fail-fast model, we'll probably need an env var option to enable testing.

I really appreciate trying to improve this logic further and think the effort to pick out specific errors that should fail fast is a good endeavor. The original code was AFAICT designed around managing frequent errors but sacrificing consistency and completeness. Since these are such long-running processes and fit into the middle of even longer-running processes, I am not sure yet we can fail completely when we do get sporadic failures.
@danmcp what if we were to add some handling on the final retry and raise something more descriptive of what actually went wrong? Whether we want to ignore a one-off error during a single run is one thing, but if we hit max retries, and especially if it's all the same error, I think it makes sense to have enough granularity to be able to easily see what went wrong.
An example of an issue that's easy to hit: the context length of your judge model is too small. Something around 5120 is required by mt-bench. In those cases, for the conversations that require more than your judge model supports, it's going to fail consistently on those questions. We then have a choice of failing and raising the exception, or failing and counting it in the error rate. The original logic ignored the error, so we added the error rate to at least account for it. And other errors are possible too, such as the model not giving an answer in the expected format.

If we change the logic to fail fast on errors, it could throw away hours of runtime. If everything is perfectly tuned, this might be fine. But we would need to prove that can be accomplished reliably before we can make such a change. I think some easy wins are:

- Adding an option to not raise exceptions on a particular class of problems. Ex: context length of the model isn't long enough.
@danmcp "Adding an option to not raise exceptions on a particular class of problems. Ex: context length of the model isn't long enough." I think the issue is that there's no specific type to catch for such particular error. (But we can check what we have.) I think in general - 1) the library should raise on error; 2) the caller should decide what to do with the error (log, crash, etc.) We can make callers track exceptions and, at the end of the eval, use the fact that any errors happened as the reason to exit with non-zero value (which should tell the test script it failed.) btw I noticed that openai library apparently already retries and there's a tunable to control how frequently it does. https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/src/openai/__init__.py#L117 |
I am not quite sure how this would work. To be able to make the decision on whether to continue on error, we couldn't raise an exception all the way outside the library. At that point it's too late to continue. We could have some sort of callback mechanism, I guess.

At a high level I think we can all agree it's not great that the code is eating errors. More granular error handling should be able to differentiate the different types of errors and stop eating any errors that indicate something is broken vs hitting iffy configurations vs flakiness in various serving options. But it's not clear that passing the errors on regardless of type solves the general problem.

The purpose of the library is to execute a benchmark. In this case, it's a long-running benchmark and some of the questions being asked by the benchmark may fail to get a valid response. The library currently pushes the responsibility to the user to decide whether the error rate is acceptable. So unless we get to the point where all API errors can be eliminated for long runs, it's not clear yet how we can do much better than isolating the failures which should fail fast (like model not found).

Letting the caller make a decision with a callback, or with exception handling (not sure how it would work still), or with env variable options, or by passing in options to declare what types of error should fail are all options we could support. I am in favor of using such options to be able to fail for things like the max context length issue by default, with an option to override. If there is a suggestion to fail for any exception after retries, we just need to explain how it's going to work for long-running jobs with potentially small error rates as well as for testing with smaller context length models.
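Purely to illustrate the callback option mentioned above (none of these names exist in the library; this is a sketch, not a concrete API proposal):

```python
from typing import Callable, Optional

import openai


def chat_completion(
    client,
    on_error: Optional[Callable[[Exception], bool]] = None,
    **kwargs,
) -> str:
    """Hypothetical: let the caller decide whether a given error is fatal."""
    try:
        response = client.chat.completions.create(**kwargs)
        return response.choices[0].message.content
    except openai.OpenAIError as ex:
        if on_error is not None and on_error(ex):
            raise  # the caller decided this error should stop the whole run
        return "$ERROR$"  # otherwise keep the old behavior and count it


# A caller could then fail fast only on clearly broken setups, for example:
# chat_completion(client, model="...", messages=[...],
#                 on_error=lambda ex: isinstance(ex, openai.NotFoundError))
```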
For reference, here is the list of …
Thanks for this, I think I'm starting to understand the larger context here. A strict exception discipline won't do when you'd like work to continue even after a bad answer is returned.

I think ultimately what's missing is … Another avenue to explore could be: the library uses the file system as the interface to pass generated data to the caller. It could instead return the produced dataset (via …).

Ignoring what can be done in this library, the caller could probably open the written file and check if it contains any error markers. If so, make sure an error return code is returned. (When it's ready.)

I will take some time to think on the issue. But if you have some preference or thoughts on the above alternatives, please let me know.
That's a really good point. I've been thinking about this in terms of the openai api and the error rates we expose from judgment. But we don't expose error rates on gen answers today. Giving an error rate might be an appropriate return value? That would match what we do for judgment.
As of now, my preference is to:
- …
- Can be fixed later; if we do, we should open an issue.
For now, a more limited approach that just makes the tests fail on errors in the output file: #85, while we think through a way to pass errors or an error rate back directly.
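For context, the test-side check could be as simple as something like this (the file layout and marker handling are assumptions, not the actual change in #85):

```python
import json
import sys
from pathlib import Path


def fail_if_errors(answers_file: Path) -> None:
    """Exit non-zero if any generated answer contains the $ERROR$ marker."""
    errors = 0
    with answers_file.open() as fh:
        for line in fh:
            record = json.loads(line)
            if "$ERROR$" in json.dumps(record):
                errors += 1
    if errors:
        print(f"{errors} answer(s) contained errors", file=sys.stderr)
        sys.exit(1)


# fail_if_errors(Path("model_answer/judge.jsonl"))  # hypothetical path
```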
Also splitting out the sleep fix here: #84, since it's not directly related to this issue.
The `python` symlink may be missing on a system; or even point to py2. We should use `python3` to be sure. (It's ok to use `python` inside a virtualenv though.) Signed-off-by: Ihar Hrachyshka <[email protected]>
Before this patch, all errors from openai were handled by retrying up to API_MAX_RETRY times and returning the $ERROR$ message on the last attempt. With this patch, if all attempts result in APIConnectionError, we raise a new EvalError exception. (If at least one of the attempts resulted in a different error, then we return $ERROR$ as usual.)

Also, several errors are not expected to recover with a retry (400-404, 422). This patch makes them return $ERROR$ immediately, without retrying.

Closes: instructlab#77
Signed-off-by: Ihar Hrachyshka <[email protected]>
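A condensed sketch of the control flow described in this commit message (the exception grouping approximates the diff; `OpenAIError` below is a stand-in for the new EvalError subclass, not the real class):

```python
import logging

import openai

logger = logging.getLogger(__name__)
API_MAX_RETRY = 4


class OpenAIError(Exception):
    """Stand-in for the new EvalError subclass added by the patch."""


# Errors that won't get better on retry: return $ERROR$ right away.
NO_RETRY_ERRORS = (
    openai.BadRequestError,           # 400
    openai.AuthenticationError,       # 401
    openai.PermissionDeniedError,     # 403
    openai.NotFoundError,             # 404
    openai.UnprocessableEntityError,  # 422
)


def chat_completion_openai(client, messages, **kwargs) -> str:
    saw_other_error = False
    for _ in range(API_MAX_RETRY):
        try:
            response = client.chat.completions.create(messages=messages, **kwargs)
            return response.choices[0].message.content
        except NO_RETRY_ERRORS as ex:
            logger.debug(ex)
            return "$ERROR$"
        except openai.APITimeoutError as ex:  # subclass of APIConnectionError
            logger.debug(ex)
            saw_other_error = True
        except openai.APIConnectionError as ex:
            logger.debug(ex)
        except openai.OpenAIError as ex:
            logger.debug(ex)
            saw_other_error = True
    # All attempts failed; only raise when every failure was a connection error.
    if saw_other_error:
        return "$ERROR$"
    raise OpenAIError("all attempts failed with connection errors")
```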
Before the patch, we were calculating them on every retry attempt. The function is pure, so there is no good reason to repeat the calculation. This also simplifies the function a bit. Signed-off-by: Ihar Hrachyshka <[email protected]>
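In other words, something along these lines (the message-building step is a made-up stand-in for whatever pure computation the real function repeats):

```python
import openai

API_MAX_RETRY = 4


def query(client, question: str) -> str:
    # Built once: the messages depend only on the question, so there is no
    # reason to recompute them inside the retry loop.
    messages = [{"role": "user", "content": question}]
    for _ in range(API_MAX_RETRY):
        try:
            response = client.chat.completions.create(model="judge", messages=messages)
            return response.choices[0].message.content
        except openai.APIConnectionError:
            continue
    return "$ERROR$"
```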
I'm leaving this as a draft until I have a chance to run it against a real server. (Will do it tomorrow.) Code-wise, I hope this at least reflects the latest discussions and intent, though. (I also included a minor change where I split out ….)
@@ -90,3 +90,15 @@ def __init__(self, tasks_dir) -> None:
        super().__init__()
        self.tasks_dir = tasks_dir
        self.message = f"Invalid Tasks Dir: {tasks_dir}"


class OpenAIError(EvalError):
Since the underlying API could change, maybe this should be more generic. ModelServingAPIError?
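i.e. roughly (a sketch of the rename only; `EvalError` here is a minimal stand-in for the project's real base class shown in the diff above):

```python
class EvalError(Exception):
    """Minimal stand-in for the existing base exception."""

    def __init__(self) -> None:
        super().__init__()
        self.message = "generic eval error"


class ModelServingAPIError(EvalError):
    """Raised when the model-serving endpoint keeps failing, regardless of
    which API (OpenAI-compatible or otherwise) it happens to speak."""

    def __init__(self) -> None:
        super().__init__()
        self.message = "Failed to receive a response from the model serving API"
```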
openai.InternalServerError,  # >=500
# NOTE: Errors listed below may need a revisit: we are not sure if
# it's ever helpful to retry them. Leaving them intact for now.
openai.AuthenticationError,  # 401
I thought we were saying the 40xs would fit into the "is fatal" category.
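For example, the split could look roughly like this (the grouping is illustrative, not the final diff):

```python
import openai

# Worth retrying: the next attempt may genuinely succeed.
RETRYABLE_ERRORS = (
    openai.APIConnectionError,   # includes APITimeoutError
    openai.InternalServerError,  # >=500
    openai.RateLimitError,       # 429
)

# Fatal: a broken setup (bad key, wrong model name, malformed request)
# will not fix itself, so fail fast instead of retrying.
FATAL_ERRORS = (
    openai.AuthenticationError,       # 401
    openai.PermissionDeniedError,     # 403
    openai.NotFoundError,             # 404
    openai.UnprocessableEntityError,  # 422
)
```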
@@ -272,14 +302,38 @@ def chat_completion_openai(
            )
            output = response.choices[0].message.content
            break
        except openai.OpenAIError as e:
Don't we still need to handle this exception as a catch-all, in case they add a new exception type for example?
Closing in favor of #103.
Note: TimeoutError is a subclass of the generic connection error, and we'd like to retry for timeouts.
This patch also rearranges the code a bit, including making sure that it won't sleep at the very last iteration, when we know there won't be another retry attempt.
Closes: #77
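A simplified sketch of the resulting shape (constants and the sleep interval are placeholders, not the actual values from the diff):

```python
import logging
import time

import openai

logger = logging.getLogger(__name__)
API_MAX_RETRY = 4
API_RETRY_SLEEP = 10  # seconds; placeholder value


def chat_completion(client, **kwargs) -> str:
    for attempt in range(API_MAX_RETRY):
        try:
            response = client.chat.completions.create(**kwargs)
            return response.choices[0].message.content
        except openai.APIConnectionError as ex:
            # APITimeoutError is a subclass of APIConnectionError, so plain
            # timeouts land here as well and get retried like any connection error.
            logger.debug(ex)
        # Don't sleep after the final attempt: no retry will follow it.
        if attempt < API_MAX_RETRY - 1:
            time.sleep(API_RETRY_SLEEP)
    return "$ERROR$"
```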