Don't retry on connection error #80

Closed

Conversation

@booxter (Contributor, author) commented Jul 23, 2024:

Note: APITimeoutError is a subclass of the generic connection error (APIConnectionError), and we'd like to keep retrying on timeouts.

This patch also rearranges the code a bit, including making sure that it won't sleep on the very last iteration, when we know there won't be another retry attempt.

Closes: #77

@booxter force-pushed the dont-retry-on-connection-failure branch 2 times, most recently from 08e0c46 to baae1c3, July 23, 2024 18:16
@booxter (Contributor, author) commented Jul 23, 2024:

@khaledsulayman

@booxter force-pushed the dont-retry-on-connection-failure branch from baae1c3 to 34ce3a9, July 23, 2024 22:34
@booxter requested a review from khaledsulayman, July 23, 2024 22:49
@alimaredia (Contributor) commented:

I think one of the reasons for being liberal with the exception handling at this level is that it's not uncommon in MT-Bench runs to see 400s returned randomly in a sea of 200s.

[Screenshot from 2024-07-25 09-47-54]

That being said, it seems like the errors you're handling are not 400s, so this fix seems helpful.
@danmcp any thoughts here?

Review thread on the diff:

```
return response.choices[0].message.content
except openai.APITimeoutError as ex:
logger.debug(ex)
except openai.APIConnectionError as ex:
```

Member review comment:

One error condition we should be able to create is overloading llama-cpp CPU serving with max-workers. That tends to create a service-unavailable condition. It would be good to know where that hits in here.

Also, when calling against the OpenAI API, do we know for sure this isn't a condition we hit during normal use? They were retrying up to 16 times previously, but I don't know the common errors returned.

@booxter (Contributor, author) replied Jul 26, 2024:

I honestly don't know how often 40x errors return from the API. I will rely on your judgement. (Which, I think, suggests it's a common scenario.)

My primary goal is to make the code that calls this function fail (to report back to CI that it's broken). Without that, e2e runs claim success when in reality evaluation crashed. I think the CI status should be indicative of a failure; one should not have to drill down into logs to double-check that a job actually worked.

But: I think this can be achieved while leaving the retry logic for APIConnectionError in place. We just need to raise instead of returning output. This will allow us to retry for some time but still fail if an error happened.

What do you think @danmcp?
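A minimal sketch of that idea, assuming an openai>=1.x client and the existing API_MAX_RETRY / API_RETRY_SLEEP constants; the function name and the sleep value are placeholders, not the code in this PR:

```python
import logging
import time

import openai

logger = logging.getLogger(__name__)

API_MAX_RETRY = 4     # retry budget (the repo reduced this from 16 per the discussion)
API_RETRY_SLEEP = 10  # placeholder sleep between attempts, in seconds


def chat_completion_or_raise(client, model, messages):
    last_err = None
    for attempt in range(API_MAX_RETRY):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except openai.APIConnectionError as ex:  # also covers APITimeoutError, its subclass
            logger.debug(ex)
            last_err = ex
            # Don't sleep after the final attempt; there is no retry left.
            if attempt < API_MAX_RETRY - 1:
                time.sleep(API_RETRY_SLEEP)
    # Every attempt hit a connection error: raise so the caller (and CI) sees
    # a hard failure instead of a silently broken answer file.
    raise last_err
```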

Member reply:

I am a +1 to failing fast for any conditions which are known to be invalid. No model being served is one of those. As long as we can differentiate that case, let's do it. The two conditions we know of that we need to make sure still work are:

  • Getting 400s because of max context length issues. You would get this using granite as the judge model, for example.
  • llama-cpp CPU serving being overloaded and returning "service unavailable".

Neither of those conditions should fail fast, and we just need to make sure they don't hit this code block. I would assume the 400 error does not, but I don't know about the service unavailable.

For the context length issues, additional improvements might be:

  • Don't retry when you hit the issue the first time, but record the error and move on.
  • Default to failing fast on this condition and allow an environment variable to override.

The case I am not comfortable with yet is failing after the retries run out vs. recording the error in the error rate. There is more discussion on this in response to other comments.

@danmcp (Member) commented Jul 25, 2024:

> I think one of the reasons for being liberal with the exception handling at this level is that it's not uncommon in MT-Bench runs to see 400s returned randomly in a sea of 200s.
> @danmcp any thoughts here?

The 400s are mostly from context length issues. Those shouldn't be an issue with an appropriate judge model. We actually reduced the hardcoded retries constant from 16 to 4, as we don't experience as many failures with a local model vs. the public API. We do still need a way to tolerate those errors for testing purposes. If we move completely to a fail-fast model, we'll probably need an env var option to enable testing.

I really appreciate trying to improve this logic further and think the effort to pick out specific errors that should fail fast is a good endeavor. The original code was, AFAICT, designed around managing frequent errors at the cost of consistency and completeness. Since these are such long-running processes, and they fit into the middle of even longer-running processes, I am not sure yet that we can fail completely when we do get sporadic failures.
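A hypothetical sketch of such an env var gate; the EVAL_TOLERATE_API_ERRORS name and the handler shape are invented for illustration, not an existing instructlab-eval setting:

```python
import logging
import os

import openai

logger = logging.getLogger(__name__)

# Invented toggle: fail fast by default, tolerate errors in test runs.
TOLERATE_API_ERRORS = os.environ.get("EVAL_TOLERATE_API_ERRORS", "0") == "1"


def handle_api_error(ex: openai.OpenAIError, error_log: list) -> str:
    if TOLERATE_API_ERRORS:
        logger.debug(ex)
        error_log.append(ex)  # feed the error rate, keep the run going
        return "$ERROR$"
    raise ex                  # default: surface the failure to the caller
```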

@khaledsulayman (Member) commented:

@danmcp what if we were to add some handling on the final retry and raise something more descriptive of what actually went wrong? Whether we want to ignore a one-off error during a single run is one thing, but if we hit max retries, and especially if it's all the same error, I think it makes sense to have enough granularity to easily see what went wrong.

@danmcp (Member) commented Jul 26, 2024:

> @danmcp what if we were to add some handling on the final retry and raise something more descriptive of what actually went wrong? Whether we want to ignore a one-off error during a single run is one thing, but if we hit max retries, and especially if it's all the same error, I think it makes sense to have enough granularity to easily see what went wrong.

An example of an issue that's easy to hit is a judge model whose context length is too small; something around 5120 is required by MT-Bench. In those cases, for the conversations that require more context than your judge model supports, it's going to fail consistently on those questions. We then have a choice of failing and raising the exception, or failing and counting it in the error rate. The original logic ignored the error, so we added the error rate to at least account for it. And other errors are possible too, such as the model not giving an answer in the expected format. If we change the logic to fail fast on errors, it could throw away hours of runtime. If everything is perfectly tuned, this might be fine. But we would need to prove that can be accomplished reliably before we can make such a change.

I think some easy wins are:

  • Catch specific exceptions, as is being done in this PR, which indicate the system isn't working at all. We just need to make sure the exceptions indicate complete failure and not flakiness.
  • Add an option to not raise exceptions on a particular class of problems, e.g. the context length of the model isn't long enough. Anything that implies you weren't going to get valid results could fit into this bucket. We need something like this for testing purposes, to be able to use smaller models in testing.
  • Add tuning for the number of retries to attempt before giving up.

@booxter (Contributor, author) commented Jul 26, 2024:

@danmcp "Adding an option to not raise exceptions on a particular class of problems. Ex: context length of the model isn't long enough." I think the issue is that there's no specific type to catch for such particular error. (But we can check what we have.)

I think in general: 1) the library should raise on error; 2) the caller should decide what to do with the error (log, crash, etc.). We can make callers track exceptions and, at the end of the eval, use the fact that any errors happened as the reason to exit with a non-zero value (which should tell the test script it failed).

By the way, I noticed that the openai library apparently already retries, and there's a tunable to control how much it does: https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/src/openai/__init__.py#L117
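For reference, a small sketch of that tunable in the openai>=1.x client; the default is 2 retries with backoff, and it can be changed per client or per call (the API key below is a placeholder):

```python
from openai import OpenAI

# Client-wide retry budget for connection errors and retriable HTTP statuses.
client = OpenAI(api_key="sk-placeholder", max_retries=5)

# ...or override per call site, e.g. to disable the built-in retries
# when the eval library wants to own the retry policy itself.
no_retry_client = client.with_options(max_retries=0)
```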


@danmcp (Member) commented Jul 26, 2024:

> I think in general: 1) the library should raise on error; 2) the caller should decide what to do with the error (log, crash, etc.). We can make callers track exceptions and, at the end of the eval, use the fact that any errors happened as the reason to exit with a non-zero value (which should tell the test script it failed).

I am not quite sure how this would work. To be able to make the decision on whether to continue on error, we couldn't raise an exception all the way outside the library; at that point it's too late to continue. We could have some sort of callback mechanism, I guess.

At a high level, I think we can all agree it's not great that the code is eating errors. More granular error handling should be able to differentiate the different types of errors and stop eating any errors that indicate something is broken, vs. hitting iffy configurations, vs. flakiness in various serving options. But it's not clear that passing the errors on regardless of type solves the general problem. The purpose of the library is to execute a benchmark. In this case, it's a long-running benchmark, and some of the questions being asked by the benchmark may fail to get a valid response. The library currently pushes the responsibility to the user to decide whether the error rate is acceptable.

So unless we get to the point where all API errors can be eliminated for long runs, it's not clear yet how we can do much better than isolating the failures which should fail fast (like model not found).

Letting the caller make a decision with a callback, or with exception handling (not sure how it would work still), or with env variable options, or by passing in options declaring what types of errors should fail, are all options we could support. I am in favor of using such options to fail for things like the max context length issue by default, with an option to override.

If there is a suggestion to fail on any exception after retries, we just need to explain how it's going to work for long-running jobs with potentially small error rates, as well as for testing with smaller-context-length models.


@booxter (Contributor, author) commented Jul 26, 2024:

Thanks for this, I think I'm starting to understand the larger context here. A strict exception discipline won't do when you'd like work to continue even after a bad answer is returned.


I think what's ultimately missing is that the gen_answers interface of an evaluator doesn't have any way to communicate the presence of errors in the output data. We could convert the return type of the entrypoint to -> bool, then use this value to inform the caller that the produced data set has issues.

Another avenue to explore could be: the library currently uses the file system as the interface to pass generated data to the caller. It could instead return the produced dataset (via yield?) and let the caller decide if it wants to accept a particular answer. I don't think it's the job of the library to write data onto disk...

Ignoring what can be done in this library, the caller could probably open the written file and check if it contains any error markers. If so, make sure an error return code is returned. (When it's ready.)


I will take some time to think on the issue. But if you have some preference or thoughts on the above alternatives, please let me know.
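One possible shape for that gen_answers change, sketched with hypothetical helpers (ask_fn, write_fn) instead of the actual instructlab-eval internals; returning an error count is the idea being floated here, not the current API:

```python
import logging

import openai

logger = logging.getLogger(__name__)


def gen_answers(ask_fn, write_fn, questions) -> int:
    """Generate answers for all questions; return how many of them errored."""
    errors = 0
    for question in questions:
        try:
            answer = ask_fn(question)
        except openai.APIConnectionError:
            raise                 # nothing is being served: fail fast
        except openai.OpenAIError as ex:
            logger.debug(ex)
            errors += 1           # sporadic failure: record it and move on
            answer = "$ERROR$"
        write_fn(question, answer)
    return errors
```

A caller such as a CI wrapper could then exit non-zero whenever the returned count is greater than zero, or compare it against an acceptable error rate.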

@danmcp (Member) commented Jul 26, 2024:

> I think what's ultimately missing is that the gen_answers interface of an evaluator doesn't have any way to communicate the presence of errors in the output data. We could convert the return type of the entrypoint to -> bool, then use this value to inform the caller that the produced data set has issues.

That's a really good point. I've been thinking about this in terms of the OpenAI API and the error rates we expose from judgment, but we don't expose error rates on gen_answers today. Giving an error rate might be an appropriate return value? That would match what we do for judgment.

> Another avenue to explore could be: the library currently uses the file system as the interface to pass generated data to the caller. It could instead return the produced dataset (via yield?) and let the caller decide if it wants to accept a particular answer. I don't think it's the job of the library to write data onto disk...
>
> Ignoring what can be done in this library, the caller could probably open the written file and check if it contains any error markers. If so, make sure an error return code is returned. (When it's ready.)
>
> I will take some time to think on the issue. But if you have some preference or thoughts on the above alternatives, please let me know.

As of now, my preference is to:

  • Add an error rate (or equivalent) to gen_answers for completeness.
  • Isolate the errors from https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/README.md?plain=1#L367-L376 which should fail fast or fail after some number of retries (a rough split is sketched below).
    Note: I checked, and we are getting <class 'openai.InternalServerError'> Service Unavailable with llama-cpp and CPU serving. So I think your current impl of raising on APIConnectionError is appropriate. I am slightly worried about not having the retries even in that case, though; I assume we would get a similar error on a network blip.
    • 400 - discussed below. Either need to allow, or add an option to allow.
    • 401-404 - fail after retries.
    • 422-500 - include in the error rate.
  • Leave the -1 and "$ERROR$" in place for conditions we can't clearly classify as complete failures.
  • Include your sleep fix.

Can be fixed later (if we do, we should open an issue):

  • Create a way for some error codes to be allowed for testing purposes, namely the 400s, and switch these cases to fail fast by default.
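A rough sketch of the status-code split from the second bullet above, using the exception classes openai-python raises for those statuses; the grouping itself is the proposal under discussion, not settled behavior:

```python
import openai

FAIL_FAST = (
    openai.APIConnectionError,        # nothing serving / network down
)
FAIL_AFTER_RETRIES = (
    openai.AuthenticationError,       # 401
    openai.PermissionDeniedError,     # 403
    openai.NotFoundError,             # 404
)
COUNT_IN_ERROR_RATE = (
    openai.UnprocessableEntityError,  # 422
    openai.InternalServerError,       # >=500, e.g. llama-cpp "Service Unavailable"
)
ALLOW_OR_GATE_BEHIND_OPTION = (
    openai.BadRequestError,           # 400, e.g. judge-model context too short
)
```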

@booxter (Contributor, author) commented Jul 29, 2024:

For now, a more limited approach that just makes the tests fail on errors in the output file: #85, while we think through a way to pass errors or the error rate directly.

@booxter (Contributor, author) commented Jul 29, 2024:

Also splitting out the sleep fix here: #84, since it's not directly related to this issue.

Commit message:

The `python` symlink may be missing on a system, or it may even point to py2. We should use `python3` to be sure. (It's OK to use `python` inside a virtualenv, though.)

Signed-off-by: Ihar Hrachyshka <[email protected]>
@booxter force-pushed the dont-retry-on-connection-failure branch from 34ce3a9 to 97fc099, July 31, 2024 17:57
@booxter marked this pull request as draft, July 31, 2024 17:57
Commit message:

Before the patch, all errors from openai were handled by retrying up to API_MAX_RETRY times and returning the $ERROR$ message on the last attempt.

With this patch, if all attempts result in APIConnectionError, we raise a new EvalError exception. (If at least one of the previous attempts resulted in a different error, then we return $ERROR$ as usual.)

Also, several errors are not expected to recover with a retry (400-404, 422). This patch makes them return $ERROR$ immediately, without retrying.

Closes: instructlab#77

Signed-off-by: Ihar Hrachyshka <[email protected]>
Commit message:

Before the patch, we were calculating the messages on every retry attempt. The function is pure, so there is no good reason to repeat the calculation.

This also simplifies the function a bit.

Signed-off-by: Ihar Hrachyshka <[email protected]>
@booxter (Contributor, author) commented Jul 31, 2024:

I'm leaving this as a draft until I have a chance to run it against a real server (will do it tomorrow). Code-wise, I hope this at least reflects the latest discussions and intent. (I also included a minor change where I split the messages calculation out of the for loop; this is a separate commit to simplify review.)

@booxter force-pushed the dont-retry-on-connection-failure branch from 97fc099 to 12feb99, July 31, 2024 21:16
Review thread on the diff:

```
@@ -90,3 +90,15 @@ def __init__(self, tasks_dir) -> None:
super().__init__()
self.tasks_dir = tasks_dir
self.message = f"Invalid Tasks Dir: {tasks_dir}"


class OpenAIError(EvalError):
```

Member review comment:

Since the underlying API could change, maybe this should be more generic. ModelServingAPIError?

Review thread on the diff:

```
openai.InternalServerError,  # >=500
# NOTE: Errors listed below may need a revisit: we are not sure if
# it's ever helpful to retry them. Leaving them intact for now.
openai.AuthenticationError,  # 401
```

Member review comment:

I thought we were saying the 40xs would fit into the "is fatal" category.

Review thread on the diff:

```
@@ -272,14 +302,38 @@ def chat_completion_openai(
)
output = response.choices[0].message.content
break
except openai.OpenAIError as e:
```

Member review comment:

Don't we still need to handle this exception as a catch-all, in case they add a new exception type, for example?

@nathan-weinberg (Member) commented:

Closing in favor of #103

Linked issue that merging this pull request may close:

  • test_branch_gen_answers does not fail when no model is being served