Don't retry on connection error #80

Closed

Conversation

@booxter (Contributor, author) commented Jul 23, 2024:

Note: APITimeoutError is a subclass of the generic connection error (APIConnectionError), and we'd like to keep retrying on timeouts.

This patch also rearranges the code a bit, including making sure that it won't sleep on the very last iteration, when we know there won't be another retry attempt.

Closes: #77

@booxter force-pushed the dont-retry-on-connection-failure branch 2 times, most recently from 08e0c46 to baae1c3, July 23, 2024 18:16
@booxter (Contributor, author) commented Jul 23, 2024:

@khaledsulayman

@booxter force-pushed the dont-retry-on-connection-failure branch from baae1c3 to 34ce3a9, July 23, 2024 22:34
@booxter requested a review from khaledsulayman, July 23, 2024 22:49
@alimaredia (Contributor) commented:

I think one of the reasons for being liberal with the exception handling at this level is that it's not uncommon in MT-Bench runs to see 400s returned randomly in a sea of 200s.

[Screenshot from 2024-07-25 09-47-54]

That being said, it seems like the errors you're handling are not 400s, so this fix seems helpful.
@danmcp any thoughts here?

Review thread on the diff:

```
return response.choices[0].message.content
except openai.APITimeoutError as ex:
logger.debug(ex)
except openai.APIConnectionError as ex:
```

Member review comment:

One error condition we should be able to create is overloading llama-cpp CPU serving with max-workers. That tends to create a service-unavailable condition. It would be good to know where that hits in here.

Also, when calling against the OpenAI API, do we know for sure this isn't a condition we hit during normal use? They were retrying up to 16 times previously, but I don't know the common errors returned.

@booxter (Contributor, author) replied Jul 26, 2024:

I honestly don't know how often 40x errors return from the API. I will rely on your judgement. (Which, I think, suggests it's a common scenario.)

My primary goal is to make the code that calls this function fail (to report back to CI that it's broken). Without that, e2e runs claim success when in reality evaluation crashed. I think the CI status should be indicative of a failure; one should not have to drill down into logs to double-check that a job actually worked.

But: I think this can be achieved while leaving the retry logic for APIConnectionError in place. We just need to raise instead of returning output. This will allow us to retry for some time but still fail if an error happened.

What do you think @danmcp?
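A minimal sketch of that idea, assuming an openai>=1.x client and the existing API_MAX_RETRY / API_RETRY_SLEEP constants; the function name and the sleep value are placeholders, not the code in this PR:

```python
import logging
import time

import openai

logger = logging.getLogger(__name__)

API_MAX_RETRY = 4     # retry budget (the repo reduced this from 16 per the discussion)
API_RETRY_SLEEP = 10  # placeholder sleep between attempts, in seconds


def chat_completion_or_raise(client, model, messages):
    last_err = None
    for attempt in range(API_MAX_RETRY):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except openai.APIConnectionError as ex:  # also covers APITimeoutError, its subclass
            logger.debug(ex)
            last_err = ex
            # Don't sleep after the final attempt; there is no retry left.
            if attempt < API_MAX_RETRY - 1:
                time.sleep(API_RETRY_SLEEP)
    # Every attempt hit a connection error: raise so the caller (and CI) sees
    # a hard failure instead of a silently broken answer file.
    raise last_err
```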

Member reply:

I am a +1 to failing fast for any conditions which are known to be invalid. No model being served is one of those. As long as we can differentiate that case, let's do it. The two conditions we know of that we need to make sure still work are:

  • Getting 400s because of max context length issues. You would get this using granite as the judge model, for example.
  • llama-cpp CPU serving being overloaded and returning "service unavailable".

Neither of those conditions should fail fast, and we just need to make sure they don't hit this code block. I would assume the 400 error does not, but I don't know about the service unavailable.

For the context length issues, additional improvements might be:

  • Don't retry when you hit the issue the first time, but record the error and move on.
  • Default to failing fast on this condition and allow an environment variable to override.

The case I am not comfortable with yet is failing after the retries run out vs. recording the error in the error rate. There is more discussion on this in response to other comments.

@danmcp (Member) commented Jul 25, 2024:

> I think one of the reasons for being liberal with the exception handling at this level is that it's not uncommon in MT-Bench runs to see 400s returned randomly in a sea of 200s.
> @danmcp any thoughts here?

The 400s are mostly from context length issues. Those shouldn't be an issue with an appropriate judge model. We actually reduced the hardcoded retries constant from 16 to 4, as we don't experience as many failures with a local model vs. the public API. We do still need a way to tolerate those errors for testing purposes. If we move completely to a fail-fast model, we'll probably need an env var option to enable testing.

I really appreciate trying to improve this logic further and think the effort to pick out specific errors that should fail fast is a good endeavor. The original code was, AFAICT, designed around managing frequent errors at the cost of consistency and completeness. Since these are such long-running processes, and they fit into the middle of even longer-running processes, I am not sure yet that we can fail completely when we do get sporadic failures.
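A hypothetical sketch of such an env var gate; the EVAL_TOLERATE_API_ERRORS name and the handler shape are invented for illustration, not an existing instructlab-eval setting:

```python
import logging
import os

import openai

logger = logging.getLogger(__name__)

# Invented toggle: fail fast by default, tolerate errors in test runs.
TOLERATE_API_ERRORS = os.environ.get("EVAL_TOLERATE_API_ERRORS", "0") == "1"


def handle_api_error(ex: openai.OpenAIError, error_log: list) -> str:
    if TOLERATE_API_ERRORS:
        logger.debug(ex)
        error_log.append(ex)  # feed the error rate, keep the run going
        return "$ERROR$"
    raise ex                  # default: surface the failure to the caller
```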

@khaledsulayman (Member) commented:

@danmcp what if we were to add some handling on the final retry and raise something more descriptive of what actually went wrong? Whether we want to ignore a one-off error during a single run is one thing, but if we hit max retries, and especially if it's all the same error, I think it makes sense to have enough granularity to easily see what went wrong.

@danmcp (Member) commented Jul 26, 2024:

> @danmcp what if we were to add some handling on the final retry and raise something more descriptive of what actually went wrong? Whether we want to ignore a one-off error during a single run is one thing, but if we hit max retries, and especially if it's all the same error, I think it makes sense to have enough granularity to easily see what went wrong.

An example of an issue that's easy to hit is a judge model whose context length is too small; something around 5120 is required by MT-Bench. In those cases, for the conversations that require more context than your judge model supports, it's going to fail consistently on those questions. We then have a choice of failing and raising the exception, or failing and counting it in the error rate. The original logic ignored the error, so we added the error rate to at least account for it. And other errors are possible too, such as the model not giving an answer in the expected format. If we change the logic to fail fast on errors, it could throw away hours of runtime. If everything is perfectly tuned, this might be fine. But we would need to prove that can be accomplished reliably before we can make such a change.

I think some easy wins are:

  • Catch specific exceptions, as is being done in this PR, which indicate the system isn't working at all. We just need to make sure the exceptions indicate complete failure and not flakiness.
  • Add an option to not raise exceptions on a particular class of problems, e.g. the context length of the model isn't long enough. Anything that implies you weren't going to get valid results could fit into this bucket. We need something like this for testing purposes, to be able to use smaller models in testing.
  • Add tuning for the number of retries to attempt before giving up.

@booxter (Contributor, author) commented Jul 26, 2024:

@danmcp "Adding an option to not raise exceptions on a particular class of problems. Ex: context length of the model isn't long enough." I think the issue is that there's no specific type to catch for such particular error. (But we can check what we have.)

I think in general: 1) the library should raise on error; 2) the caller should decide what to do with the error (log, crash, etc.). We can make callers track exceptions and, at the end of the eval, use the fact that any errors happened as the reason to exit with a non-zero value (which should tell the test script it failed).

By the way, I noticed that the openai library apparently already retries, and there's a tunable to control how much it does: https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/src/openai/__init__.py#L117
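For reference, a small sketch of that tunable in the openai>=1.x client; the default is 2 retries with backoff, and it can be changed per client or per call (the API key below is a placeholder):

```python
from openai import OpenAI

# Client-wide retry budget for connection errors and retriable HTTP statuses.
client = OpenAI(api_key="sk-placeholder", max_retries=5)

# ...or override per call site, e.g. to disable the built-in retries
# when the eval library wants to own the retry policy itself.
no_retry_client = client.with_options(max_retries=0)
```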


@danmcp (Member) commented Jul 26, 2024:

> I think in general: 1) the library should raise on error; 2) the caller should decide what to do with the error (log, crash, etc.). We can make callers track exceptions and, at the end of the eval, use the fact that any errors happened as the reason to exit with a non-zero value (which should tell the test script it failed).

I am not quite sure how this would work. To be able to make the decision on whether to continue on error, we couldn't raise an exception all the way outside the library; at that point it's too late to continue. We could have some sort of callback mechanism, I guess.

At a high level, I think we can all agree it's not great that the code is eating errors. More granular error handling should be able to differentiate the different types of errors and stop eating any errors that indicate something is broken, vs. hitting iffy configurations, vs. flakiness in various serving options. But it's not clear that passing the errors on regardless of type solves the general problem. The purpose of the library is to execute a benchmark. In this case, it's a long-running benchmark, and some of the questions being asked by the benchmark may fail to get a valid response. The library currently pushes the responsibility to the user to decide whether the error rate is acceptable.

So unless we get to the point where all API errors can be eliminated for long runs, it's not clear yet how we can do much better than isolating the failures which should fail fast (like model not found).

Letting the caller make a decision with a callback, or with exception handling (not sure how it would work still), or with env variable options, or by passing in options declaring what types of errors should fail, are all options we could support. I am in favor of using such options to fail for things like the max context length issue by default, with an option to override.

If there is a suggestion to fail on any exception after retries, we just need to explain how it's going to work for long-running jobs with potentially small error rates, as well as for testing with smaller-context-length models.


@booxter (Contributor, author) commented Jul 26, 2024:

Thanks for this, I think I'm starting to understand the larger context here. A strict exception discipline won't do when you'd like work to continue even after a bad answer is returned.


I think what's ultimately missing is that the gen_answers interface of an evaluator doesn't have any way to communicate the presence of errors in the output data. We could convert the return type of the entrypoint to -> bool, then use this value to inform the caller that the produced data set has issues.

Another avenue to explore could be: the library currently uses the file system as the interface to pass generated data to the caller. It could instead return the produced dataset (via yield?) and let the caller decide if it wants to accept a particular answer. I don't think it's the job of the library to write data onto disk...

Ignoring what can be done in this library, the caller could probably open the written file and check if it contains any error markers. If so, make sure an error return code is returned. (When it's ready.)


I will take some time to think on the issue. But if you have some preference or thoughts on the above alternatives, please let me know.
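One possible shape for that gen_answers change, sketched with hypothetical helpers (ask_fn, write_fn) instead of the actual instructlab-eval internals; returning an error count is the idea being floated here, not the current API:

```python
import logging

import openai

logger = logging.getLogger(__name__)


def gen_answers(ask_fn, write_fn, questions) -> int:
    """Generate answers for all questions; return how many of them errored."""
    errors = 0
    for question in questions:
        try:
            answer = ask_fn(question)
        except openai.APIConnectionError:
            raise                 # nothing is being served: fail fast
        except openai.OpenAIError as ex:
            logger.debug(ex)
            errors += 1           # sporadic failure: record it and move on
            answer = "$ERROR$"
        write_fn(question, answer)
    return errors
```

A caller such as a CI wrapper could then exit non-zero whenever the returned count is greater than zero, or compare it against an acceptable error rate.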

@danmcp (Member) commented Jul 26, 2024:

> I think what's ultimately missing is that the gen_answers interface of an evaluator doesn't have any way to communicate the presence of errors in the output data. We could convert the return type of the entrypoint to -> bool, then use this value to inform the caller that the produced data set has issues.

That's a really good point. I've been thinking about this in terms of the OpenAI API and the error rates we expose from judgment, but we don't expose error rates on gen_answers today. Giving an error rate might be an appropriate return value? That would match what we do for judgment.

> Another avenue to explore could be: the library currently uses the file system as the interface to pass generated data to the caller. It could instead return the produced dataset (via yield?) and let the caller decide if it wants to accept a particular answer. I don't think it's the job of the library to write data onto disk...
>
> Ignoring what can be done in this library, the caller could probably open the written file and check if it contains any error markers. If so, make sure an error return code is returned. (When it's ready.)
>
> I will take some time to think on the issue. But if you have some preference or thoughts on the above alternatives, please let me know.

As of now, my preference is to:

  • Add an error rate (or equivalent) to gen_answers for completeness.
  • Isolate the errors from https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/README.md?plain=1#L367-L376 which should fail fast or fail after some number of retries (a rough split is sketched below).
    Note: I checked, and we are getting <class 'openai.InternalServerError'> Service Unavailable with llama-cpp and CPU serving. So I think your current impl of raising on APIConnectionError is appropriate. I am slightly worried about not having the retries even in that case, though; I assume we would get a similar error on a network blip.
    • 400 - discussed below. Either need to allow, or add an option to allow.
    • 401-404 - fail after retries.
    • 422-500 - include in the error rate.
  • Leave the -1 and "$ERROR$" in place for conditions we can't clearly classify as complete failures.
  • Include your sleep fix.

Can be fixed later (if we do, we should open an issue):

  • Create a way for some error codes to be allowed for testing purposes, namely the 400s, and switch these cases to fail fast by default.
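A rough sketch of the status-code split from the second bullet above, using the exception classes openai-python raises for those statuses; the grouping itself is the proposal under discussion, not settled behavior:

```python
import openai

FAIL_FAST = (
    openai.APIConnectionError,        # nothing serving / network down
)
FAIL_AFTER_RETRIES = (
    openai.AuthenticationError,       # 401
    openai.PermissionDeniedError,     # 403
    openai.NotFoundError,             # 404
)
COUNT_IN_ERROR_RATE = (
    openai.UnprocessableEntityError,  # 422
    openai.InternalServerError,       # >=500, e.g. llama-cpp "Service Unavailable"
)
ALLOW_OR_GATE_BEHIND_OPTION = (
    openai.BadRequestError,           # 400, e.g. judge-model context too short
)
```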

@booxter (Contributor, author) commented Jul 29, 2024:

For now, a more limited approach that just makes the tests fail on errors in the output file: #85, while we think through a way to pass errors or the error rate directly.

@booxter (Contributor, author) commented Jul 29, 2024:

Also splitting out the sleep fix here: #84, since it's not directly related to this issue.

Commit message:

The `python` symlink may be missing on a system, or it may even point to py2. We should use `python3` to be sure. (It's OK to use `python` inside a virtualenv, though.)

Signed-off-by: Ihar Hrachyshka <[email protected]>
@booxter force-pushed the dont-retry-on-connection-failure branch from 34ce3a9 to 97fc099, July 31, 2024 17:57
@booxter marked this pull request as draft, July 31, 2024 17:57
Commit message:

Before the patch, all errors from openai were handled by retrying up to API_MAX_RETRY times and returning the $ERROR$ message on the last attempt.

With this patch, if all attempts result in APIConnectionError, we raise a new EvalError exception. (If at least one of the previous attempts resulted in a different error, then we return $ERROR$ as usual.)

Also, several errors are not expected to recover with a retry (400-404, 422). This patch makes them return $ERROR$ immediately, without retrying.

Closes: instructlab#77

Signed-off-by: Ihar Hrachyshka <[email protected]>
Commit message:

Before the patch, we were calculating the messages on every retry attempt. The function is pure, so there is no good reason to repeat the calculation.

This also simplifies the function a bit.

Signed-off-by: Ihar Hrachyshka <[email protected]>
@booxter (Contributor, author) commented Jul 31, 2024:

I'm leaving this as a draft until I have a chance to run it against a real server (will do it tomorrow). Code-wise, I hope this at least reflects the latest discussions and intent. (I also included a minor change where I split the messages calculation out of the for loop; this is a separate commit to simplify review.)

@booxter force-pushed the dont-retry-on-connection-failure branch from 97fc099 to 12feb99, July 31, 2024 21:16
Review thread on the diff:

```
@@ -90,3 +90,15 @@ def __init__(self, tasks_dir) -> None:
super().__init__()
self.tasks_dir = tasks_dir
self.message = f"Invalid Tasks Dir: {tasks_dir}"


class OpenAIError(EvalError):
```

Member review comment:

Since the underlying API could change, maybe this should be more generic. ModelServingAPIError?

Review thread on the diff:

```
openai.InternalServerError,  # >=500
# NOTE: Errors listed below may need a revisit: we are not sure if
# it's ever helpful to retry them. Leaving them intact for now.
openai.AuthenticationError,  # 401
```

Member review comment:

I thought we were saying the 40xs would fit into the "is fatal" category.

Review thread on the diff:

```
@@ -272,14 +302,38 @@ def chat_completion_openai(
)
output = response.choices[0].message.content
break
except openai.OpenAIError as e:
```

Member review comment:

Don't we still need to handle this exception as a catch-all, in case they add a new exception type, for example?

@nathan-weinberg (Member) commented:

Closing in favor of #103

Linked issue that merging this pull request may close:

  • test_branch_gen_answers does not fail when no model is being served