Add a PipelineContext class #117
Conversation
In preparation for custom pipeline configuration files, do not require model_prompt as an LLMBlock param - it can have built-in knowledge of the correct prompt to use per model_family. Signed-off-by: Mark McLoughlin <[email protected]>
In order to prepare for pipeline definitions in YAML, move runtime parameters like the OpenAI client, model ID, and model family out of the pipeline definition and into a PipelineContext object that all blocks have access to. Signed-off-by: Mark McLoughlin <[email protected]>
This addresses issues with using num_proc>1 with Dataset.map() and Dataset.filter(). The first issue is:

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

What was happening here is that the entire FilterByValueBlock object was being serialized to send to the multiprocessing worker. Now that this includes PipelineContext, which includes the OpenAI client object, which includes SSLContext, we hit a known issue: uqfoundation/dill#308

The second issue is specific to map():

```
ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
```

It appears that in datasets, only in the case of num_proc>1, when we hit the "error converting dtype" case and set the column to None, the column ends up still being considered a string column rather than the new dtype. This second issue deserves further investigation and may require a fix to the datasets library.

Signed-off-by: Mark McLoughlin <[email protected]>
Address the following issue with using num_proc>1 with Dataset.map():

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

The entire block object is being serialized to be sent to the multiprocessing worker. And now that this includes PipelineContext, which includes the OpenAI client object, which includes SSLContext, we hit a known issue: uqfoundation/dill#308

Signed-off-by: Mark McLoughlin <[email protected]>
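For anyone unfamiliar with the failure mode, here is a minimal, self-contained reproduction of the pickling error. The `Block` class is a toy stand-in (not the real FilterByValueBlock) that mimics the PipelineContext → OpenAI client → SSLContext attribute chain:

```python
import pickle
import ssl


class Block:
    """Toy stand-in for a pipeline block whose context holds an SSL object."""

    def __init__(self):
        # Mimics PipelineContext -> OpenAI client -> SSLContext
        self.ctx = ssl.create_default_context()


try:
    # This is effectively what Dataset.map()/filter() do with num_proc>1:
    # serialize the callable (and anything it closes over) for the workers.
    pickle.dumps(Block())
except TypeError as err:
    print(err)  # cannot pickle 'SSLContext' object
```

Anything reachable from the function handed to `map()`/`filter()` gets serialized, which is why moving the client into a shared context object suddenly exposed this.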
E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
```python
# ValueError: The features can't be aligned ...
# because the column is still considered a string not
# the new dtype.
num_proc = 1
```
Thanks for the details, both in the commit message and the comment block here. I filed an issue here to track following up on this: #123
I wonder if setting the column to None just isn't a valid way to handle the case where we can't convert to the target datatype. Maybe we need to set a default value that's native to whatever type we're trying to convert to. Otherwise, right now we have a mix of things converted to a numeric type (as in the commit message), and None isn't valid for those.
Knowing what value we can use in this error case doesn't seem simple. It seems like it needs to be provided as configuration to the FilterBlock.
anyway ... issue filed to look closer
Thanks for filing that issue - it's good practice to file an issue for any FIXME like that, but somehow I'm reluctant to file an issue for something that hasn't merged yet! But you're right ... I probably would never have filed the issue.
On "is None valid" - see the comment I left in the issue. I think it is, but None handling was known to be buggy at one point. It feels like we've just hit a bug.
And yeah, I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.
I also considered adding an "invalid_value" field, but adding something like that to the format to work around a bug? Ugh.
```yaml
block_type: FilterByValueBlock
block_config:
  block_name: filter_questions
  filter_column: score
  filter_value: 1.0
  invalid_value: 0
  operation: eq
  convert_dtype: float
```
Best to continue this discussion in #123, I think 👍
Thanks.
The biggest change seemed like a lot, but it was mostly moving the same few pieces of data around, so I was able to read through it pretty quickly the way you had it in its own commit.
These multiprocessing issues seem really tricky. Thank you for the detailed explanation of what's going on, and the various comments in the code.
You noted the one bug we need to fix, but it looks like you have a workaround in place, so I don't think we need to block on it. I filed an issue here for that: #123
(This is based on #116 which needs to merge first)
First things first, the multiprocessing workarounds are horrible. Ideas welcome for how to avoid the OpenAI client getting sucked into the data serialized to map/filter workers!
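One direction that might help (a sketch only, not what this PR implements; the class names here are illustrative): exclude the context from the block's pickled state with `__getstate__`/`__setstate__`, since the functions sent to map/filter workers don't need the client:

```python
import pickle
import ssl


class PipelineContext:
    """Illustrative stand-in: holds an unpicklable client (here an SSLContext)."""

    def __init__(self):
        self.client = ssl.create_default_context()
        self.model_id = "some-model"


class Block:
    def __init__(self, ctx):
        self.ctx = ctx

    def __getstate__(self):
        # Exclude the context when pickling, so multiprocessing workers
        # receive the block without the client/SSLContext attached.
        state = self.__dict__.copy()
        state.pop("ctx", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.ctx = None  # worker-side copies must not touch the client


blk = pickle.loads(pickle.dumps(Block(PipelineContext())))
print(blk.ctx)  # None
```

The obvious downside is that it's easy to forget and silently ship a `ctx`-less block to a worker that does need the model, so it would want a loud error path rather than `None`.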
49c87d5 Add a PipelineContext class
7cfbaa9 Fix multiprocessing issues in FilterByValueBlock
9d92548 Fix multiprocessing issues in utilblocks