
Add a PipelineContext class #117

Closed
markmc wants to merge 4 commits

Conversation

@markmc (Contributor) commented Jul 11, 2024

(This is based on #116 which needs to merge first)

First things first, the multiprocessing workarounds are horrible. Ideas welcome for how to avoid the OpenAI client getting sucked into the data serialized to map/filter workers!
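One way to avoid that, sketched below with illustrative names rather than the exact code: make sure the callable handed to Dataset.map()/filter() closes only over plain picklable values, never over self.

```
from datasets import Dataset

class FilterByValueBlock:
    def __init__(self, ctx, filter_column, filter_value, num_procs=1):
        self.ctx = ctx  # PipelineContext holding the OpenAI client
        self.filter_column = filter_column
        self.filter_value = filter_value
        self.num_procs = num_procs

    def generate(self, samples: Dataset) -> Dataset:
        # Close over plain values only; referencing `self` in the
        # lambda would drag self -> ctx -> OpenAI client -> SSLContext
        # into the pickle sent to each worker process.
        column, value = self.filter_column, self.filter_value
        return samples.filter(
            lambda sample: sample[column] == value,
            num_proc=self.num_procs,
        )
```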

49c87d5 Add a PipelineContext class
7cfbaa9 Fix multiprocessing issues in FilterByValueBlock
9d92548 Fix multiprocessing issues in utilblocks

commit 49c87d5

    Add a PipelineContext class
    
    In order to prepare for pipeline definitions in YAML, remove
    runtime parameters like the OpenAI client, model ID, and model
    family from the pipeline definition into a PipelineContext
    object that all blocks have access to.
commit 7cfbaa9

    Fix multiprocessing issues in FilterByValueBlock
    
    This addresses issues with using num_proc>1 with Dataset.map()
    and Dataset.filter().
    
    The first issue is:
    
      File "/usr/lib64/python3.11/pickle.py", line 578, in save
        rv = reduce(self.proto)
             ^^^^^^^^^^^^^^^^^^
    TypeError: cannot pickle 'SSLContext' object
   
    What was happening here is that the entire FilterByValueBlock
    object was being serialized to send to the multiprocessing
    worker. And now that this includes PipelineContext, which
    includes the OpenAI client object, which includes SSLContext,
    we hit a known issue: uqfoundation/dill#308
    
    The second issue is specific to map():
    
    ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
    
    It appears that in datasets, only in the num_proc>1 case, when
    we hit the "error converting dtype" case and set the column to
    None, the column is still considered a string column rather than
    the new dtype.
    
    This second issue deserves further investigation and may require
    a fix to the datasets library.
commit 9d92548 (HEAD -> pipeline-context, markmc/pipeline-context)

    Fix multiprocessing issues in utilblocks
    
    Address the following issue with using num_proc>1 with Dataset.map():
    
    File "/usr/lib64/python3.11/pickle.py", line 578, in save
        rv = reduce(self.proto)
             ^^^^^^^^^^^^^^^^^^
    TypeError: cannot pickle 'SSLContext' object
    
    The entire block object is being serialized to be sent to the
    multiprocessing worker. And now that this includes PipelineContext,
    which includes the OpenAI client object, which includes SSLContext,
    we hit a known issue: uqfoundation/dill#308
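In outline, the PipelineContext change amounts to something like this sketch (field and constructor names are illustrative, not the final API):

```
from dataclasses import dataclass

@dataclass
class PipelineContext:
    # Runtime parameters shared by every block, kept out of the
    # pipeline definition so the definition itself can move to YAML.
    client: object  # the OpenAI client
    model_family: str
    model_id: str

class Block:
    def __init__(self, ctx: PipelineContext, block_name: str):
        self.ctx = ctx
        self.block_name = block_name
```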

markmc added 4 commits July 11, 2024 23:32
In preparation for custom pipeline configuration files, do not require
model_prompt as an LLMBlock param - it can have built-in knowledge
of the correct prompt to use per model_family.

Signed-off-by: Mark McLoughlin <[email protected]>
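The per-model_family prompt selection amounts to something like the following sketch (the registry name and family keys are illustrative):

```
# Illustrative built-in registry of prompts keyed by model family.
_MODEL_PROMPTS = {
    "merlinite": "<prompt text for merlinite-family models>",
    "mixtral": "<prompt text for mixtral-family models>",
}

def _get_model_prompt(model_family):
    return _MODEL_PROMPTS[model_family]
```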
(The remaining three commit messages are quoted in full in the description above.)
@markmc changed the title from Pipeline context to Add a PipelineContext class on Jul 11, 2024
@mergify bot added the testing label on Jul 11, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


e2e workflow succeeded on this PR: View run, congrats!

```
# ValueError: The features can't be aligned ...
# because the column is still considered a string not
# the new dtype.
num_proc = 1
```
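For reference, a minimal reproduction of that second issue, based on the description above (not verified against a particular datasets version), would look something like:

```
from datasets import Dataset

ds = Dataset.from_dict({"score": ["1.0", "not-a-number"]})

def convert(sample):
    # Convert to float, falling back to None on failure.
    try:
        sample["score"] = float(sample["score"])
    except ValueError:
        sample["score"] = None
    return sample

# Reportedly fine with num_proc=1; with num_proc=2 the column can
# still be considered a string and feature alignment fails.
ds = ds.map(convert, num_proc=2)
```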
Member
Thanks for the details, both in the commit message and the comment block here. I filed an issue here to track following up on this: #123

I wonder if setting the column to None just isn't a valid way to handle the case where we can't convert to the target datatype. Maybe we need to set a default value that's native to whatever we're converting to. Otherwise, as in the commit message, we end up with a mix of values converted to a numeric type and Nones, which aren't valid for that type.

Knowing what value we can use in this error case doesn't seem simple. It seems like it needs to be provided as configuration to the FilterBlock.
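Sketched with hypothetical names (try_convert and invalid_value are not existing API), the idea would be:

```
def try_convert(value, dtype, invalid_value):
    # On conversion failure, fall back to a value native to the target
    # dtype (e.g. 0.0 for float) instead of None, so the column keeps
    # a single dtype.
    try:
        return dtype(value)
    except ValueError:
        return invalid_value
```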

anyway ... issue filed to look closer

markmc (Contributor, Author)

Thanks for filing that issue - it's good practice to file an issue for any FIXME like that, but somehow I'm reluctant to file an issue for something that hasn't merged yet! But you're right ... I probably would never have filed the issue

On "is None valid" - see the comment I left in the issue. I think it is, but None handling was known to be buggy at one point. It feels like we've just hit a bug.

And yeah, I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

I also considered adding an "invalid_value" field, but adding something like that to the format to work around a bug? Ugh.

```
block_type: FilterByValueBlock
block_config:
  block_name: filter_questions
  filter_column: score
  filter_value: 1.0
  invalid_value: 0
  operation: eq
  convert_dtype: float
```

markmc (Contributor, Author)

Best to continue this discussion in #123, I think 👍

@russellb (Member) left a comment

Thanks.

The biggest change seemed like a lot, but it was mostly moving the same few pieces of data around, so with it in its own commit I was able to read through it pretty quickly.

These multiprocessing issues seem really tricky. Thank you for the detailed explanation of what's going on, and the various comments in the code.

You noted the one bug we need to fix, but it looks like you have a workaround in place, so I don't think we need to block on it. I filed an issue here for that: #123

@markmc (Contributor, Author) commented Jul 12, 2024

Thanks for the review @russellb! Closing in favor of just merging #86 in one shot

@markmc markmc closed this Jul 12, 2024