
Add a PipelineContext class #117

Closed
markmc wants to merge 4 commits

Conversation

@markmc (Contributor) commented Jul 11, 2024

(This is based on #116 which needs to merge first)

First things first, the multiprocessing workarounds are horrible. Ideas welcome for how to avoid the OpenAI client getting sucked into the data serialized to map/filter workers!
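One way to avoid that, sketched below with illustrative names rather than the exact code: make sure the callable handed to Dataset.map()/filter() closes only over plain picklable values, never over self.

```
from datasets import Dataset

class FilterByValueBlock:
    def __init__(self, ctx, filter_column, filter_value, num_procs=1):
        self.ctx = ctx  # PipelineContext holding the OpenAI client
        self.filter_column = filter_column
        self.filter_value = filter_value
        self.num_procs = num_procs

    def generate(self, samples: Dataset) -> Dataset:
        # Close over plain values only; referencing `self` in the
        # lambda would drag self -> ctx -> OpenAI client -> SSLContext
        # into the pickle sent to each worker process.
        column, value = self.filter_column, self.filter_value
        return samples.filter(
            lambda sample: sample[column] == value,
            num_proc=self.num_procs,
        )
```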

49c87d5 Add a PipelineContext class
7cfbaa9 Fix multiprocessing issues in FilterByValueBlock
9d92548 Fix multiprocessing issues in utilblocks

commit 49c87d5

    Add a PipelineContext class
    
    In order to prepare for pipeline definitions in YAML, remove
    runtime parameters like the OpenAI client, model ID, and model
    family from the pipeline definition into a PipelineContext
    object that all blocks have access to.
commit 7cfbaa9

    Fix multiprocessing issues in FilterByValueBlock
    
    This addresses issues with using num_proc>1 with Dataset.map()
    and Dataset.filter().
    
    The first issue is:
    
      File "/usr/lib64/python3.11/pickle.py", line 578, in save
        rv = reduce(self.proto)
             ^^^^^^^^^^^^^^^^^^
    TypeError: cannot pickle 'SSLContext' object
   
    What was happening here is that the entire FilterByValueBlock
    object was being serialized to send to the multiprocessing
    worker. And now that this includes PipelineContext, which
    includes the OpenAI client object, which includes SSLContext,
    we hit a known issue: uqfoundation/dill#308
    
    The second issue is specific to map():
    
    ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
    
    It appears that in datasets, only in the num_proc>1 case, when
    we hit the "error converting dtype" case and set the column to
    None, the column is still considered a string column rather than
    the new dtype.
    
    This second issue deserves further investigation and may require
    a fix to the datasets library.
commit 9d92548 (HEAD -> pipeline-context, markmc/pipeline-context)

    Fix multiprocessing issues in utilblocks
    
    Address the following issue with using num_proc>1 with Dataset.map():
    
    File "/usr/lib64/python3.11/pickle.py", line 578, in save
        rv = reduce(self.proto)
             ^^^^^^^^^^^^^^^^^^
    TypeError: cannot pickle 'SSLContext' object
    
    The entire block object is being serialized to be sent to the
    multiprocessing worker. And now that this includes PipelineContext,
    which includes the OpenAI client object, which includes SSLContext,
    we hit a known issue: uqfoundation/dill#308
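In outline, the PipelineContext change amounts to something like this sketch (field and constructor names are illustrative, not the final API):

```
from dataclasses import dataclass

@dataclass
class PipelineContext:
    # Runtime parameters shared by every block, kept out of the
    # pipeline definition so the definition itself can move to YAML.
    client: object  # the OpenAI client
    model_family: str
    model_id: str

class Block:
    def __init__(self, ctx: PipelineContext, block_name: str):
        self.ctx = ctx
        self.block_name = block_name
```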

markmc added 4 commits July 11, 2024 23:32
In preparation for custom pipeline configuration files, do not require
model_prompt as an LLMBlock param - it can have built-in knowledge
of the correct prompt to use per model_family.

Signed-off-by: Mark McLoughlin <[email protected]>
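The per-model_family prompt selection amounts to something like the following sketch (the registry name and family keys are illustrative):

```
# Illustrative built-in registry of prompts keyed by model family.
_MODEL_PROMPTS = {
    "merlinite": "<prompt text for merlinite-family models>",
    "mixtral": "<prompt text for mixtral-family models>",
}

def _get_model_prompt(model_family):
    return _MODEL_PROMPTS[model_family]
```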
(The remaining three commit messages are quoted in full in the description above.)
@markmc changed the title from Pipeline context to Add a PipelineContext class on Jul 11, 2024
@mergify bot added the testing label on Jul 11, 2024

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run


e2e workflow succeeded on this PR: View run, congrats!

```
# ValueError: The features can't be aligned ...
# because the column is still considered a string not
# the new dtype.
num_proc = 1
```
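For reference, a minimal reproduction of that second issue, based on the description above (not verified against a particular datasets version), would look something like:

```
from datasets import Dataset

ds = Dataset.from_dict({"score": ["1.0", "not-a-number"]})

def convert(sample):
    # Convert to float, falling back to None on failure.
    try:
        sample["score"] = float(sample["score"])
    except ValueError:
        sample["score"] = None
    return sample

# Reportedly fine with num_proc=1; with num_proc=2 the column can
# still be considered a string and feature alignment fails.
ds = ds.map(convert, num_proc=2)
```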
Member
Thanks for the details, both in the commit message and the comment block here. I filed an issue here to track following up on this: #123

I wonder if setting the column to None just isn't a valid way to handle the case where we can't convert to the target datatype. Maybe we need to set a default value that's native to whatever we're converting to. Otherwise, as in the commit message, we end up with a mix of values converted to a numeric type and Nones, which aren't valid for that type.

Knowing what value we can use in this error case doesn't seem simple. It seems like it needs to be provided as configuration to the FilterBlock.
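Sketched with hypothetical names (try_convert and invalid_value are not existing API), the idea would be:

```
def try_convert(value, dtype, invalid_value):
    # On conversion failure, fall back to a value native to the target
    # dtype (e.g. 0.0 for float) instead of None, so the column keeps
    # a single dtype.
    try:
        return dtype(value)
    except ValueError:
        return invalid_value
```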

anyway ... issue filed to look closer

markmc (Contributor, Author)

Thanks for filing that issue - it's good practice to file an issue for any FIXME like that, but somehow I'm reluctant to file an issue for something that hasn't merged yet! But you're right ... I probably would never have filed the issue

On "is None valid" - see the comment I left in the issue. I think it is, but None handling was known to be buggy at one point. It feels like we've just hit a bug.

And yeah, I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

I also considered adding an "invalid_value" field, but adding something like that to the format to work around a bug? Ugh.

```
block_type: FilterByValueBlock
block_config:
  block_name: filter_questions
  filter_column: score
  filter_value: 1.0
  invalid_value: 0
  operation: eq
  convert_dtype: float
```

markmc (Contributor, Author)

Best to continue this discussion in #123, I think 👍

@russellb (Member) left a comment

Thanks.

The biggest change seemed like a lot, but it was mostly moving the same few pieces of data around, so with it in its own commit I was able to read through it pretty quickly.

These multiprocessing issues seem really tricky. Thank you for the detailed explanation of what's going on, and the various comments in the code.

You noted the one bug we need to fix, but it looks like you have a workaround in place, so I don't think we need to block on it. I filed an issue here for that: #123

@markmc (Contributor, Author) commented Jul 12, 2024

Thanks for the review @russellb! Closing in favor of just merging #86 in one shot

@markmc markmc closed this Jul 12, 2024