Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix block_name handling #119

Closed
wants to merge 6 commits into from
Closed

Fix block_name handling #119

wants to merge 6 commits into from

Conversation

markmc
Copy link
Contributor

@markmc markmc commented Jul 11, 2024

(Depends on #118)

All Block subclasses but LLMBlock are failing to pass the block_name from block_config down to the base class, instead they are incorrectly passing the block type as its name.

markmc added 6 commits July 11, 2024 23:32
In preparation for custom pipeline configuration files, do not require
model_prompt as an LLMBlock param - it can have built-in knowledge
of the correct prompt to use per model_family.

Signed-off-by: Mark McLoughlin <[email protected]>
In order to prepare for pipeline definitions in YAML, remove
runtime parameters like the OpenAI client, model ID, and model
family from the pipeline definition into a PipelineContext
object that all blocks have access to.

Signed-off-by: Mark McLoughlin <[email protected]>
This addresses issues with using num_proc>1 with Dataset.map()
and Dataset.filter().

The first issue is:

```
  File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

What was happening here is that the entire FilterByValueBlock
object was being serialized to send to the multiprocessing
worker. And now that this includes PipelineContext, which
includes the OpenAI client object, which includes SSLContext,
we hit a known issue: uqfoundation/dill#308

The second issue is specific to map():

```
ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
```

It appears the the datasets, only in the case of num_proc>1,
when we hit the "error converting dtype" case and set the column
to None, it ends up being still considered a string column rather
than the new dtype.

This second issue deserves further investigation and may require
a fix to the datasets library.

Signed-off-by: Mark McLoughlin <[email protected]>
Address the following issue with using num_proc>1 with Dataset.map():

```
File "/usr/lib64/python3.11/pickle.py", line 578, in save
    rv = reduce(self.proto)
         ^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SSLContext' object
```

The entire block object is being serialized to sent to the
multiprocessing worker. And now that this includes PipelineContext,
which includes the OpenAI client object, which includes SSLContext,
we hit a known issue: uqfoundation/dill#308

Signed-off-by: Mark McLoughlin <[email protected]>
In order to remove another runtime parameter from pipeline
definitions to allow us to move to using YAML files.

Signed-off-by: Mark McLoughlin <[email protected]>
All Block subclasses but LLMBlock are failing to pass the
block_name from block_config down to the base class, instead
they are incorrectly passing the block type as its name.

Signed-off-by: Mark McLoughlin <[email protected]>
@mergify mergify bot added the testing Relates to testing label Jul 11, 2024
Copy link

E2E (NVIDIA A10G x4 - full pipeline) workflow launched on this PR: View run

Copy link

e2e workflow succeeded on this PR: View run, congrats!

Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@markmc
Copy link
Contributor Author

markmc commented Jul 12, 2024

Thanks for the review @russellb! Closing in favor of just merging #86 in one shot

@markmc markmc closed this Jul 12, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024
…_actions/rojopolis/spellcheck-github-actions-0.40.0

Bump rojopolis/spellcheck-github-actions from 0.38.0 to 0.40.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants