
Dataset generation error when num_demos is equal to demos_pool_size #1416

Open
yifanmai opened this issue Dec 7, 2024 · 5 comments

@yifanmai
Contributor

yifanmai commented Dec 7, 2024

I would expect load_dataset() to work normally when num_demos is equal to demos_pool_size, such as in the following code snippet:

from unitxt import load_dataset

dataset = load_dataset(card="cards.wnli", template_card_index=0, num_demos=5, demos_pool_size=5)

Instead, I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yifanmai/oss/unitxt/src/unitxt/api.py", line 136, in load_dataset
    return stream.to_dataset(
  File "/home/yifanmai/oss/unitxt/src/unitxt/stream.py", line 248, in to_dataset
    {
  File "/home/yifanmai/oss/unitxt/src/unitxt/stream.py", line 249, in <dictcomp>
    key: value.to_dataset(
  File "/home/yifanmai/oss/unitxt/src/unitxt/stream.py", line 40, in to_dataset
    return Dataset.from_generator(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1136, in from_generator
    ).read()
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/io/generator.py", line 49, in read
    self.builder.download_and_prepare(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1029, in download_and_prepare
    self._download_and_prepare(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1791, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1629, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1786, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

However, this works if I set demos_pool_size to num_demos + 1

from unitxt import load_dataset

dataset = load_dataset(card="cards.wnli", template_card_index=0, num_demos=5, demos_pool_size=6)

This suggests that there is an off-by-one error in the sampling logic.

This is using unitxt's main branch from the repository at git hash 81f0da1adb16dc7ff7e83eb8dd9fe8d9ac594aac, on Python 3.10.13.

@dafnapension
Collaborator

dafnapension commented Dec 12, 2024

Hi @yoavkatz, I reproduced this, and first bumped into a related error:

  File "/home/dafna/workspaces/unitxt/src/unitxt/splitters.py", line 366, in process
    f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {self.sampler.sample_size}."
AttributeError: 'RandomSampler' object has no attribute 'sample_size'

Here the message needs to be fixed to:

 f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {sample_size}."

The more involved problem is the filtering done in these lines:

            source_stream = self.sampler.filter_source_by_instance(
                source_stream, instance
            )

This filters out candidate demos whose input_fields are identical to those of the instance being processed, and it is the cause of the error reported in this issue: one of the candidate demos (an instance from the stream typically called demos_pool) is identical in its input_fields to one of the non-demo instances. Because the numbers are tight (num_demos == demos_pool_size), that (potentially single) processed instance triggers the exception.
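Here is a minimal, self-contained illustration of that failure mode (made-up data, not unitxt code): when num_demos == demos_pool_size, filtering out even a single candidate whose input_fields match the processed instance leaves too few demos to sample.

num_demos = 5
demos_pool = [{"input_fields": {"text": f"example {i}"}} for i in range(num_demos)]
instance = {"input_fields": {"text": "example 3"}}  # identical to one candidate demo

# The filtering step drops the matching candidate, shrinking the pool below num_demos.
candidates = [d for d in demos_pool if d["input_fields"] != instance["input_fields"]]
print(len(candidates))  # 4 < num_demos == 5, so sampling raises the error above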

We may want to consider:
(1) Cleaning up the demos_pool itself, so that all demos are distinct in their input_fields.
(2) Allowing a slack of one, to cope with a processed instance that has the same input_fields as one of the candidate demos (see the sketch below).
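As a sketch of option (2), assuming a plain random sampler (illustrative only, not the unitxt implementation): with a slack of one in the pool, filtering out a single identical candidate still leaves enough demos to sample.

import random

def sample_demos(demos_pool, instance, num_demos):
    # Require a slack of one, so that one filtered-out candidate cannot break sampling.
    assert len(demos_pool) > num_demos, "demos_pool_size must exceed num_demos"
    candidates = [
        d for d in demos_pool if d["input_fields"] != instance["input_fields"]
    ]
    return random.sample(candidates, num_demos)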

@yoavkatz , please advise.

@yoavkatz
Member

Why are we trying to sample demonstrations for instances in the demo pool? Perhaps this is the issue. The demo pool instances need not be processed.

@dafnapension
Collaborator

Thank you, @yoavkatz. I think this is a different issue that should be solved in reset_pipeline of Recipe.
Perhaps add the parameter dont_apply_to_streams to each of the options in the if self.use_demos case (following a needed simplification of the special type InstanceOperatorWithMultiStreamAccess).

But here we have other issues:
(1) Should we select the demo instances so that they are distinct, or simply slice from train and live with whatever is sliced?
(2) How do we avoid an exception when filtered-out demo instances drop their number below the needed sample size?

@dafnapension
Collaborator

dafnapension commented Dec 13, 2024

I suggest solving all of these problems by generating the demos not as a separate stream, but as a fixed piece of data, held by a multi-stream operator that first generates it (again -- into a fixed list) and then distributes it over all instances, very much like the operator ExtractFieldValues.
When generating, the operator will select the first distinct demos_pool_size instances from train.
It will make sure that demos_pool_size > num_demos, so that if an ordinary instance is bumped into that happens to be identical to one of the selected demos, an exception will not be thrown, as happened in this issue.
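A rough sketch of this idea (my own illustration, not the actual implementation in the PR): select the first distinct demos_pool_size instances from train, keyed by their input_fields, and verify demos_pool_size > num_demos up front.

import json

def build_demos_pool(train_stream, demos_pool_size, num_demos):
    # Guard against the tight configuration that triggered this issue.
    if demos_pool_size <= num_demos:
        raise ValueError("demos_pool_size must be greater than num_demos")
    seen = set()
    demos = []
    for instance in train_stream:
        key = json.dumps(instance["input_fields"], sort_keys=True)
        if key in seen:
            continue  # skip instances whose input_fields duplicate an earlier demo
        seen.add(key)
        demos.append(instance)
        if len(demos) == demos_pool_size:
            break
    return demos  # a fixed list, distributed later over all instances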
@yoavkatz , what do you think?

@dafnapension
Collaborator

@yoavkatz, I prepared a PR to see how hard it would be to implement the demos_pool not as a separate stream, and to solve all the problems discussed here. It seems not too complicated: #1436
