
Dataset generation error when num_demos is equal to demos_pool_size #1416

Open
yifanmai opened this issue Dec 7, 2024 · 5 comments

@yifanmai
Contributor

yifanmai commented Dec 7, 2024

I would expect load_dataset() to work normally when num_demos is equal to demos_pool_size, such as in the following code snippet:

from unitxt import load_dataset

dataset = load_dataset(card="cards.wnli", template_card_index=0, num_demos=5, demos_pool_size=5)

Instead, I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yifanmai/oss/unitxt/src/unitxt/api.py", line 136, in load_dataset
    return stream.to_dataset(
  File "/home/yifanmai/oss/unitxt/src/unitxt/stream.py", line 248, in to_dataset
    {
  File "/home/yifanmai/oss/unitxt/src/unitxt/stream.py", line 249, in <dictcomp>
    key: value.to_dataset(
  File "/home/yifanmai/oss/unitxt/src/unitxt/stream.py", line 40, in to_dataset
    return Dataset.from_generator(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1136, in from_generator
    ).read()
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/io/generator.py", line 49, in read
    self.builder.download_and_prepare(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1029, in download_and_prepare
    self._download_and_prepare(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1791, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1629, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/yifanmai/.pyenv/versions/unitxt/lib/python3.10/site-packages/datasets/builder.py", line 1786, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

However, this works if I set demos_pool_size to num_demos + 1

from unitxt import load_dataset

dataset = load_dataset(card="cards.wnli", template_card_index=0, num_demos=5, demos_pool_size=6)

This suggests that there is an off-by-one error in the sampling logic.

This is using unitxt's main branch from the repository at git hash 81f0da1adb16dc7ff7e83eb8dd9fe8d9ac594aac, on Python 3.10.13.

@dafnapension
Collaborator

dafnapension commented Dec 12, 2024

Hi @yoavkatz, I reproduced this, and first bumped into a related error:

  File "/home/dafna/workspaces/unitxt/src/unitxt/splitters.py", line 366, in process
    f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {self.sampler.sample_size}."
AttributeError: 'RandomSampler' object has no attribute 'sample_size'

Here the message needs to be fixed to:

 f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {sample_size}."

The more involved problem is the filtering done in these lines:

            source_stream = self.sampler.filter_source_by_instance(
                source_stream, instance
            )

This filters out candidate demos whose input_fields are identical to those of the instance being processed, and it is the cause of the error reported in this issue: one of the candidate demos (an instance from the stream typically called demos_pool) is identical in its input_fields to one of the non-demo instances. Because the numbers are tight (num_demos == demos_pool_size), that (potentially single) processed instance triggers the exception.
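Here is a minimal, self-contained illustration of that failure mode (made-up data, not unitxt code): when num_demos == demos_pool_size, filtering out even a single candidate whose input_fields match the processed instance leaves too few demos to sample.

num_demos = 5
demos_pool = [{"input_fields": {"text": f"example {i}"}} for i in range(num_demos)]
instance = {"input_fields": {"text": "example 3"}}  # identical to one candidate demo

# The filtering step drops the matching candidate, shrinking the pool below num_demos.
candidates = [d for d in demos_pool if d["input_fields"] != instance["input_fields"]]
print(len(candidates))  # 4 < num_demos == 5, so sampling raises the error above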

We may want to consider:
(1) Cleaning up the demos_pool itself, so that all demos are distinct in their input_fields.
(2) Allowing a slack of one, to cope with a processed instance that has the same input_fields as one of the candidate demos (see the sketch below).
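As a sketch of option (2), assuming a plain random sampler (illustrative only, not the unitxt implementation): with a slack of one in the pool, filtering out a single identical candidate still leaves enough demos to sample.

import random

def sample_demos(demos_pool, instance, num_demos):
    # Require a slack of one, so that one filtered-out candidate cannot break sampling.
    assert len(demos_pool) > num_demos, "demos_pool_size must exceed num_demos"
    candidates = [
        d for d in demos_pool if d["input_fields"] != instance["input_fields"]
    ]
    return random.sample(candidates, num_demos)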

@yoavkatz , please advise.

@yoavkatz
Member

Why are we trying to sample demonstrations for instances in the demo pool? Perhaps this is the issue. The demo pool instances need not be processed.

@dafnapension
Collaborator

Thank you, @yoavkatz. I think this is a different issue that should be solved in reset_pipeline of Recipe.
Perhaps add the parameter dont_apply_to_streams to each of the options in the if self.use_demos case (following a needed simplification of the special type InstanceOperatorWithMultiStreamAccess).

But here we have other issues:
(1) Should we select the demo instances so that they are distinct, or simply slice from train and live with whatever is sliced?
(2) How do we avoid an exception when filtered-out demo instances drop their number below the needed sample size?

@dafnapension
Collaborator

dafnapension commented Dec 13, 2024

I suggest solving all of these problems by generating the demos not as a separate stream, but as a fixed piece of data, held by a multi-stream operator that first generates it (again -- into a fixed list) and then distributes it over all instances, very much like the operator ExtractFieldValues.
When generating, the operator will select the first distinct demos_pool_size instances from train.
It will make sure that demos_pool_size > num_demos, so that if an ordinary instance is bumped into that happens to be identical to one of the selected demos, an exception will not be thrown, as happened in this issue.
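A rough sketch of this idea (my own illustration, not the actual implementation in the PR): select the first distinct demos_pool_size instances from train, keyed by their input_fields, and verify demos_pool_size > num_demos up front.

import json

def build_demos_pool(train_stream, demos_pool_size, num_demos):
    # Guard against the tight configuration that triggered this issue.
    if demos_pool_size <= num_demos:
        raise ValueError("demos_pool_size must be greater than num_demos")
    seen = set()
    demos = []
    for instance in train_stream:
        key = json.dumps(instance["input_fields"], sort_keys=True)
        if key in seen:
            continue  # skip instances whose input_fields duplicate an earlier demo
        seen.add(key)
        demos.append(instance)
        if len(demos) == demos_pool_size:
            break
    return demos  # a fixed list, distributed later over all instances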
@yoavkatz , what do you think?

@dafnapension
Collaborator

@yoavkatz, I prepared a PR to see how hard it would be to implement the demos_pool not as a separate stream, and to solve all the problems discussed here. It seems not too complicated: #1436
