Dataset generation error when num_demos is equal to demos_pool_size #1416
Hi @yoavkatz, I reproduced the problem and first ran into a related error:

File "/home/dafna/workspaces/unitxt/src/unitxt/splitters.py", line 366, in process
    f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {self.sampler.sample_size}."
AttributeError: 'RandomSampler' object has no attribute 'sample_size'

Here the message needs to be fixed to:

    f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {sample_size}."

The more involved problem is the filtering done in these lines:

    source_stream = self.sampler.filter_source_by_instance(
        source_stream, instance
    )

which filters out candidate demos if they are identical in their field called We may want to consider: @yoavkatz, please advise.
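To make the interaction between the filtering and the size check concrete, here is a minimal, self-contained sketch in plain Python. It is not unitxt's actual code: `filter_pool_by_instance`, `sample_demos`, and the field name `"input"` are hypothetical stand-ins chosen for illustration (the real field name is not given above). It assumes the filter drops every candidate whose field equals that of the current instance, and it formats the error with the local `sample_size` rather than an attribute on the sampler.

```python
import random


def filter_pool_by_instance(pool, instance, field="input"):
    # Hypothetical stand-in for the sampler's filter: drop candidate demos
    # whose `field` value equals that of the current instance.
    return [demo for demo in pool if demo[field] != instance[field]]


def sample_demos(pool, instance, sample_size, rng):
    # Filter first, then verify that enough candidates remain.
    candidates = filter_pool_by_instance(pool, instance)
    if len(candidates) < sample_size:
        # The message uses the local sample_size, not self.sampler.sample_size.
        raise ValueError(
            f"Size of population to sample from: {len(candidates)} is smaller "
            f"than the needed sample_size: {sample_size}."
        )
    return rng.sample(candidates, sample_size)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [{"input": f"question {i}"} for i in range(3)]

    # The instance matches one pool entry, so only 2 candidates remain.
    # Asking for 3 demos (num_demos == pool size) cannot be satisfied.
    try:
        sample_demos(pool, pool[0], 3, rng)
    except ValueError as err:
        print(err)

    # Asking for one fewer demo than the pool size still works.
    print(len(sample_demos(pool, pool[0], 2, rng)))
```

Under these assumptions, a single filtered-out candidate is enough to make sampling fail whenever the requested number of demos equals the pool size, which would match the reported behaviour.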
Why are we trying to sample demonstrations for instances in the demo pool? Perhaps this is the issue. The demo pool instances need not be processed.
Thank you, @yoavkatz. I think this is a different issue that should be solved in But here we have other issues:
I suggest solving all of these problems by generating the demos not as a separate stream but as a fixed piece of data, held by a multi-stream operator that first generates it (again, into a fixed list) and then distributes it over all instances, very much like the operator ExtractFieldValues.
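A rough sketch of that idea, in plain Python rather than as a real unitxt operator: the pool is materialized once as a fixed list, and demos are then attached to every remaining instance. The function name `attach_fixed_demos`, the `"demos"` key, and the choice to take the pool from the head of the data are all assumptions made for illustration. Pool instances themselves are not processed here, so no per-instance filtering is needed.

```python
import random


def attach_fixed_demos(instances, demos_pool_size, num_demos, seed=0):
    # Step 1: build the demo pool once, as a fixed list (hypothetical policy:
    # take the first `demos_pool_size` instances).
    pool = list(instances[:demos_pool_size])
    remaining = instances[demos_pool_size:]

    # Step 2: distribute demos over all remaining instances. Because pool
    # members are excluded from `remaining`, an instance can never be the
    # same object as one of its own demos.
    rng = random.Random(seed)
    with_demos = [
        {**instance, "demos": rng.sample(pool, num_demos)}
        for instance in remaining
    ]
    return pool, with_demos


if __name__ == "__main__":
    data = [{"input": f"question {i}"} for i in range(5)]
    # num_demos equal to demos_pool_size is fine in this arrangement.
    pool, processed = attach_fixed_demos(data, demos_pool_size=2, num_demos=2)
    for instance in processed:
        print(instance["input"], [d["input"] for d in instance["demos"]])
```

With this arrangement, `num_demos == demos_pool_size` simply yields every pool member as a demo, in a random order, for each instance.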
I would expect load_dataset() to work normally when num_demos is equal to demos_pool_size, such as in the following code snippet:
Instead, I get the error:
However, this works if I set demos_pool_size to num_demos + 1.
This suggests that there is an off-by-one error in the sampling logic.
This is using unitxt's main branch from the repository at githash 81f0da1adb16dc7ff7e83eb8dd9fe8d9ac594aac on Python 3.10.13.