Features of IterableDataset set to None by remove column #5284
Comments
Related to #5245 |
#self-assign |
Thanks @lhoestq and @alvarobartt! This would be extremely helpful to have working for the Whisper fine-tuning event - we're only training using streaming mode, so it'll be quite important to have this feature working to make training as easy as possible! c.f. https://twitter.com/sanchitgandhi99/status/1592188332171493377 |
I'm almost done with at least a temporary fix; I'll probably open the PR for review either tomorrow or Sunday, hopefully! Glad I can help you and HuggingFace 🤗 |
Awesome - thank you so much for this PR @alvarobartt! Is much appreciated! |
@sanchit-gandhi PR is ready and open for review at #5287, but there's still one issue where I may need @lhoestq's input 🤗 |
Let us know @sanchit-gandhi if you need a new release of datasets |
Thanks for the fix guys! We can direct people to install datasets from the main branch in the meantime |
Hey guys, any update around this? I'm facing the same issue with a streamable dataset. |
Hi @asennoussi, this was already fixed and released as part of https://github.com/huggingface/datasets/releases/tag/2.8.0, so you should be able to install it with pip install --upgrade datasets |
Still facing the same issue though:
Output:
|
Hmm weird, could you run print("Processed features: ", vectorized_datasets['train'].features) again after iterating over the dataset? |
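A sketch of what that check amounts to (variable names taken from the thread; whether a plain iteration is enough to populate .features depends on the datasets version, so treat this as the diagnostic being suggested rather than a guaranteed fix):

# pull a single example from the streaming dataset, then inspect the features again
sample = next(iter(vectorized_datasets["train"]))
print("Sample keys: ", sample.keys())  # the mapped columns do show up in the example itself
print("Processed features: ", vectorized_datasets["train"].features)  # may still print None here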
Didn't seem to fix it:
|
Actually the culprit looks to be this one: I still get
|
The culprit is definitely |
Yes, indeed. Anyway, now you can pass the features argument to map:

from datasets import Features

vectorized_datasets = raw_datasets.map(
    prepare_dataset,
    features=Features(
        {
            "path": raw_datasets["train"].info.features["path"],
            "audio": raw_datasets["train"].info.features["audio"],
            "sentence": raw_datasets["train"].info.features["sentence"],
        }
    ),
).with_format("torch")

Also, to let you know, more information about the latter is at #5287 |
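As a quick check (same names as above, assuming the map call above ran without errors), the features should now be populated without having to iterate over the dataset first:

print(vectorized_datasets["train"].features)
# expected: a Features dict with 'path', 'audio' and 'sentence', rather than None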
@asennoussi alternatively you can just call |
Got it thanks a lot! |
I am very late to the game, but I'm facing the same issue still after almost two years. Our dataset type is IterableDatasetDict, and after mapping we still get None features. We needed the feature names in a later stage to use interleave_datasets() to combine two datasets; otherwise we got errors. I just want to mention here: if you are using a function like prepare_dataset(batch), whatever features you produce in it, you need to cast those features manually yourself. Just as an example:

from datasets import Audio, Features, IterableDatasetDict, Sequence, Value, load_dataset

# feature_extractor and tokenizer are assumed to be defined elsewhere (e.g. a Whisper processor)
ds = IterableDatasetDict()
ds["train"] = load_dataset(str(DATA.DIR), "default", split="train", trust_remote_code=True, streaming=True)
ds = ds.cast_column("audio_filepath", Audio(sampling_rate=None))

def prepare_dataset(batch):
    audio = batch["audio_filepath"]
    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

# the features you produce in prepare_dataset()
add_new_features = Features(
    {
        "input_features": Sequence(feature=Sequence(feature=Value(dtype="float32"), length=-1), length=-1),
        "labels": Sequence(feature=Value(dtype="int64"), length=-1),
    }
)

ds = ds.map(prepare_dataset)
ds = ds.cast(add_new_features) |
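As a follow-up on the interleave_datasets() point: once both streaming datasets have been cast to explicit, non-None features this way, they can be combined. A small sketch, assuming a hypothetical second dataset ds2 prepared with the same features:

from datasets import interleave_datasets

# both streams now expose a matching, non-None schema, so they can be interleaved
combined = interleave_datasets([ds["train"], ds2["train"]])
print(combined.features)  # the shared input_features / labels schema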
Describe the bug
The remove_column method of IterableDataset sets the dataset features to None.
Steps to reproduce the bug
Print Output:
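A minimal sketch of the kind of snippet and printed result being described (the dataset and column names here are illustrative, not necessarily the original repro):

from datasets import load_dataset

# any streaming dataset with audio/text plus extra metadata columns shows the same behaviour
ds = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
print(ds.features.keys())  # dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
ds = ds.remove_columns(["file", "speaker_id", "chapter_id", "id"])
print(ds.features)  # None, rather than the remaining 'audio' and 'text' features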
Expected behavior
The features should be those not removed by the remove_column method, i.e. audio and text.
Environment info
datasets version: 2.7.1 (running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)
cc @polinaeterna @lhoestq