
Features of IterableDataset set to None by remove column #5284

Closed
sanchit-gandhi opened this issue Nov 23, 2022 · 19 comments · Fixed by #5287

sanchit-gandhi (Contributor) commented Nov 23, 2022

Describe the bug

The remove_columns method of IterableDataset sets the dataset features to None.

Steps to reproduce the bug

from datasets import Audio, load_dataset

# load LS in streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# check original features
print("Original features: ", dataset.features.keys())

# define features to remove: we KEEP audio and text
COLUMNS_TO_REMOVE = ['chapter_id', 'speaker_id', 'file', 'id']

dataset = dataset.remove_columns(COLUMNS_TO_REMOVE)

# check processed features, uh-oh!
print("Processed features: ", dataset.features)

# streaming the first audio sample still works
print("First sample:", next(iter(ds)))

Print Output:

Original features:  dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
Processed features:  None
First sample: {'audio': {'path': '2277-149896-0000.flac', 'array': array([ 0.00186157,  0.0005188 ,  0.00024414, ..., -0.00097656,
       -0.00109863, -0.00146484]), 'sampling_rate': 16000}, 'text': "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"}

Expected behavior

The features should be those not removed by the remove_columns method, i.e. audio and text.
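
For reference, the expected print output would look roughly like this (feature types taken from the LibriSpeech config; the exact repr may differ between datasets versions):

Processed features:  {'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'text': Value(dtype='string', id=None)}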

Environment info

  • datasets version: 2.7.1
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.15
  • PyArrow version: 9.0.0
  • Pandas version: 1.3.5

(Running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)

cc @polinaeterna @lhoestq

sanchit-gandhi added the bug and streaming labels on Nov 23, 2022
lhoestq (Member) commented Nov 23, 2022

Related to #5245

alvarobartt (Member) commented:

#self-assign

sanchit-gandhi (Contributor, Author) commented Nov 25, 2022

Thanks @lhoestq and @alvarobartt!

This would be extremely helpful to have working for the Whisper fine-tuning event - we're only training using streaming mode, so it'll be quite important to have this feature working to make training as easy as possible!

cf. https://twitter.com/sanchitgandhi99/status/1592188332171493377

alvarobartt (Member) commented:

> Thanks @lhoestq and @alvarobartt!
>
> This would be extremely helpful to have working for the Whisper fine-tuning event - we're only training using streaming mode, so it'll be quite important to have this feature working to make training as easy as possible!
>
> cf. https://twitter.com/sanchitgandhi99/status/1592188332171493377

I'm almost done with at least a temporary fix for rename_column, rename_columns, and remove_columns; I'm just trying to figure out how to extend it to the map function itself!

I'll probably open the PR for review either tomorrow or Sunday hopefully! Glad I can help you and HuggingFace 🤗

sanchit-gandhi (Contributor, Author) commented:

Awesome - thank you so much for this PR @alvarobartt! It's much appreciated!

alvarobartt (Member) commented:

@sanchit-gandhi The PR is ready and open for review at #5287, but there's still one issue where I may need @lhoestq's input 🤗

lhoestq (Member) commented Nov 28, 2022

Let us know @sanchit-gandhi if you need a new release of datasets soon with this fix included :)

sanchit-gandhi (Contributor, Author) commented:

Thanks for the fix guys! We can direct people to install datasets from main if that's easier!

asennoussi commented:

Hey guys, any update on this? I'm facing the same issue with a streamable dataset.

alvarobartt (Member) commented:

Hi @asennoussi, this was already fixed and released as part of https://github.com/huggingface/datasets/releases/tag/2.8.0, so you should be able to install it with pip install datasets==2.8.0, or run pip install datasets --upgrade to get the latest version (as of now, https://github.com/huggingface/datasets/releases/tag/2.9.0, released last week)! 🤗
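
Just in case, you can double-check the installed version before re-running the repro (a minimal sanity check):

import datasets

# the fix shipped in 2.8.0, so anything at or above that version should work
print(datasets.__version__)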

asennoussi commented:

Still facing the same issue though:

from datasets import IterableDatasetDict, load_dataset

raw_datasets = vectorized_datasets = IterableDatasetDict()


raw_datasets["train"] = load_dataset("asennoussi/private", split="train", use_auth_token=True, streaming=True)
raw_datasets["test"] = load_dataset("asennoussi/private", split="test", use_auth_token=True, streaming=True)

print("Original features: ", raw_datasets['train'].features.keys())

...

def prepare_dataset(batch):

    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]
    
    # optional pre-processing steps
    transcription = batch["sentence"]
    
    # encode target text to label ids
    batch["labels"] = processor.tokenizer(transcription).input_ids
    batch["labels_length"] = len(batch["labels"])
    return batch
...
vectorized_datasets = vectorized_datasets.remove_columns(['input_length', 'labels_length']+list(next(iter(raw_datasets.values())).features))
print("Processed features: ", vectorized_datasets['train'].features)
print("First sample:", next(iter(vectorized_datasets['train'])))

Output:

Original features:  dict_keys(['path', 'audio', 'sentence'])
Processed features:  None

alvarobartt (Member) commented:

Hmm, weird. Could you try to print

print("Processed features: ", vectorized_datasets['train'].features)

again after iterating over the vectorized_datasets? In the code above, it should be the last line :)
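
Something like this minimal sketch, reusing the variable names from your snippet:

# draw one sample first, then re-check the features as suggested
first_sample = next(iter(vectorized_datasets["train"]))
print("Processed features: ", vectorized_datasets["train"].features)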

asennoussi commented:

Didn't seem to fix it:

Original features:  dict_keys(['path', 'audio', 'sentence'])
Processed features:  None
Processed features:  None

asennoussi commented:

Actually, the culprit looks to be this one:

vectorized_datasets = raw_datasets.map(prepare_dataset).with_format("torch")

When I remove this line:

vectorized_datasets = vectorized_datasets.remove_columns(['input_length', 'labels_length']+list(next(iter(raw_datasets.values())).features))

I still get:

Processed features:  None

asennoussi commented:

The culprit is definitely .map.
Just validated it.
Any ideas, please?

alvarobartt (Member) commented:

> The culprit is definitely .map. Just validated it. Any ideas, please?

Yes, indeed .map loses the features: AFAIK pre-fetching the data to infer the features is expensive and not ideal; that's part of issue #3888

Anyway, now you can pass the features as a param to .map as follows:

from datasets import Features

vectorized_datasets = raw_datasets.map(
    prepare_dataset,
    features=Features(
        {
            "path": raw_datasets["train"].info.features["path"],
            "audio": raw_datasets["train"].info.features["audio"],
            "sentence": raw_datasets["train"].info.features["sentence"],
        }
    ),
).with_format("torch")

Also, to let you know: when calling .remove_columns on an IterableDataset the features are not lost, and the same goes for .rename_column and .rename_columns :)
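
For example, a minimal sketch of what I mean, with the column name assumed from your snippet above:

# removing a column on an IterableDataset keeps the remaining features
trimmed = raw_datasets["train"].remove_columns(["path"])
print(trimmed.features)  # expected: the audio and sentence features, not None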

More information about the latter at #5287

alvarobartt (Member) commented:

@asennoussi alternatively, you can just call ._resolve_features() on your IterableDataset and it will pre-fetch the data to resolve the features. Note that feature inference is not as accurate as manually specifying which features and feature types the IterableDataset has; as mentioned in the comment above, the alternative is to provide the features param to .map 🤗
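
In code, that looks roughly like this (note that _resolve_features is a private helper, so it may change between releases):

# pre-fetches one example to infer the features of the mapped dataset
resolved = vectorized_datasets["train"]._resolve_features()
print(resolved.features)  # inferred types, possibly coarser than hand-specified ones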

asennoussi commented:

Got it, thanks a lot!

SahebehDadboud commented Feb 7, 2025

> The culprit is definitely .map. Just validated it. Any ideas, please?
>
> Yes, indeed .map loses the features: AFAIK pre-fetching the data to infer the features is expensive and not ideal; that's part of issue #3888
>
> Anyway, now you can pass the features as a param to .map as follows:
>
> from datasets import Features
>
> vectorized_datasets = raw_datasets.map(
>     prepare_dataset,
>     features=Features(
>         {
>             "path": raw_datasets["train"].info.features["path"],
>             "audio": raw_datasets["train"].info.features["audio"],
>             "sentence": raw_datasets["train"].info.features["sentence"],
>         }
>     ),
> ).with_format("torch")
>
> Also, to let you know: when calling .remove_columns on an IterableDataset the features are not lost, and the same goes for .rename_column and .rename_columns :)
>
> More information about the latter at #5287

I am very late to the game, but I'm still facing the same issue after almost two years. Our dataset is an IterableDatasetDict, and after mapping we still get None features. We needed the feature names so that we could later use interleave_datasets() to combine two datasets; otherwise we got errors. I just want to mention here: if you are using a function like prepare_dataset(batch), you need to manually cast whatever features you produce in it. Just as an example:

from datasets import Audio, Features, IterableDatasetDict, Sequence, Value, load_dataset

# DATA.DIR, feature_extractor, and tokenizer are defined elsewhere in our training script
ds = IterableDatasetDict()

ds["train"] = load_dataset(str(DATA.DIR), "default", split="train", trust_remote_code=True, streaming=True)

ds = ds.cast_column("audio_filepath", Audio(sampling_rate=None))

def prepare_dataset(batch):
    audio = batch["audio_filepath"]
    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

# the features produced in prepare_dataset()
add_new_features = Features(
    {
        "input_features": Sequence(feature=Sequence(feature=Value(dtype="float32"))),
        "labels": Sequence(feature=Value(dtype="int64")),
    }
)

ds = ds.map(prepare_dataset)
ds = ds.cast(add_new_features)
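
After the cast, both streams expose explicit features, so interleaving works as we needed. A minimal sketch, assuming a hypothetical second dataset ds2 prepared and cast the same way:

from datasets import interleave_datasets

# interleave_datasets needs matching, non-None features on both streams
combined = interleave_datasets([ds["train"], ds2["train"]])
print(combined.features)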
