
Features of IterableDataset set to None by remove column #5284

Closed
sanchit-gandhi opened this issue Nov 23, 2022 · 19 comments · Fixed by #5287

sanchit-gandhi (Contributor) commented Nov 23, 2022

Describe the bug

The remove_columns method of IterableDataset sets the dataset features to None.

Steps to reproduce the bug

from datasets import Audio, load_dataset

# load LS in streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# check original features
print("Original features: ", dataset.features.keys())

# define features to remove: we KEEP audio and text
COLUMNS_TO_REMOVE = ['chapter_id', 'speaker_id', 'file', 'id']

dataset = dataset.remove_columns(COLUMNS_TO_REMOVE)

# check processed features, uh-oh!
print("Processed features: ", dataset.features)

# streaming the first audio sample still works
print("First sample:", next(iter(ds)))

Print Output:

Original features:  dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
Processed features:  None
First sample: {'audio': {'path': '2277-149896-0000.flac', 'array': array([ 0.00186157,  0.0005188 ,  0.00024414, ..., -0.00097656,
       -0.00109863, -0.00146484]), 'sampling_rate': 16000}, 'text': "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"}

Expected behavior

The features should be those not removed by the remove_columns method, i.e. audio and text.
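
For reference, the expected print output would look roughly like this (feature types taken from the LibriSpeech config; the exact repr may differ between datasets versions):

Processed features:  {'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'text': Value(dtype='string', id=None)}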

Environment info

  • datasets version: 2.7.1
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.15
  • PyArrow version: 9.0.0
  • Pandas version: 1.3.5

(Running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)

cc @polinaeterna @lhoestq

sanchit-gandhi added the bug and streaming labels on Nov 23, 2022
lhoestq (Member) commented Nov 23, 2022

Related to #5245

alvarobartt (Member) commented:

#self-assign

sanchit-gandhi (Contributor, Author) commented Nov 25, 2022

Thanks @lhoestq and @alvarobartt!

This would be extremely helpful to have working for the Whisper fine-tuning event - we're only training using streaming mode, so it'll be quite important to have this feature working to make training as easy as possible!

cf. https://twitter.com/sanchitgandhi99/status/1592188332171493377

alvarobartt (Member) commented:

> Thanks @lhoestq and @alvarobartt!
>
> This would be extremely helpful to have working for the Whisper fine-tuning event - we're only training using streaming mode, so it'll be quite important to have this feature working to make training as easy as possible!
>
> cf. https://twitter.com/sanchitgandhi99/status/1592188332171493377

I'm almost done with at least a temporary fix for rename_column, rename_columns, and remove_columns; I'm just trying to figure out how to extend it to the map function itself!

I'll probably open the PR for review either tomorrow or Sunday hopefully! Glad I can help you and HuggingFace 🤗

sanchit-gandhi (Contributor, Author) commented:

Awesome - thank you so much for this PR @alvarobartt! It's much appreciated!

alvarobartt (Member) commented:

@sanchit-gandhi The PR is ready and open for review at #5287, but there's still one issue where I may need @lhoestq's input 🤗

lhoestq (Member) commented Nov 28, 2022

Let us know @sanchit-gandhi if you need a new release of datasets soon with this fix included :)

sanchit-gandhi (Contributor, Author) commented:

Thanks for the fix guys! We can direct people to install datasets from main if that's easier!

asennoussi commented:

Hey guys, any update on this? I'm facing the same issue with a streamable dataset.

alvarobartt (Member) commented:

Hi @asennoussi, this was already fixed and released as part of https://github.com/huggingface/datasets/releases/tag/2.8.0, so you should be able to install it with pip install datasets==2.8.0, or run pip install datasets --upgrade to get the latest version (as of now, https://github.com/huggingface/datasets/releases/tag/2.9.0, released last week)! 🤗
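
Just in case, you can double-check the installed version before re-running the repro (a minimal sanity check):

import datasets

# the fix shipped in 2.8.0, so anything at or above that version should work
print(datasets.__version__)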

asennoussi commented:

Still facing the same issue though:

from datasets import IterableDatasetDict, load_dataset

raw_datasets = vectorized_datasets = IterableDatasetDict()


raw_datasets["train"] = load_dataset("asennoussi/private", split="train", use_auth_token=True, streaming=True)
raw_datasets["test"] = load_dataset("asennoussi/private", split="test", use_auth_token=True, streaming=True)

print("Original features: ", raw_datasets['train'].features.keys())

...

def prepare_dataset(batch):

    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]
    
    # optional pre-processing steps
    transcription = batch["sentence"]
    
    # encode target text to label ids
    batch["labels"] = processor.tokenizer(transcription).input_ids
    batch["labels_length"] = len(batch["labels"])
    return batch
...
vectorized_datasets = vectorized_datasets.remove_columns(['input_length', 'labels_length']+list(next(iter(raw_datasets.values())).features))
print("Processed features: ", vectorized_datasets['train'].features)
print("First sample:", next(iter(vectorized_datasets['train'])))

Output:

Original features:  dict_keys(['path', 'audio', 'sentence'])
Processed features:  None

alvarobartt (Member) commented:

Hmm, weird. Could you try to print

print("Processed features: ", vectorized_datasets['train'].features)

again after iterating over the vectorized_datasets? In the code above, it should be the last line :)
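
Something like this minimal sketch, reusing the variable names from your snippet:

# draw one sample first, then re-check the features as suggested
first_sample = next(iter(vectorized_datasets["train"]))
print("Processed features: ", vectorized_datasets["train"].features)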

asennoussi commented:

Didn't seem to fix it:

Original features:  dict_keys(['path', 'audio', 'sentence'])
Processed features:  None
Processed features:  None

asennoussi commented:

Actually, the culprit looks to be this one:

vectorized_datasets = raw_datasets.map(prepare_dataset).with_format("torch")

When I remove this line:

vectorized_datasets = vectorized_datasets.remove_columns(['input_length', 'labels_length']+list(next(iter(raw_datasets.values())).features))

I still get:

Processed features:  None

asennoussi commented:

The culprit is definitely .map.
Just validated it.
Any ideas, please?

alvarobartt (Member) commented:

> The culprit is definitely .map. Just validated it. Any ideas, please?

Yes, indeed .map loses the features: AFAIK pre-fetching the data to infer the features is expensive and not ideal; that's part of issue #3888

Anyway, now you can pass the features as a param to .map as follows:

from datasets import Features

vectorized_datasets = raw_datasets.map(
    prepare_dataset,
    features=Features(
        {
            "path": raw_datasets["train"].info.features["path"],
            "audio": raw_datasets["train"].info.features["audio"],
            "sentence": raw_datasets["train"].info.features["sentence"],
        }
    ),
).with_format("torch")

Also, to let you know: when calling .remove_columns on an IterableDataset the features are not lost, and the same goes for .rename_column and .rename_columns :)
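
For example, a minimal sketch of what I mean, with the column name assumed from your snippet above:

# removing a column on an IterableDataset keeps the remaining features
trimmed = raw_datasets["train"].remove_columns(["path"])
print(trimmed.features)  # expected: the audio and sentence features, not None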

More information about the latter at #5287

alvarobartt (Member) commented:

@asennoussi alternatively, you can just call ._resolve_features() on your IterableDataset and it will pre-fetch the data to resolve the features. Note that feature inference is not as accurate as manually specifying which features and feature types the IterableDataset has; as mentioned in the comment above, the alternative is to provide the features param to .map 🤗
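
In code, that looks roughly like this (note that _resolve_features is a private helper, so it may change between releases):

# pre-fetches one example to infer the features of the mapped dataset
resolved = vectorized_datasets["train"]._resolve_features()
print(resolved.features)  # inferred types, possibly coarser than hand-specified ones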

asennoussi commented:

Got it, thanks a lot!

SahebehDadboud commented Feb 7, 2025

> The culprit is definitely .map. Just validated it. Any ideas, please?
>
> Yes, indeed .map loses the features: AFAIK pre-fetching the data to infer the features is expensive and not ideal; that's part of issue #3888
>
> Anyway, now you can pass the features as a param to .map as follows:
>
> from datasets import Features
>
> vectorized_datasets = raw_datasets.map(
>     prepare_dataset,
>     features=Features(
>         {
>             "path": raw_datasets["train"].info.features["path"],
>             "audio": raw_datasets["train"].info.features["audio"],
>             "sentence": raw_datasets["train"].info.features["sentence"],
>         }
>     ),
> ).with_format("torch")
>
> Also, to let you know: when calling .remove_columns on an IterableDataset the features are not lost, and the same goes for .rename_column and .rename_columns :)
>
> More information about the latter at #5287

I am very late to the game, but I'm still facing the same issue after almost two years. Our dataset is an IterableDatasetDict, and after mapping we still get None features. We needed the feature names so that we could later use interleave_datasets() to combine two datasets; otherwise we got errors. I just want to mention here: if you are using a function like prepare_dataset(batch), you need to manually cast whatever features you produce in it. Just as an example:

from datasets import Audio, Features, IterableDatasetDict, Sequence, Value, load_dataset

# DATA.DIR, feature_extractor, and tokenizer are defined elsewhere in our training script
ds = IterableDatasetDict()

ds["train"] = load_dataset(str(DATA.DIR), "default", split="train", trust_remote_code=True, streaming=True)

ds = ds.cast_column("audio_filepath", Audio(sampling_rate=None))

def prepare_dataset(batch):
    audio = batch["audio_filepath"]
    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

# the features produced in prepare_dataset()
add_new_features = Features(
    {
        "input_features": Sequence(feature=Sequence(feature=Value(dtype="float32"))),
        "labels": Sequence(feature=Value(dtype="int64")),
    }
)

ds = ds.map(prepare_dataset)
ds = ds.cast(add_new_features)
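
After the cast, both streams expose explicit features, so interleaving works as we needed. A minimal sketch, assuming a hypothetical second dataset ds2 prepared and cast the same way:

from datasets import interleave_datasets

# interleave_datasets needs matching, non-None features on both streams
combined = interleave_datasets([ds["train"], ds2["train"]])
print(combined.features)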
