@lhoestq wondering if the team has thought about this and if there are any recommendations?
Currently, when processing datasets, some examples are bound to get filtered out, whether because of bad formatting, excessive length, or any other custom filters being applied. Let's focus on filtering by length for now, since that is applied dynamically for each training run. Say we want to show a graph in W&B with the running total of the number of examples filtered so far.
What would be a good way to hook this up? Because the map/filter operations happen before the DataLoader batches are created, at training time, if we're just grabbing batches from the DataLoader, we won't know how many examples have already been filtered. But there's not really a good way to include a 'num_filtered' key in the dataset itself either, because dataset map/filter operations process examples independently and have no way to track a running sum.
The only approach I can think of is having an 'is_filtered' key in the dataset, and then creating a custom batcher/collator that reads that key and tracks the metric?
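To make the idea concrete, here is a minimal sketch of what such a collator could look like. It is plain Python with no `datasets` or `torch` dependency; `FilterCountingCollator` and `mark_too_long` are hypothetical names I made up, and the length threshold is arbitrary. The assumption is that the length check is applied with `map` (adding a flag) rather than `filter` (dropping the row), so the collator still sees the flagged examples.

```python
from typing import Any, Dict, List


class FilterCountingCollator:
    """Hypothetical collator: drops flagged examples and keeps a running count."""

    def __init__(self, flag_key: str = "is_filtered") -> None:
        self.flag_key = flag_key
        self.num_seen = 0      # total examples passed to the collator so far
        self.num_filtered = 0  # running total of examples dropped so far

    def __call__(self, examples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
        kept = [ex for ex in examples if not ex.get(self.flag_key, False)]
        self.num_seen += len(examples)
        self.num_filtered += len(examples) - len(kept)
        # Build a column-wise batch from the kept examples, without the flag.
        keys = [k for k in (kept[0] if kept else {}) if k != self.flag_key]
        return {k: [ex[k] for ex in kept] for k in keys}


def mark_too_long(example: Dict[str, Any], max_len: int = 4) -> Dict[str, Any]:
    """Flag the example instead of removing it (would be used with map, not filter)."""
    return {**example, "is_filtered": len(example["input_ids"]) > max_len}


# Toy data standing in for a tokenized dataset.
examples = [{"input_ids": [1, 2, 3]}, {"input_ids": [1, 2, 3, 4, 5, 6]}, {"input_ids": [7, 8]}]
flagged = [mark_too_long(ex) for ex in examples]

collator = FilterCountingCollator()
batch = collator(flagged)
# In a training loop, collator.num_filtered could be logged after each step,
# e.g. wandb.log({"num_filtered": collator.num_filtered}).
print(batch, collator.num_filtered)
```

One caveat with this approach: batches come out smaller than the nominal batch size whenever something is dropped, and if the DataLoader uses `num_workers > 0` the collator runs in worker processes, so the counter on the main-process object would not be updated. In that case the count would probably need to travel inside the batch itself (for example as an extra key) and be summed in the training loop.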