Pull requests: huggingface/datatrove

Pull requests list

Add glob pattern for hash index
#313 opened Dec 11, 2024 by jordane95
Resolve issue 308
#309 opened Nov 29, 2024 by habanoz
load_tokenizer can now load local hf folder
#306 opened Nov 26, 2024 by ceferisbarov
Adding Megatron Tokenization pipeline
#304 opened Nov 14, 2024 by TJ-Solergibert
Use spaCy tokenizer for Dutch
#284 opened Sep 4, 2024 by BramVanroy
Video support for datatrove
#271 opened Aug 21, 2024 by guipenedo Draft
Fix SENTINEL cluster
#250 opened Jul 13, 2024 by jordane95
Mersenne prime hashing fix.
#200 opened May 28, 2024 by Apsod
Linewise filters
#125 opened Mar 14, 2024 by guipenedo Draft