What's Changed
- Readme nits by @hynky1999 in #280
- Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files by @lyuwen in #286
- Fix languages listify bug by @BramVanroy in #294
- [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in #296
- [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in #307
- FineWeb-2: multilingual, NumPy 2.0, MinHash improvements by @guipenedo and @hynky1999 in #285:
  - upgrades to support NumPy 2.0
  - added additional word tokenizers and revamped word tokenizer assignment mechanism
  - MinHash optimizations + new Rust tool to speed up step 3
  - MinHash cluster sizes feature
  - fixed memory leaks from some word tokenizers
  - updated URL blocklists
  - added caching to some word tokenization calls
  - GlotLID support
  - general bugfixes
New Contributors
- @lyuwen made their first contribution in #286
- @BramVanroy made their first contribution in #294
- @silverriver made their first contribution in #296
- @Youggls made their first contribution in #307
Full Changelog: v0.3.0...v0.4.0