Bitextor test data - WARCs v1.0
A collection of XZ-compressed files that can be used to run regression tests with Bitextor (run-tests.sh). Tests can be run on three websites crawled between January 25 and 28, 2019. The three websites are:
- [greenpeace.org/canada], which is under Creative Commons Attribution 2.0,
- [http://kremlin.ru/], which is under Creative Commons Attribution 4.0,
- and [https://primeminister.gr/], which is under Creative Commons Attribution-NoDerivatives 4.0.
The kremlin-many-small.tar.xz package is a test that uses the content of kremlin.warc.xz, but each WARC contains only one pair of documents (taken from a Bitextor run on kremlin.warc.xz).
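
To sanity-check one of the XZ-compressed WARC files before running the tests, a rough record count can be obtained with the Python standard library alone. This is only a minimal sketch, not part of Bitextor: the function name `count_warc_records` is hypothetical, and the count is a heuristic that assumes every record starts with a `WARC/1.x` version line and that no payload line begins with the same prefix.

```python
import lzma
import sys


def count_warc_records(path):
    """Heuristically count WARC records in an XZ-compressed WARC file."""
    count = 0
    # lzma.open decompresses on the fly; binary mode avoids decoding errors
    # on payloads that are not valid UTF-8.
    with lzma.open(path, "rb") as stream:
        for line in stream:
            # Assumption: each record begins with a version line such as
            # "WARC/1.0"; payload lines with the same prefix would be
            # miscounted.
            if line.startswith(b"WARC/1."):
                count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python count_records.py kremlin.warc.xz
    print(count_warc_records(sys.argv[1]))
```

Because the file is streamed line by line, the archive never has to be fully decompressed to disk or held in memory.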