The big friendly filter 😁 (originally written by Dirk @ AI2, updated by me)
- Install Rust on your machine: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
- Add `~/.cargo/bin` to your `PATH` environment variable.
- Run `cargo build --release`. It places the binary at `target/release/bff`.
- Run `./target/release/bff --help` to see the available options.
There are three modes: `bff` (local input -> local output), `bff-remote` (S3 input -> S3 output), and `sysreq` (for assessing system requirements). We always need an input, an output, a false-positive rate, and an expected number of ngrams (the latter two size the Bloom filter). Beyond that, there are some optional hyperparameters:
- `--min-ngram-size`: In paragraph/both mode, we ignore any paragraphs shorter than this. Defaults to 5.
- `--max-ngram-size`: The "working width" of shinglings of ngrams: e.g., for long paragraphs/documents, we check membership over ngrams of this size. Defaults to 13.
- `--filtering-threshold`: If at least this fraction of ngrams is present, we remove the entire paragraph/document. Defaults to 0.8. (See the sketch after these option lists for how the three interact.)
And some REMOTE ONLY arguments:
- `--shard-num`: For large numbers of files, sharding is helpful. This selects some subset of the files (see the sharding sketch after the S3 example). Defaults to 0.
- `--num-shards`: Dictates how many shards we have. Defaults to 1.
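
To make those hyperparameters concrete, here is a rough sketch of the per-paragraph decision they control. It is illustrative only, not bff's actual code: the function name is made up, tokens are just whitespace-split words, and a plain `HashSet` of previously seen ngrams stands in for the real Bloom filter.

```rust
use std::collections::HashSet;

/// Illustrative only: decide whether a paragraph should be dropped, given the
/// ngrams already seen. bff uses a Bloom filter rather than a HashSet.
fn should_remove(
    paragraph: &str,
    seen: &mut HashSet<Vec<String>>,
    min_ngram_size: usize,    // --min-ngram-size
    max_ngram_size: usize,    // --max-ngram-size
    filtering_threshold: f64, // --filtering-threshold
) -> bool {
    let tokens: Vec<String> = paragraph.split_whitespace().map(str::to_string).collect();

    // Paragraphs shorter than --min-ngram-size are ignored (never removed).
    if tokens.len() < min_ngram_size {
        return false;
    }

    // Short paragraphs form a single ngram; long ones are shingled into
    // overlapping ngrams of width --max-ngram-size.
    let ngrams: Vec<Vec<String>> = if tokens.len() <= max_ngram_size {
        vec![tokens]
    } else {
        tokens.windows(max_ngram_size).map(|w| w.to_vec()).collect()
    };

    // Fraction of this paragraph's ngrams that have been seen before.
    let already_seen = ngrams.iter().filter(|ng| seen.contains(*ng)).count();
    let fraction = already_seen as f64 / ngrams.len() as f64;

    // Remember these ngrams for future paragraphs/documents.
    for ng in ngrams {
        seen.insert(ng);
    }

    // Remove the paragraph if at least --filtering-threshold of its ngrams
    // were already present.
    fraction >= filtering_threshold
}

fn main() {
    let mut seen = HashSet::new();
    let para = "the quick brown fox jumps over the lazy dog again and again";
    // First sighting: nothing has been seen yet, so the paragraph is kept.
    assert!(!should_remove(para, &mut seen, 5, 13, 0.8));
    // Exact repeat: every ngram is already present, so it is removed.
    assert!(should_remove(para, &mut seen, 5, 13, 0.8));
}
```

Because the real data structure is a Bloom filter rather than a set, membership checks can also return false positives at roughly the configured `--fp-rate`, which is why that rate has to be chosen up front.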
For files that exist locally, say a directory `to_be_deduped/`, we can output deduplicated versions of these files in `has_been_deduped/` like:
```bash
./target/release/bff bff \
  --inputs to_be_deduped \
  --output-directory has_been_deduped \
  --expected-ngram-count 12345678 \
  --fp-rate 0.01
```
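
The `--expected-ngram-count` and `--fp-rate` in the example above are what determine the size (and hence memory footprint) of the Bloom filter, which is presumably also what `sysreq` is assessing. The exact sizing logic lives in the code, but the textbook relationship for a Bloom filter holding `n` items at false-positive rate `p` is `m = -n * ln(p) / (ln 2)^2` bits, which gives a back-of-the-envelope estimate:

```rust
// Textbook Bloom filter sizing, not necessarily byte-for-byte what bff computes:
// optimal number of bits for n items at false-positive rate p.
fn bloom_filter_bits(expected_ngram_count: f64, fp_rate: f64) -> f64 {
    -expected_ngram_count * fp_rate.ln() / 2f64.ln().powi(2)
}

fn main() {
    // Values from the local example above.
    let bits = bloom_filter_bits(12_345_678.0, 0.01);
    println!("~{:.0} MiB", bits / 8.0 / 1024.0 / 1024.0); // ~14 MiB
}
```

At a 1% false-positive rate that works out to roughly 9.6 bits per expected ngram, so doubling `--expected-ngram-count` roughly doubles memory.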
For files that exist on S3, say with the prefix `s3://my-bucket/to_be_deduped/`, we can output deduplicated versions of these files in `s3://my-bucket/has_been_deduped` like:
```bash
./target/release/bff bff-remote \
  --bucket my-bucket \
  --input-dir to_be_deduped \
  --output_dir has_been_deduped \
  --expected-ngram-count 12345678 \
  --fp-rate 0.01
```
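
The `--shard-num`/`--num-shards` flags are there so a large S3 prefix can be split across several independent `bff-remote` jobs. The exact assignment of files to shards is in the code; a typical scheme, assumed here purely for illustration (the function below is hypothetical, not part of bff), is to give file `i` to shard `i % num_shards`:

```rust
// Hypothetical sharding illustration, not bff's actual code: assign file i to
// shard (i % num_shards), so each --shard-num value gets a disjoint subset.
fn files_for_shard(all_files: &[String], shard_num: usize, num_shards: usize) -> Vec<&String> {
    all_files
        .iter()
        .enumerate()
        .filter(|(i, _)| i % num_shards == shard_num)
        .map(|(_, f)| f)
        .collect()
}

fn main() {
    let files: Vec<String> = (0..10).map(|i| format!("doc_{i}.jsonl.gz")).collect();
    // With --num-shards 4 and --shard-num 2, this job would see files 2 and 6.
    println!("{:?}", files_for_shard(&files, 2, 4));
}
```

One caveat if shards run as completely separate jobs with separate filters: duplicates that span two shards won't be caught, which is presumably where the options to save and preload the Bloom filter (mentioned below) come in.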
There are also some options to preload or save the Bloom filter itself, but you can check the code for those.