This repository has been archived by the owner on Jan 25, 2021. It is now read-only.

Create a test dataset for benchmarking

The DigitalCorpora.org team provides a large collection of digital corpora for use in computer forensics research.

At http://downloads.digitalcorpora.org/corpora/files/govdocs1/threads you can download zip files, each containing a distinct set of around 1000 files. These make a useful dataset for benchmarking processing speed.

Create a collection folder inside your Hoover checkout directory, ~/docker-setup:

cd ~/docker-setup
mkdir -p collections/benchmark
cd collections/benchmark

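Use fetch.sh, run from collections/benchmark, to download all ten archives (thread0.zip through thread9.zip) into the data subfolder: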

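#!/bin/bash
# Download the govdocs1 thread archives (thread0.zip .. thread9.zip) into ./data.
mkdir -p data           # the data folder does not exist in a fresh collection
cd data || exit 1       # stop rather than download into the wrong directory
for i in {0..9}
do
   echo "Downloading thread$i.zip"
   # -f: fail on HTTP errors instead of saving an error page; -L: follow redirects
   curl -fL "http://downloads.digitalcorpora.org/corpora/files/govdocs1/threads/thread$i.zip" -o "thread$i.zip"
done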
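
Because the archives are large, a download can fail partway through without being noticed. The following is a minimal sketch of a check you could run afterwards. The helper name missing_archives is hypothetical and not part of Hoover or fetch.sh. It only assumes the ten file names used by the loop above and reports those that are absent or empty:

```shell
#!/bin/bash
# Hypothetical helper (not part of the Hoover repository): list which of the
# expected archives thread0.zip .. thread9.zip are missing or empty.
missing_archives() {
    local dir="${1:-data}"   # directory to inspect, defaults to ./data
    local i
    for i in {0..9}; do
        # -s is true only if the file exists and is larger than zero bytes
        [ -s "$dir/thread$i.zip" ] || echo "thread$i.zip"
    done
}

# Prints one line per missing archive; prints nothing if all ten are present.
missing_archives data
```

This only checks presence and size. It does not verify that each zip file is intact.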

Process all files using Hoover:

cd ~/docker-setup
./createcollection -c benchmark
./instructions/init-benchmark.sh
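
Since the goal is to benchmark processing speed, it can help to record how long a step takes. Below is a rough sketch of a timing wrapper. The function name run_timed is hypothetical, and it assumes that the wrapped command runs until its work is finished, which this README does not confirm for init-benchmark.sh:

```shell
#!/bin/bash
# Hypothetical wrapper: run a command, then print the elapsed wall-clock
# time in whole seconds to stderr and pass through the command's exit status.
run_timed() {
    local start=$SECONDS     # bash built-in: seconds since the shell started
    "$@"
    local status=$?
    echo "elapsed: $((SECONDS - start))s (exit status $status)" >&2
    return "$status"
}
```

For example, run_timed ./instructions/init-benchmark.sh from ~/docker-setup would print the elapsed time once the script returns. If the script only queues work and returns immediately, the figure reflects the queuing time, not the processing time.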