Documentation has moved to https://hoover-snoop2.readthedocs.io/en/latest/
Snoop walks a file dump, extracts text and metadata from each document found, and indexes that data into Elasticsearch.
Snoop also serves as a standalone web server that exposes all the data available for each indexed document.
- Documents: rtf, doc, xls, ppt & variants
- Emails: eml, emlx, msg
- Archives: zip, rar, 7z, tar
- Text files
- HTML pages
Snoop depends on lxml (which compiles against libxml2 and libxslt) and psycopg2 (which compiles against the PostgreSQL client headers). On Debian/Ubuntu the required packages are `build-essential`, `libmagic`, `python3-dev`, `libxml2-dev`, `libxslt1-dev` and `postgresql-server-dev-9.4` (or newer).
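On Debian/Ubuntu that works out to something like the following install command (adjust the `postgresql-server-dev-*` package to whatever PostgreSQL version your distribution ships):

```shell
# Build dependencies for lxml and psycopg2 on Debian/Ubuntu.
# postgresql-server-dev-9.4 is the minimum; newer versions also work.
sudo apt-get install build-essential libmagic python3-dev \
    libxml2-dev libxslt1-dev postgresql-server-dev-9.4
```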
Snoop can talk to a number of external tools: some understand a particular data format, others do useful processing. See "Optional Dependencies" below to install them.
- Create a virtualenv and install the requirements:

      $ pip install -r requirements.txt
- Configuration: create `snoop/site/settings/local.py`; you can use `example_local.py` as a template.
  - `DATABASES`: Django database configuration
  - `SNOOP_ROOT`: path to the dump
  - `SNOOP_ELASTICSEARCH_URL`: URL of the Elasticsearch server
  - `SNOOP_ELASTICSEARCH_INDEX`: name of the Elasticsearch index where the data will be indexed
  - `SNOOP_LOG_DIR`: path to the directory where worker logs will be dumped
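Put together, a minimal `local.py` might look like the sketch below. The database name, paths and index name are placeholder assumptions, not values taken from `example_local.py`:

```python
# Sketch of snoop/site/settings/local.py; all values are placeholders.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'snoop',                        # assumed database name
    },
}
SNOOP_ROOT = '/data/dump'                       # path to the file dump
SNOOP_ELASTICSEARCH_URL = 'http://localhost:9200'
SNOOP_ELASTICSEARCH_INDEX = 'snoop'             # index that receives the data
SNOOP_LOG_DIR = '/var/log/snoop'                # worker logs are written here
```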
- Run the migrations:

      $ ./manage.py migrate
- List the files in the dump, from the path configured in `SNOOP_ROOT`, and create entries in the database:

      $ ./manage.py walk
- Select documents for analysis. The argument to the `digestqueue` command is an SQL `WHERE` clause that chooses which documents will be analyzed; `true` means all documents. Selected documents are added to the `digest` queue:

      $ ./manage.py digestqueue
- Run the `digest` worker to process the queue. All documents successfully digested are automatically added to the `index` queue. Run as many of these processes as you want; they don't spawn any threads but are designed to run concurrently:

      $ ./manage.py worker digest

  The `worker` command accepts a `-x` flag.
- Create/reset the Elasticsearch index that you configured as `SNOOP_ELASTICSEARCH_INDEX`:

      $ ./manage.py resetindex
- Run the `index` worker to push digested documents to Elasticsearch:

      $ ./manage.py worker index
To digest a single document and view the JSON output:

    $ ./manage.py digest 123
Tika is used for text, language and metadata extraction.
You can download the server `.jar` from the Apache archive and run it:

    $ wget http://archive.apache.org/dist/tika/tika-server-1.17.jar
    $ java -jar tika-server-1.17.jar
After that, configure the following settings:

- `SNOOP_TIKA_SERVER_ENDPOINT`: URL of the Tika server. For a local server running with default settings, this should be `http://localhost:9998/`.
- `SNOOP_TIKA_MAX_FILE_SIZE`: in bytes; files larger than this won't be sent to Tika.
- `SNOOP_TIKA_FILE_TYPES`: a list of categories of files to send to Tika. Possible values are: `['pdf', 'doc', 'ppt', 'text', 'xls']`.
- `SNOOP_ANALYZE_LANG`: set to `True` to use Tika for language detection on documents that have text.
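The Tika section of `local.py` might then be sketched as follows; the size limit shown is an illustrative assumption, not a project default:

```python
# Illustrative Tika settings for snoop/site/settings/local.py.
SNOOP_TIKA_SERVER_ENDPOINT = 'http://localhost:9998/'
SNOOP_TIKA_MAX_FILE_SIZE = 50 * 2**20   # 50 MiB, expressed in bytes (assumed limit)
SNOOP_TIKA_FILE_TYPES = ['pdf', 'doc', 'ppt', 'text', 'xls']
SNOOP_ANALYZE_LANG = True               # let Tika detect document language
```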
The current setup uses `7z` to process archives.
On Debian/Linux, use the `p7zip` implementation. Rar support is also needed:

    $ sudo apt-get install p7zip-full
    $ sudo apt-get install p7zip-rar

Configure `SNOOP_SEVENZIP_BINARY` with the path to the `7z` binary; if it's installed system-wide, just use `7z`.
Set `SNOOP_ARCHIVE_CACHE_ROOT` to an existing folder with write access.
This folder will serve as a cache for all the extracted archives.
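In `local.py` this might look like the fragment below; the cache path is an assumption:

```python
# Illustrative archive settings; the cache path is an assumption.
SNOOP_SEVENZIP_BINARY = '7z'                        # or an absolute path, e.g. '/usr/bin/7z'
SNOOP_ARCHIVE_CACHE_ROOT = '/data/cache/archives'   # must exist and be writable
```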
The current setup uses the `msgconvert` script to convert `.msg` emails to `.eml`.
Docs: http://www.matijs.net/software/msgconv/

    $ cpanm --notest Email::Outlook::Message

Set `SNOOP_MSGCONVERT_SCRIPT` to the script's path; if it's installed system-wide, just use `msgconvert`.
Set `SNOOP_MSG_CACHE` to an existing folder with write access.
The current setup uses the `readpst` binary to convert `.pst` and `.ost` email archives to the mbox format.

    $ brew install libpst              # mac
    $ sudo apt-get install pst-utils   # debian / ubuntu

Set `SNOOP_READPST_BINARY` to the binary's path; if it's installed system-wide, just use `readpst`.
Set `SNOOP_PST_CACHE_ROOT` to an existing folder with write access.
Set:

- `SNOOP_GPG_HOME`: path to an existing gpg home directory with the keys to be used for decryption
- `SNOOP_GPG_BINARY`: path to the `gpg` binary to be used in conjunction with `SNOOP_GPG_HOME`
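Taken together, the settings for these external tools might be sketched in `local.py` like this; every path below is an assumption to be replaced with your actual install locations:

```python
# Illustrative external-tool settings; all paths are assumptions.
SNOOP_MSGCONVERT_SCRIPT = 'msgconvert'      # system-wide msgconvert install
SNOOP_MSG_CACHE = '/data/cache/msg'         # must exist and be writable
SNOOP_READPST_BINARY = 'readpst'            # or an absolute path to readpst
SNOOP_PST_CACHE_ROOT = '/data/cache/pst'    # must exist and be writable
SNOOP_GPG_HOME = '/home/snoop/.gnupg'       # existing gpg home containing the keys
SNOOP_GPG_BINARY = '/usr/bin/gpg'           # gpg binary used with SNOOP_GPG_HOME
```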