README

The Protocol Informatics Framework
----------------------------------

Written by Marshall Beddoe <mbeddoe@baselineresearch.net>
Extended and modified by Lothar Braun <braun@net.in.tum.de>
Copyright (c) 2004 Baseline Research
Copyright (c) 2011 Lothar Braun

Source code repository available at 

https://github.com/constcast/Protocol-Informatics

Overview:

The Protocol Informatics project is a software framework that allows for
advanced sequence and protocol stream analysis by utilizing bioinformatics
algorithms. The sole purpose of this software is to identify protocol fields in
unknown or poorly documented network protocol formats. The algorithms that are
utilized perform comparative analysis on a series of samples to better
understand the underlying structure of the otherwise random-looking data. The
PI framework was designed for experimentation through the use of a widget-based
component set.

The framework aims to include a number of different algorithms that help
identify protocol structures in network traces. It ships with a command
line interface that lets you interactively control the process of
inferring protocol information from network traces.

Requirements:
-------------

Python >= 2.4       http://www.python.org
numpy               http://numpy.scipy.org/
PyYAML              http://pyyaml.org/

Optional:
Pcapy               http://oss.coresecurity.com/projects/pcapy.html
Pydot               http://code.google.com/p/pydot/


Controlling PI using the command line interface:
------------------------------------------------

All commands in the interface should be documented in the built-in
online help. To learn more about a command, pass its name to the
"help" command:

inf> help quit
Quit the program.

Program start:

You can start the program with or without a configuration file:

./main.py -c config.yml

If you do not specify a configuration file, a default file
'config.yml' in the current working directory will be used. If that
file does not exist, a default configuration will be loaded and
stored in 'config.yml'.

Whenever you change the configuration from within the program,
e.g. using the "config" command, you can save it using the
"saveconfig" command and load it again on the next program start. If
the configuration parameters "inputFile" and "format" are set in your
config, PI will automatically try to read input from that file.

Reading input data:

There are two ways to read sequences into your environment. You can
set the "inputFile" and "format" configuration variables, save your
config using "saveconfig", and restart the program using the
"restart" command. Alternatively, you can explicitly read input using
the "read" command. The following walkthrough shows what the first
steps in the command line interface can look like.

First steps:

Start the program:

$ ./main.py 
No default configuration found. Creating a default config file
"config.yml".

Welcome to Protocol-Informatics. What do you want to do today?

inf> 

This creates your default configuration file with default parameters
and drops you into the command line prompt. You can list the available
commands using the "help" command:

: inf> help
: 
: Documented commands (type help <topic>):
: ========================================
: EOF  PI  config  env  exit  help  quit  read  restart  saveconfig  seqs  show
: 
: inf>

For each command, you can get verbose help by passing the command's
name to the help command:


: inf> help read
: Command syntax: read [<bro|pcap|ascii|config>] <file>
:
: Tries to read file <file> in the specified format. If format
: equals "config", a new configuration file is read from <file>.
: In all other cases, input data for the protocol inferences are 
: read in the specified format (bro, pcap, ascii)
: inf> 
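
The help listing above has the layout that Python's standard "cmd"
module generates from method docstrings. As a minimal sketch (not
PI's actual source), the syntax of the "read" command could be parsed
as shown below. The names parse_read_args and SketchShell are
hypothetical, and falling back to a default format when only a file
name is given is an assumption about what the optional argument means.

```python
import cmd

FORMATS = ("bro", "pcap", "ascii", "config")


def parse_read_args(line, default_format="pcap"):
    """Split the arguments of 'read' into a (format, filename) pair."""
    parts = line.split()
    if len(parts) == 1:
        # Only a file name was given: use the default format (assumption).
        return default_format, parts[0]
    if len(parts) == 2 and parts[0] in FORMATS:
        return parts[0], parts[1]
    raise ValueError("Command syntax: read [<bro|pcap|ascii|config>] <file>")


class SketchShell(cmd.Cmd):
    """Hypothetical shell; cmd.Cmd serves each docstring via 'help <name>'."""

    prompt = "inf> "

    def do_read(self, line):
        """Command syntax: read [<bro|pcap|ascii|config>] <file>"""
        try:
            fmt, filename = parse_read_args(line)
        except ValueError as err:
            print(err)
            return
        print("would read %s as %s" % (filename, fmt))

    def do_quit(self, line):
        """Quit the program."""
        return True  # a true return value ends cmdloop()
```

Calling SketchShell().cmdloop() would start an interactive prompt in
which "help read" prints the docstring of do_read.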

An important command is the "config" command which can be used to read
and set configuration variables. If it is run without an argument, it
will print the configuration:

: inf> config
: ethOffset            14
: maxMessages          50
: weight               1.0
: format               pcap
: graph                False
: textBased            False
: configFile           config.yml
: messageDelimiter     None
: onlyUniq             False
: gnuplotFile          None
: inputFile            None
: interactive          True

The configuration parameters are important for controlling the program
and will be documented in the following sections. The configuration
parameters can be grouped by their meaning and use in the modules. For
the main module, denoted by the "inf>" prompt, the following
parameters are important:

ethOffset:
	Important when pcap files are read: defines the length of the
	Ethernet header. The default value is 14 (use 18 if you have a
	trace from a VLAN-tagged network).
maxMessages:
	Defines how many messages will be read by default from the
	input traces. If this is set to 0, all messages are read from
	the input file.
onlyUniq:
	Controls whether only unique messages are read from the input
	file or whether duplicate messages are allowed. Please note
	that uniqueness is evaluated per connection: if this parameter
	is set to true, duplicates are removed only within each
	connection. Duplicate messages that are spread over multiple
	connections will still be part of the input data.
inputFile:
	Defines the name of the file from which messages are read.
format:
	Defines the format that is used to read the filename specified
	by inputFile. Possible values:
	   - pcap
	     - expects a pcap file as produced by tcpdump -w <filename>
	   - bro
	     - expects an adu file as produced by bro with the script
	       that is shipped with this source code in 
	       bro-scripts/adu_writer.bro
	   - ascii
	     - expects a text file which contains a number of messages
	       separated by the newline character
	
	pcap files can be converted into bro files by changing to the
	bro-scripts directory and running the following command:
	
	<path_to_bro>/bin/bro -C -r <path_to_pcap> adu_writer.bro
configFile:
	Name of the YAML configuration file in which the current
	config is stored by the saveconfig command.
interactive:
	Defines if PI should run in interactive or non-interactive
	mode. Currently, only interactive mode is supported. 
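
How ethOffset, maxMessages, and onlyUniq interact can be sketched in
a few lines of Python. This is an illustration under stated
assumptions, not PI's reader code: the function name select_messages
and the (connection_id, frame) input shape are hypothetical, and
frames are assumed to be raw link-layer bytes.

```python
def select_messages(frames, eth_offset=14, max_messages=50, only_uniq=False):
    """Return (connection_id, data) pairs selected from the input frames.

    frames: iterable of (connection_id, raw_frame_bytes) tuples
            (hypothetical input shape chosen for this sketch).
    """
    seen = {}       # connection_id -> set of data already kept
    selected = []
    for conn_id, frame in frames:
        data = frame[eth_offset:]        # drop the Ethernet header
        if only_uniq:
            known = seen.setdefault(conn_id, set())
            if data in known:
                continue                 # duplicate within this connection
            known.add(data)
        selected.append((conn_id, data))
        # max_messages == 0 means "read everything"
        if max_messages and len(selected) >= max_messages:
            break
    return selected
```

Note that with only_uniq enabled, the same bytes seen in two
different connections are kept twice, which mirrors the per-connection
behaviour described for onlyUniq.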

Other configuration parameters are only needed in submodules.
Currently, we have the following submodules:
	  - seqs
		Offers methods for inspecting and changing the input
		data. For example, this module can select a random
		subsample of the input data or keep only unique
		messages.
		
	  - PI
		Offers the original functionality of the PI
		framework. Can create distance matrices and phylogeny
		trees, and can perform multiple sequence alignment.
		Please find more information on the code in PI/README.
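
To make the term "distance matrix" concrete, the sketch below scores
message pairs with Needleman-Wunsch global alignment, a classic
bioinformatics algorithm, and turns the scores into distances. It is
an illustration only: the scoring values, the normalization by the
longer message, and the function names are assumptions, and PI/README
remains the reference for what the PI module actually computes.

```python
def nw_score(a, b, match=1, mismatch=0, gap=0):
    """Needleman-Wunsch global alignment score of two sequences."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def distance_matrix(messages):
    """Symmetric matrix of distances in [0, 1] (0 means identical)."""
    n = len(messages)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(messages[i]), len(messages[j])) or 1
            score = nw_score(messages[i], messages[j])
            dist[i][j] = dist[j][i] = 1.0 - score / float(longest)
    return dist
```

With these default scores the alignment score equals the number of
positions that can be matched in order, so two messages that share a
long common structure end up close to each other in the matrix.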

Configuration parameters for the "seqs" module:

messageDelimiter:
	This configuration parameter can be used to split messages
	according to a sequence of characters.

fieldDelimiter:	 
	Currently unused.
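
The effect of messageDelimiter can be pictured as follows. This is a
minimal sketch, not the seqs module's code: the function name
split_messages is hypothetical, and both dropping empty pieces and
treating None as "do not split" are assumptions (None is the default
value shown by the "config" command).

```python
def split_messages(data, delimiter=None):
    """Split one chunk of input into messages at every delimiter."""
    if not delimiter:
        # No delimiter configured: keep the input as a single message.
        return [data]
    # Drop empty pieces, e.g. the one after a trailing delimiter.
    return [piece for piece in data.split(delimiter) if piece]
```

For example, a delimiter of b"\r\n" would turn a buffer that holds
several CRLF-terminated lines into one message per line.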

Configuration parameters for the PI module:

graph:
	Decides whether graphs are written to disk.

gnuplotFile:
	Currently unused.

weight:
	Weight used to determine how many clusters are found when
	grouping messages according to their similarity.
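
This README does not spell out how weight enters the clustering. One
plausible reading, used in the sketch below, is a similarity cutoff:
clusters keep merging while the most similar pair is at least as
similar as weight. The function name, the average-linkage rule, and
the similarity scale of 0 to 1 are all assumptions for illustration.

```python
def cluster(similarity, weight):
    """Greedy average-linkage clustering that stops below `weight`.

    similarity: symmetric n x n list of lists with values in [0, 1].
    Returns a list of clusters, each a sorted list of message indices.
    """
    clusters = [[i] for i in range(len(similarity))]

    def avg_sim(c1, c2):
        total = sum(similarity[i][j] for i in c1 for j in c2)
        return total / float(len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < weight:
            break                # no pair is similar enough to merge
        x, y = pair
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters
```

Under this reading, a high weight yields many small clusters and a
low weight yields few large ones.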
	
Configuration parameters for the Discoverer module:

minWordLength:
	The minimum length of a run of printable characters that is
	considered a text token.

ASCIILowerBound:
	The lowest ASCII character considered a printable token (used
	for text classification).

ASCIIUpperBound:
	The highest ASCII character considered a printable token (used
	for text classification).
	
dumpFile:
	Path to which the Discoverer results are written when the
	'dumpresult' command is executed. The filename is taken from
	the inputFile configuration parameter.
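
The three text-classification options can be read as follows: a run
of at least minWordLength consecutive bytes whose values lie between
ASCIILowerBound and ASCIIUpperBound counts as a text token. The
sketch below illustrates that reading. The function name, the default
values, and the emission of every other byte as a one-byte binary
token are assumptions, not the Discoverer module's code.

```python
def tokenize(message, min_word_length=3, ascii_lower=0x20, ascii_upper=0x7e):
    """Split a message into ("text", bytes) and ("binary", bytes) tokens."""
    data = bytearray(message)
    tokens = []
    i = 0
    while i < len(data):
        # Find the end of the printable run that starts at position i.
        j = i
        while j < len(data) and ascii_lower <= data[j] <= ascii_upper:
            j += 1
        if j - i >= min_word_length:
            tokens.append(("text", bytes(data[i:j])))
            i = j
        else:
            # Run too short or byte not printable: emit one binary byte.
            tokens.append(("binary", bytes(data[i:i + 1])))
            i += 1
    return tokens
```

Raising min_word_length makes the classification stricter, because
short printable runs inside binary data are no longer treated as
text.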