Skip to content
John Goodall edited this page May 14, 2014 · 4 revisions

Stucco Architecture

Architecture diagram

Guidelines

  • Default configuration files should be defined in yaml
  • Component configuration should be read from the configuration service
  • Messages between services should be formatted as JSON and sent via HTTP
  • Logs should be sent to logstash as JSON via the TCP Input
  • Messages into Storm should be via AMQP using RabbitMQ

Prerequisites

  • RabbitMQ: the message queue that interfaces between the collectors and the rt processing pipeline
  • Titan: the distributed graph database that stores the knowledge graph, accessible through the query-service
  • Logstash: tool for collecting and managing log files from stucco components using elasticsearch and kibana
  • etcd: tool for sharing configuration and service discovery from stucco components

Collection

Description

The collectors pull data or process data streams and push the collected data (documents) into the message queue. Each type of collector is independent of others. The collectors can be implemented in any language.

Collectors can either send messages with data or messages without data. For messages without data, the collector will add the document to the document store and attach the returned id to the message.

Collectors can either be stand-alone and run on any host, or be host-based and designed to collect data specific to that host.

Collector Types

Web collector

Web collectors pull a document via HTTP/HTTPS given a URL. Documents will be decompressed, but no other processing will occur.

Content format

Various (e.g. HTML, XML, CSV).

Scraping collector

Scrapers pull data embedded within a web page via HTTP/HTTPS given a URL and an HTML pattern.

Content format

HTML.

RSS collector

RSS collectors pull an RSS/ATOM feed via HTTP/HTTPS given a URL.

Content format

XML.

Twitter collector

Twitter collectors pull Tweet data via HTTP from the Twitter Search REST API given a user (@username), hashtag (#keyword), or search term.

Content format

JSON.

Netflow collector

Netflow collectors will collect from Argus. The collector will listen for argus streams using ra tool and convert to XML and pipe to send the flow data to the message queue as a string.

Content format

TODO.

Host-based collectors

Host-based collectors collect data from an individual host using agents.

Host-based collectors should be able to collect and forward:

  • System logs
  • Hone data
  • Installed packages
Content format

If we are writing the collector, JSON. If not, whatever format the agent uses.

State

Stand-alone collectors may require state state (state should be stored with the scheduler, such as the last time a site was downloaded). Host-based collectors may need to store state (e.g. when the last collection was run).

Input Transport Protocol

Input transport protocol will depend on the type of collector.

Input Format

Input format will depend on the type of collector.

Output Transport Protocol

Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ. See the concepts documentation for information about AMQP and RabbitMQ concepts. See the protocol documentation for more on AMQP. Examples below are in Go using the amqp package. Other libraries should implement similar interfaces.

The RabbitMQ exchange is exchange-type of topic with the exchange-name of stucco.

The exchange declaration options should be:

"topic",    // type
true,       // durable
false,      // auto-deleted
false,      // internal
false,      // noWait
nil,        // arguments

The publish options should be:

stucco,     // publish to an exchange named stucco
<routingKey>, // routing to 0 or more queues
false,      // mandatory
false,      // immediate

The <routingKeys> format should be: stucco.in.<data-type>.<source-name>.<data-name (optional)>, where:

  • data-type (required): the type of data, either 'structured' or 'unstructured'
  • source-name (required): the source of the collected data, such as cve, nvd, maxmind, cpe, argus, hone.
  • data-name (optional): the name of the data, such as the hostname of the sensor.

The message options should be:

    DeliveryMode:    1,    // 1=non-persistent, 2=persistent
    Timestamp:       time.Now(),
    ContentType:     "text/plain",
    ContentEncoding: "",
    Priority:        1,    // 0-9
    HasContent:      true, // boolean
    Body:            <payload>,

DeliveryMode should be 'persistent'.

Timestamp should be automatically filled out by your amqp client library. If not, the publisher should specify.

ContentType should be "text/xml" or "text/csv" or "application/json" or "text/plain" (i.e. collectorType from the output format). This is dependent on the data source.

ContentEncoding may be required if things are, for example, gzipped.

Priority is optional.

HasContent is an application-specific part of the message header that defines whether or not there is content as part of the message. It should be defined in the message header field table using a boolean: HasContent: true (if there is data content) or HasContent: false (if the document service has the content). The spout will use the document service accordingly. This is the only application-specific data needed.

Body is the payload, either the document itself or the id if HasContent is false.

The corresponding binding keys for the queue defined in the spout will can use the wildcards to determine which spout should handle which messages:

    • (star) can substitute for exactly one word.
  • (hash) can substitute for zero or more words.

For example, stucco.in.# would listen for all input.

Output Format

There are two types of output messages: (1) messages with data and (2) messages without data that reference an ID in the document store.


Scheduler

Description

State


Message Queue (MQ)

Description

The message queue accepts input (documents) from the collectors and pushes the documents into the processing pipeline. The message queue is implemented with RabbitMQ, which implements the AMQP standard.

Configuration

The queue should hold messages until they have been processed by the Storm Spout.

Protocol

Input and output protocol is AMQP 0-9-1.

Format

The message queue should pass on the data as is from collectors.


RT

Description

RT is the Real-time processing component of Stucco implemented as a Storm cluster.

This diagram shows how all the RT components are connected. The diagram at the very top shows how RT connects to the other components described in this document.

If the data it receives is not already included in the document store, it will be added.

The data it receives will be transformed into a graph, consistent with the ontology definition, and then added into graph store.

Input Transport Protocol

See AMQP Spout

Input Format

See AMQP Spout

The spout will send an acknowledgement to the queue when the messages are received, so that the queue can release these resources.

Output Transport Protocol

See Alignment Bolt

Output Format

See Alignment Bolt

RT will also add the raw documents it receives to the document store if needed.

RT may add additional intermediate output (eg. partially labeled text documents) to the document store if needed.

RT Components

AMQP Spout

Description

The AMQP spout pulls messages off the queue and uses the routing key contained in the message to push it to the appropriate UUID Bolt.

Input Transport Protocol

Advanced Message Queuing Protocol (AMQP)

Input Format

See Collector's Output Format

Output Transport Protocol

Storm's Multilang Protocol

Output Format

JSON object with the following fields:

  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise

UUID Bolt

Description

The UUID bolt generates and appends a universally unique identifier (UUID) to the message tuples.

The UUID is generated by computing a SHA-1 hash of the input string so that the UUID is deterministic based on the input.

Input Transport Protocol

Storm's Multilang Protocol

Input Format

JSON object with the following fields:

  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
Output Transport Protocol

Storm's Multilang Protocol

Output Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise

Parse Bolt

Description

The Parse bolt parses a structured document and produces its corresponding subgraph.

Input Transport Protocol

Storm's Multilang Protocol

Input Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
Output Transport Protocol

Storm's Multilang Protocol

Output Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • graph (string) - the GraphSON version of the structured document
  • timestamp (long) - the timestamp indicating when the message was collected

Extract Bolt

Description

The Extract bolt extracts an unstructured document's content either from the message, or by requesting the document from the document-service. The document content is then passed to bolts that can find domain-specific concepts.

Input Transport Protocol

Storm's Multilang Protocol

Input Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
Output Transport Protocol

Storm's Multilang Protocol

Output Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
  • text (string) - the extracted text, if the original data was included in the message; the document id for the extracted text within the document service, otherwise

Concept Bolt

Description

The Concept bolt finds domain-specific concepts within unstructured text.

Input Transport Protocol

Storm's Multilang Protocol

Input Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
  • text (string) - the extracted text, if the original data was included in the message; the document id for the extracted text within the document service, otherwise
Output Transport Protocol

Storm's Multilang Protocol

Output Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
  • text (string) - the extracted text, if the original data was included in the message; the document id for the extracted text within the document service, otherwise
  • concepts (string) - JSON object representing the domain-specific concepts

Relation Bolt

Description

The Relation bolt discovers relationships between the concepts and constructs a subgraph of this knowledge.

Input Transport Protocol

Storm's Multilang Protocol

Input Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
  • text (string) - the extracted text, if the original data was included in the message; the document id for the extracted text within the document service, otherwise
  • concepts (string) - JSON object representing the domain-specific concepts
Output Transport Protocol

Storm's Multilang Protocol

Output Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • graph (string) - the GraphSON representation of the unstructured document's concepts (nodes) and relationships (edges)
  • timestamp (long) - the timestamp indicating when the message was collected

Alignment Bolt

Description

The Alignment bolt aligns and merges the new subgraph into the full knowledge graph.

Input Transport Protocol

Storm's Multilang Protocol

Input Format

JSON object with the following fields:

  • uuid (string) - the hash value of the message
  • graph (string) - the GraphSON subgraph representing a document
  • timestamp (long) - the timestamp indicating when the message was collected
Output Transport Protocol

HTTP/REST

Output Format

GraphSON subgraph representing a document


Document Service

Description

The document-service stores and makes available the raw documents. The backend storage is on the local filesystem.

Commands

Add Document

Be sure to set the content-type of the HTTP header when adding documents to the appropriate type (e.g. content-type: application/json for JSON data or content-type: application/pdf for PDF files.

Routes:

  • PUT server:port/document - add a document and autogenerate an id
  • PUT server:port/document/id - add a document with a specific id

Get Document

The accept-encoding can be set to gzip to compress the communication (i.e., accept-encoding: application/gzip).

The accept command can be one of the following: application/json, text/plain, or application/octet-stream. Use application/octet-stream for PDF files and other binary data.

Routes:

  • GET server:port/document/id - retrieve a document based on the specific id

Input Transport Protocol

HTTP.

Input format

See Collector's Output Format

Output Transport Protocol

HTTP.

Output format

JSON.


Query Service

Description

The Query Service provides the API for the Graph Store to allow the Alignment bolt, the Visualization/UI, and any third-party applications to interface with the graph database, implemented in Titan.

This API provides a GraphSON interface over HTTP.

The API will provide functions that facilitate common operations (eg. get a node by ID) and also allow arbitrary Gremlin queries. (As the API matures, the use of arbitrary Gremlin queries will be removed or restricted to the Alignment bolt only.)

The API will be implemented with Rexter and a set of Rexter Extensions.

Routes

  • host:port/graphs/graph/type/<typename>
    Returns a list of all nodes of type <typename>
  • host:port/graphs/graph/node/<nodename>
    Returns the node with the specified <nodename>
  • host:port/graphs/graph/tp/gremlin?<gremlinquery>
    Runs the given <gremlinquery> and returns any results

Transport Protocol

HTTP.

Transport Format

GraphSON.


Configuration Service

Description

The configuration service hosts configuration information for all services. It is implemented in etcd.

The etcd configuration is described here. The default stucco configuration is loaded into etcd when instantiated by the config-loader. The default stucco configuration, stucco.yml is in the config repo; edit that file to have the configuration changes loaded. The nested configuration in the stucco.yml gets translated to the url path, so,

stucco
  document-service
    port

Would be accessible using the following URL: http://127.0.0.1:4001/v2/keys/stucco/document-service/port.

Usage

To communicate with the configuration service, either use HTTP or use a client library that supports version 2. There is also a command line client. To test with HTTP, use curl:

curl -L http://127.0.0.1:4001/v2/keys/mykey -XPUT -d value="this is awesome"
curl -L http://127.0.0.1:4001/v2/keys/mykey

Input Transport Protocol

HTTP/REST or using client library.

Input Format

JSON (for HTTP), or specific to the client library used.

Output Transport Protocol

JSON (for HTTP), or specific to the client library used.

Output Format

HTTP/REST or using client library.


Logging Service

Description

The logging service collects and aggregates logs from each of the components. The log server is implemented as a logstash server with an elasticsearch backend.

Usage

Sending logs to the server should be done via the TCP input plugin. The port number for the development environment can be found in the logstash configuration section of the Vagrantfile.

The Vagrant VM is set up with a logstash server to aggregate log files. Log files can be input in multiple ways. There is a simple configuration set up in the Vagrantfile. The easiest way to send logs is to send them over a TCP connection. For an example in node.js and python, see https://gist.github.com/jgoodall/6323951

Input Transport Protocol

TCP.

Input Format

JSON. The log message should be a stringified JSON object with the log message in the @message field. Optionally, an array of tags can be added to the @tags field. Additional fields can be added as an object to the @fields field as key-value pairs.

Output

To view the logs, use HTTP with a web browser. The web interface to logstash is implemented by kibana (v3). For the kibana port number in the development environment, see webserver_port in the kibana configuration section of the Vagrantfile.

Clone this wiki locally