John Goodall edited this page May 14, 2014 · 1 revision

Document storage

Storing documents in the document store and pointing to the original document from the knowledge graph preserves data provenance and allows users to retrieve the original document that resulted in the addition or modification of nodes in the graph. There may be cases where a data source's terms of service, or the data volume, limit our ability to store everything in the document store.

Exogenous data

By default, we should store all exogenous, unstructured documents in the document store. Both the current document-service and elasticsearch attachments include Tika for text extraction.

By default, we should also store exogenous, structured documents.

Some sites may not allow us to store their data (e.g. Twitter), so we need a field to point to the original URL. This should be configurable at the data source level via the configuration as a field called store. Where store equals false, documents will not be stored (or will be stored only for as long as the processing pipeline needs access to them).
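
As a rough illustration of the store flag, the sketch below shows one way the provenance fields for a graph node could be derived from a per-source configuration. The SourceConfig class, the provenance_record function, and the original-url field name are hypothetical and not part of the current design. Only the store field and the document-source field come from this page.

```python
from dataclasses import dataclass


@dataclass
class SourceConfig:
    """Hypothetical per-data-source configuration."""
    name: str
    store: bool = True  # documents are stored by default


def provenance_record(config: SourceConfig, doc_id: str, url: str) -> dict:
    """Build the provenance fields to attach to a graph node.

    The original URL is always kept. The document-store id is kept only
    when the data source allows its documents to be stored.
    """
    return {
        "source": config.name,
        "original-url": url,  # assumed field name for the original URL
        "document-source": doc_id if config.store else None,
    }
```

With store set to false, the node still points to the original URL, but carries no document id.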

Endogenous data

Endogenous data may grow too large to keep indefinitely, so we may need a time to live (TTL). This can be configured in elasticsearch per index. It should be set at the data source level via the configuration as a field called ttl, whose value is applied to the specific index in elasticsearch using the put mapping API. The graph database also needs a field that identifies that the provenance is no longer available, for example by changing a document-source field from the document id to expired.
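
The sketch below shows what the ttl setting might translate to. It assumes the _ttl mapping field available in Elasticsearch 1.x (removed in later major versions), which is enabled per document type within an index. The function names, the doc_type parameter, and the node layout are hypothetical. The code only builds the request body for the put mapping API and does not contact elasticsearch.

```python
EXPIRED = "expired"  # marker value suggested above for unavailable provenance


def ttl_mapping(doc_type: str, ttl: str) -> dict:
    """Build a put mapping body that enables a default TTL.

    Assumes the Elasticsearch 1.x `_ttl` field; `ttl` is a time value
    such as "30d" taken from the data source configuration.
    """
    return {doc_type: {"_ttl": {"enabled": True, "default": ttl}}}


def mark_expired(node: dict) -> dict:
    """Return a copy of a graph node whose document-source is expired."""
    updated = dict(node)
    updated["document-source"] = EXPIRED
    return updated
```

Returning a copy in mark_expired keeps the original node unchanged, which makes the update easier to test. Whether the graph database is updated in place is a separate design decision.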
