GrantNakamura edited this page Apr 29, 2015 · 1 revision

Provenance

For the purposes of this project, we define provenance as metadata providing information about how a given property's value evolved. That history supports several goals, including:

  • Better understanding the significance of a given property
  • Identifying trending changes in the graph database
  • Aiding decisions in graph alignment

Provenance could be as complex as a complete history of changes, or as minimal as a record of the most recent modification time. Ideally, we would like a fairly detailed history. However, Gremlin has several constraints that limit our ability to provide this. The most serious is that Gremlin supports indexing of only single-valued properties, which makes a list-like history difficult to index. Another is that while Gremlin supports sub-properties, they must be single-valued.

We discussed these constraints in a team meeting, and agreed that we still wanted a detailed history even if that means losing the ability to index the provenance data.

We propose:

  • Using metadata properties providing information about other properties
  • Metadata properties are optional, to be used only where useful
  • Most metadata properties will be list-valued. In general, these are:
    1. updateTime. The times at which updated information relevant to determining a property's value came in.
    2. sourceId. The source that reported each update.
    3. updateValue. The value reported by each update. Note: In some cases, the value may be unchanged (for example, if a new source came in but reported the same value).
    These lists are parallel. That is, updateTime[i], sourceId[i], and updateValue[i] are all metadata for the same event.
  • Some metadata properties will be single-valued, solely for the purpose of indexing. For example, a lastModifiedTime property could be useful for finding nodes where an important property changed within a given time window. Note: "Modified" is used here to distinguish from updates, where it is possible that no modification occurred.
  • Metadata properties will be identified by a naming convention: propertyKey:metadataKey. The ':' is an arbitrary choice of a non-alphabetic character, to make conflicts with other property names less likely. As an example of usage, property foo might have associated metadata properties foo:updateTime, foo:sourceId, and foo:updateValue. Because of the non-alphabetic character, these names must be quoted when used, such as in v1.'foo:updateTime'.
  • Since the schema will be used for validation purposes, metadata properties will need to be explicitly declared there just like other properties.
  • We will also explicitly declare which single-valued metadata properties to index, just like other properties. Although we could infer the need to index these, making this explicit will make it easier to see what's indexed.
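
The bookkeeping proposed above can be sketched in a few lines. This is only an illustration in plain Python, not project code: a vertex is modeled as an ordinary dict, and the helper names meta_key, record_update, and history are hypothetical. It also assumes that the most recently reported value becomes the property's value and that the first value ever seen counts as a modification; the proposal does not specify either point.

```python
from typing import Any, Dict, List, Tuple

SEP = ":"  # separator from the propertyKey:metadataKey naming convention


def meta_key(property_key: str, metadata_key: str) -> str:
    """Build a metadata property name such as 'foo:updateTime'."""
    return f"{property_key}{SEP}{metadata_key}"


def record_update(props: Dict[str, Any], key: str, value: Any,
                  source_id: str, time: int) -> None:
    """Record one update event for `key` in a vertex's property map."""
    # Assumption: an update is a modification if the value is new or differs.
    modified = key not in props or props[key] != value

    # Append to the three parallel lists; index i describes one event.
    props.setdefault(meta_key(key, "updateTime"), []).append(time)
    props.setdefault(meta_key(key, "sourceId"), []).append(source_id)
    props.setdefault(meta_key(key, "updateValue"), []).append(value)

    if modified:
        # Assumption: the newest reported value becomes the property value.
        props[key] = value
        # Single-valued metadata property, so it remains indexable.
        props[meta_key(key, "lastModifiedTime")] = time


def history(props: Dict[str, Any], key: str) -> List[Tuple[int, str, Any]]:
    """Return (time, sourceId, value) events by zipping the parallel lists."""
    return list(zip(props.get(meta_key(key, "updateTime"), []),
                    props.get(meta_key(key, "sourceId"), []),
                    props.get(meta_key(key, "updateValue"), [])))


vertex: Dict[str, Any] = {}
record_update(vertex, "foo", "a", "src1", 100)
record_update(vertex, "foo", "a", "src2", 200)  # an update, but no modification
record_update(vertex, "foo", "b", "src1", 300)
# vertex["foo:updateTime"] is now [100, 200, 300]
# vertex["foo:lastModifiedTime"] is now 300
```

The second call shows the distinction between updates and modifications: it extends all three lists, but leaves foo:lastModifiedTime at 100 because the reported value did not change.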