Provenance
GrantNakamura edited this page Apr 29, 2015 · 1 revision
For the purposes of this project, we define provenance as metadata providing information about how a given property's value evolved. That history supports several goals, including:
- Better understanding the significance of a given property
- Identifying trending changes in the graph database
- Aiding decisions in graph alignment
Provenance could be as complex as a complete history of changes, or as minimal as a record of the most recent modification time. Ideally, we would like a fairly detailed history. However, Gremlin has several constraints that limit our ability to provide this. The most serious is that Gremlin supports indexing only of single-valued properties, which makes it difficult to keep a list-like history that is also indexed. Another is that while Gremlin supports sub-properties, they must be single-valued.
We discussed these constraints in a team meeting, and agreed that we still wanted a detailed history even if that means losing the ability to index the provenance data.
We propose:
- Using metadata properties providing information about other properties
- Metadata properties are optional, to be used only where useful
- Most metadata properties will be list-valued. In general, these are:
  - updateTime. The times at which new information relevant to determining the property's value arrived.
  - sourceId.
  - updateValue. Note: In some cases, the data value may be unchanged (for example, if a new source came in but reported the same value).
  These lists are parallel. That is, updateTime[i], sourceId[i], and updateValue[i] are all metadata for the same event.
- Some metadata properties will be single-valued, solely for the purpose of indexing. For example, having a **lastModifiedTime** could be useful for finding nodes where an important property changed within a given time window. Note: "Modified" is used here to distinguish from updates, where it is possible that no modification occurred.
- Metadata properties will be identified by a naming convention: propertyKey:metadataKey. The ':' is an arbitrary choice of a non-alphabetic character, to make it harder to conflict with other property names. As an example of usage, property foo might have associated metadata properties foo:updateTime, foo:sourceId, and foo:updateValue. Because the name contains a non-alphabetic character, it must be quoted when used, such as in v1.'foo:updateTime'.
- Since the schema will be used for validation purposes, metadata properties will need to be explicitly declared there just like other properties.
- We will also explicitly declare which single-valued metadata properties to index, just like other properties. Although we could infer the need to index these, making this explicit will make it easier to see what's indexed.
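The proposal above can be sketched in a few lines of plain Python, with a dictionary standing in for a vertex's properties. This is only an illustration, not Gremlin code: the helper names `meta_key` and `record_update`, the integer timestamps, and the choice to treat the first value ever recorded as a modification are all assumptions made for the example.

```python
from typing import Any, Dict

SEPARATOR = ":"  # the non-alphabetic separator from the naming convention


def meta_key(property_key: str, metadata_key: str) -> str:
    """Build a metadata property name such as 'foo:updateTime' (hypothetical helper)."""
    return f"{property_key}{SEPARATOR}{metadata_key}"


def record_update(props: Dict[str, Any], key: str, value: Any,
                  source_id: str, time: int) -> bool:
    """Record one update event for props[key]; return True if the value changed."""
    # Append to the three parallel lists so that index i describes one event.
    props.setdefault(meta_key(key, "updateTime"), []).append(time)
    props.setdefault(meta_key(key, "sourceId"), []).append(source_id)
    props.setdefault(meta_key(key, "updateValue"), []).append(value)

    # An update is a modification only if it changes the stored value.
    # Assumption: the first value ever recorded counts as a modification.
    modified = key not in props or props[key] != value
    if modified:
        props[key] = value
        # Single-valued metadata, kept separately so that it can be indexed.
        props[meta_key(key, "lastModifiedTime")] = time
    return modified


vertex: Dict[str, Any] = {}
record_update(vertex, "foo", "a", "source-1", 100)
record_update(vertex, "foo", "a", "source-2", 200)  # same value: an update, not a modification
record_update(vertex, "foo", "b", "source-1", 300)
```

After these three calls, `vertex['foo:updateTime']` is `[100, 200, 300]` and `vertex['foo:updateValue']` is `['a', 'a', 'b']`, while `vertex['foo:lastModifiedTime']` is `300`. The second call extends the history without touching lastModifiedTime, which is the distinction between "updated" and "modified" drawn above.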