Replies: 1 comment
-
This scheme has some advantages over the proposal in https://github.com/orgs/terminusdb/discussions/1389. To fully leverage this, though, we need weak links, which we don't have at this point. Proposal incoming.
-
We currently force every node in our system to have an actual name which we can report back to the user, and which the user can query by. However, we also have various nodes that don't really need a name, as they're simply connective tissue between things.
The primary examples of this are array and list cells. Unless the user does a triple dump, these are never directly exposed. Instead, our code will take the triple structure and turn it into a nice JSON list. Similarly, the user never actually queries them directly. Instead, they're queried through a containing structure, or backwards through a contained structure.
Meanwhile, having these nodes explicitly named does incur a cost. To ensure global uniqueness, we generate a hefty randomized name. In many cases, this name might actually be way longer than the thing we're storing inside of the array or list. This causes the database to inflate in size.
As a solution, we could define a range of node ids which do not map to a name. They remain blank nodes. For this we can use a scheme similar to the one proposed in (#1389): on each layer, we store a number representing how many blank nodes this layer needs. Those blank nodes then take up a range of ids between the layer's nodes and its values.
For example, suppose a base layer has 10 normal nodes, 3 blank nodes, and 5 values. The normal nodes would get ids 1-10, the blank nodes 11-13, and the values 14-18.
If a child layer uses 2 additional normal nodes, 3 blank nodes, and 5 values, its normal nodes would get ids 19-20, its blank nodes 21-23, and its values 24-28.
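To make the numbering concrete, here is a minimal sketch of how the ranges could be derived from the per-layer counts. The names `LayerCounts`, `layer_ranges` and `classify` are made up for illustration and are not part of the existing codebase; the only assumption taken from the proposal is that ids are assigned per layer in the order nodes, blank nodes, values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LayerCounts:
    """Hypothetical per-layer header: how many ids of each kind the layer adds."""
    nodes: int        # named nodes
    blank_nodes: int  # nameless nodes (the new stored count)
    values: int       # values


def layer_ranges(layers: List[LayerCounts]) -> List[Dict[str, Tuple[int, int]]]:
    """Return, for each layer, the inclusive (first, last) id range per kind.

    Ids start at 1 and continue across the layer stack, base layer first.
    A kind with a count of 0 gets an empty range where last == first - 1.
    """
    result = []
    next_id = 1
    for layer in layers:
        ranges = {}
        for kind, count in (
            ("node", layer.nodes),
            ("blank", layer.blank_nodes),
            ("value", layer.values),
        ):
            ranges[kind] = (next_id, next_id + count - 1)
            next_id += count
        result.append(ranges)
    return result


def classify(layers: List[LayerCounts], id_: int) -> str:
    """Tell whether an id is a named node, a blank node, or a value."""
    for ranges in layer_ranges(layers):
        for kind, (first, last) in ranges.items():
            if first <= id_ <= last:
                return kind
    raise KeyError(f"id {id_} is not in any layer")


# The two layers from the example above.
stack = [LayerCounts(10, 3, 5), LayerCounts(2, 3, 5)]
for ranges in layer_ranges(stack):
    print(ranges)
```

Because a blank node is identified purely by falling inside the blank range, no name lookup (and no stored name) is needed for it.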
When dumped as RDF in a format that does not support node eliding (Turtle does support this), or when a name is required in another context, such as an explicit user query, we can use blank node syntax for these nodes, `_:<id>`. When ingesting RDF containing blank nodes, we can consider those blank nodes to be local to the stuff to be ingested, and remap the given labels of those blank nodes (if any) to new blank nodes in the layer being inserted. When querying, we can simply error when a blank node is explicitly queried (users should use variables when they don't care about the node's identity, and we should simply not support direct retrieval by the 'name' of a blank node).
The main advantage of this scheme would be that it allows us to have very cheap intermediate nodes. In addition to array/list cells, this could actually also be useful for things like subdocuments (provided that direct retrieval or uniqueness enforcement through naming are not required), or nested JSON structures (provided we don't want to use the value-hashed names for structure reuse).
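The ingest and query rules for blank nodes described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: terms are plain strings, anything starting with `_:` is treated as a blank node label, and the function names `remap_blank_nodes` and `resolve_query_term` are hypothetical rather than existing APIs.

```python
import re
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Assumption: blank nodes are written as "_:<label>" in the input.
BLANK_LABEL = re.compile(r"^_:(.+)$")


def remap_blank_nodes(
    triples: Iterable[Triple], first_blank_id: int
) -> Tuple[List[Triple], int]:
    """Replace input-local blank labels with fresh ids in the new layer.

    Labels are only meaningful within this one ingest. Returns the rewritten
    triples and the number of blank nodes used, which is the count the new
    layer would have to store.
    """
    mapping: Dict[str, int] = {}

    def remap(term: str) -> str:
        match = BLANK_LABEL.match(term)
        if match is None:
            return term
        label = match.group(1)
        if label not in mapping:
            mapping[label] = first_blank_id + len(mapping)
        return f"_:{mapping[label]}"

    remapped = [(remap(s), p, remap(o)) for s, p, o in triples]
    return remapped, len(mapping)


def resolve_query_term(term: str) -> str:
    """Reject explicit blank node references in a query."""
    if BLANK_LABEL.match(term):
        raise ValueError(
            f"{term} is a blank node and cannot be queried by name; "
            "use a variable instead"
        )
    return term


# A two-cell list ingested into a layer whose blank range starts at id 21.
incoming = [
    ("_:a", "rdf:first", '"x"'),
    ("_:a", "rdf:rest", "_:b"),
    ("_:b", "rdf:first", '"y"'),
]
triples, blank_count = remap_blank_nodes(incoming, first_blank_id=21)
print(triples, blank_count)
```

Since the labels in the input never reach storage, two separate ingests that both happen to use `_:a` end up with distinct blank nodes.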