Use blank nodes for array/list cells and possibly other things #1390

matko · 2022-08-22T11:45:36Z

matko
Aug 22, 2022
Maintainer

We currently force every node in our system to have an actual name which we can report back to the user, and which the user can query by. However, we also have various nodes that don't really need a name, as they're simply connective tissue between things.

The primary examples of this are array and list cells. Unless the user does a triple dump, these are never directly exposed. Instead, our code takes the triple structure and turns it into a nice JSON list. Similarly, the user never queries them directly. Instead, they're queried through a containing structure, or backwards through a contained structure.

Meanwhile, having these nodes explicitly named does incur a cost. To ensure global uniqueness, we generate a hefty randomized name. In many cases, this name might actually be way longer than the thing we're storing inside of the array or list. This causes the database to inflate in size.

As a solution, we could define a range of node ids which do not map to a name. They remain blank nodes. For this we can use a scheme similar to the one proposed in #1389: on each layer, we store a number representing how many blank nodes this layer needs. These blank nodes then take up an id range between the layer's nodes and its values.

For example, suppose a base layer has 10 normal nodes, 3 blank nodes, and 5 values. The normal nodes would get ids 1-10, the blank nodes 11-13, and the values 14-18.
If a child layer uses 2 additional normal nodes, 3 blank nodes, and 5 values, its normal nodes would get ids 19-20, its blank nodes 21-23, and its values 24-28.
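
To make the arithmetic above concrete, here is a minimal sketch of the id allocation in Python. The `Layer` class and its `ranges` and `kind_of` methods are hypothetical names made up for illustration, not the actual store implementation. The only assumption is the one stated above: each layer stores its counts, and its ids continue where the parent layer's ids end.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Layer:
    """Hypothetical layer that only records how many ids of each kind it adds."""
    nodes: int
    blanks: int
    values: int
    parent: Optional["Layer"] = None

    @property
    def size(self) -> int:
        return self.nodes + self.blanks + self.values

    @property
    def offset(self) -> int:
        # Highest id used by all ancestor layers (0 for a base layer).
        if self.parent is None:
            return 0
        return self.parent.offset + self.parent.size

    def ranges(self) -> Dict[str, Tuple[int, int]]:
        # Inclusive (first, last) id per kind; an empty kind has last < first.
        start = self.offset + 1
        node_end = start + self.nodes - 1
        blank_end = node_end + self.blanks
        value_end = blank_end + self.values
        return {
            "nodes": (start, node_end),
            "blanks": (node_end + 1, blank_end),
            "values": (blank_end + 1, value_end),
        }

    def kind_of(self, id_: int) -> str:
        # Walk from this layer up to the base layer until a range contains the id.
        layer: Optional[Layer] = self
        while layer is not None:
            for kind, (lo, hi) in layer.ranges().items():
                if lo <= id_ <= hi:
                    return kind
            layer = layer.parent
        raise ValueError(f"id {id_} is not allocated in this layer stack")


base = Layer(nodes=10, blanks=3, values=5)
child = Layer(nodes=2, blanks=3, values=5, parent=base)
print(base.ranges())
print(child.ranges())
```

With the counts from the example, `child.kind_of(12)` and `child.kind_of(22)` both report a blank node, without any name lookup being involved.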

When dumped as RDF in a format that does not support node eliding (Turtle does support this), or when a name is required in another context, such as an explicit user query, we can use blank node syntax for these nodes: _:<id>. When ingesting RDF that contains blank nodes, we can consider those blank nodes to be local to the data being ingested, and remap the given labels of those blank nodes (if any) to new blank nodes in the layer being inserted. When querying, we can simply error when a blank node is explicitly queried (users should use variables when they don't care about the node's identity, and we should simply not support direct retrieval by the 'name' of a blank node).
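
The remapping on ingestion could look roughly like the sketch below. Everything here is an assumption for illustration: `remap_blank_nodes` and `render_blank` are hypothetical helpers, triples are plain tuples, and any string term starting with `_:` is treated as a blank node label (a real parser would distinguish literals from blank nodes before this step).

```python
from typing import Dict, Iterable, List, Tuple, Union

Term = Union[str, int]
Triple = Tuple[Term, str, Term]


def remap_blank_nodes(
    triples: Iterable[Triple], first_blank_id: int
) -> Tuple[List[Triple], int]:
    """Replace document-local blank node labels with fresh layer-local ids.

    Returns the rewritten triples and the number of blank nodes used,
    which is the count the new layer would need to store.
    """
    mapping: Dict[str, int] = {}

    def resolve(term: Term) -> Term:
        if isinstance(term, str) and term.startswith("_:"):
            if term not in mapping:
                # Labels are only meaningful within this ingest, so hand out
                # the next free id in the layer's blank node range.
                mapping[term] = first_blank_id + len(mapping)
            return mapping[term]
        return term

    remapped = [(resolve(s), p, resolve(o)) for s, p, o in triples]
    return remapped, len(mapping)


def render_blank(id_: int) -> str:
    # Name used when a dump format or error message needs one.
    return f"_:{id_}"


incoming = [
    ("_:a", "rdf:first", "v1"),
    ("_:a", "rdf:rest", "_:b"),
    ("_:b", "rdf:first", "v2"),
]
remapped, blank_count = remap_blank_nodes(incoming, first_blank_id=21)
```

Here the labels `_:a` and `_:b` become ids 21 and 22, and `blank_count` is 2. The original labels are discarded, which matches the idea that they are local to the ingested data.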

The main advantage of this scheme would be that it allows us to have very cheap intermediate nodes. In addition to array/list cells, this could also be useful for things like subdocuments (provided that direct retrieval or uniqueness enforcement through naming are not required), or nested JSON structures (provided we don't want to use the value-hashed names for structure reuse).

matko · 2022-08-22T12:19:55Z

matko
Aug 22, 2022
Maintainer Author

This scheme has some advantages over the proposal in https://github.com/orgs/terminusdb/discussions/1389.
Most obviously, it'd retain n-dimensionality.
More importantly, though, it'd retain a node for a position in an array/list. Having such a node is great for cases where we want to point to a particular position in a list or array, regardless of what is there. That way, we can modify what is there without having to update these links. Furthermore, in the case of a linked list, we can insert stuff in the middle without invalidating any of the existing cells.
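
As a rough illustration of the linked list case, the sketch below models cells as a dictionary from cell id to a (first, rest) pair, loosely following the rdf:first/rdf:rest shape of a list. The helper names and the integer ids are hypothetical. The point is only that inserting a new cell rewires one `rest` link, while every existing cell keeps its id.

```python
from typing import Dict, List, Optional, Tuple

# cell id -> (value stored in the cell, id of the next cell or None at the end)
Cells = Dict[int, Tuple[str, Optional[int]]]


def insert_after(cells: Cells, cell_id: int, new_id: int, value: str) -> None:
    """Insert a new cell directly after cell_id without touching other cells."""
    first, rest = cells[cell_id]
    cells[new_id] = (value, rest)      # new cell takes over the old tail
    cells[cell_id] = (first, new_id)   # only this one link changes


def to_list(cells: Cells, head: Optional[int]) -> List[str]:
    out: List[str] = []
    while head is not None:
        first, head = cells[head]
        out.append(first)
    return out


cells: Cells = {11: ("a", 12), 12: ("b", 13), 13: ("c", None)}
bookmark = 12  # some other document pointing at the cell holding "b"

insert_after(cells, 11, new_id=21, value="x")
print(to_list(cells, 11))
```

After the insert, `bookmark` still refers to the cell holding "b", even though that cell is now third in the list rather than second.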

To fully leverage this though, we need weak links, which we don't have at this point. Proposal incoming.
