Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced support for lists in queries, results, and graph mutations #151

Open
mgberg opened this issue Sep 8, 2021 · 13 comments
Open

Enhanced support for lists in queries, results, and graph mutations #151

mgberg opened this issue Sep 8, 2021 · 13 comments

Comments

@mgberg
Copy link

mgberg commented Sep 8, 2021

Why?

Other databases and query languages have support for list operations and returning lists of items and SPARQL does not. Creating/modifying RDF lists in a SPARQL CONSTRUCT/UPDATE can be painful. Furthermore, when variables are aggregated, the connections between aggregated variables are lost.

While SPARQL is a pattern matching language and list operations arguably aren't pattern matching operations, when it comes to returning data it would be helpful both for users and for applications to be able to receive results collected into rows, which would allow people to return meaningful rows of data and/or prevent returning duplicate data. This could help make understanding and parsing query results easier when one row per pattern match isn't desired.

Proposed Solution

Example Data

Consider the following example data:

ex:Bob
  a foaf:Person ;
  foaf:name "Bob" ;
  foaf:knows ex:Jane , ex:Sue , ex:Jim ;
  ex:children ex:BobChildList ;
.

ex:BobChildList
  a rdf:List ;
  rdf:first ex:Sally ;
  rdf:rest ( ex:Timmy ex:Lucy ) ;
.

RDF List to SPARQL Result List

It would be convenient if a function was added that explicitly interprets a URI or variable bound to a URI or blank node that represents a RDF List as the contents of the list, perhaps AS_LIST. This would allow the full contents of an RDF List to be returned as a single variable binding which would be convenient for post-processing. These two example queries:

SELECT (AS_LIST(ex:BobChildList) AS ?childList) {}
SELECT ?childList
WHERE {
  ex:Bob ex:children ?listIdentifier .
  BIND ( AS_LIST(?listIdentifier) AS ?childList)
}

would both return this result:

childList
( ex:Sally ex:Timmy ex:Lucy )

Rows to List (and Vice Versa)

It would be convenient to be able to have the same functionality provided by functions like PostgreSQL's ARRAY_AGG or Cypher's COLLECT to transform data across multiple rows in one column into a column of lists as an aggregation operation. It would also be nice to support ordering inside the aggregate.

Query:

SELECT ?person ( COLLECT(DISTINCT ?friend) AS ?friendList )
WHERE {
  ?person foaf:knows ?friend
}
GROUP BY ?person

Result:

person friendList
ex:Bob ( ex:Jane ex:Sue ex:Jim )

Similarly, it would be convenient to be able to have the same functionality provided by functions like PostgreSQL's UNNEST or Cypher's UNWIND to transform data in a column of lists to a row for each element in the list. This could behave sort of like BIND except it could generate multiple result rows.

Query:

SELECT ?person ?child
WHERE {
  ex:Bob ex:children ?listIdentifier .
  UNNEST(AS_LIST(?listIdentifier) AS ?child)
}

Result:

person child
ex:Bob ex:Sally
ex:Bob ex:Timmy
ex:Bob ex:Lucy

In the above example, the list AS_LIST(?listIdentifier) was passed to UNWIND. It would be convenient if UNWIND interprets a non-list argument that is a URI or variable bound to a URI or blank node that represents a RDF List as the contents of the list before UNWINDing (i.e. passes the argument to AS_LIST before UNWINDing if not a list). That would enable the these two queries:

SELECT ?child { UNNEST(ex:BobChildList AS ?child) }
SELECT ?child
WHERE {
  ex:Bob ex:children ?listIdentifier .
  UNWIND(?listIdentifier AS ?child)
}

to return this result:

child
ex:Sally
ex:Timmy
ex:Lucy

This extra convience may be dropped if it would have a negative impact on query evaluation performance.

IN Operator

It would be convenient to extend the IN operator to be able to use a list by identifier or referenced by a variable. For example, the following two queries:

SELECT ( (ex:Timmy IN ex:BobChildList) AS ?result ) {}
SELECT ?result
WHERE {
  ex:Bob ex:children ?childList .
  BIND ( (ex:Timmy IN ?childList) AS ?result )
}

should both return:

result
true

As the current syntax expects a parenthesized expression after IN, putting a URI or variable after the IN as above to notate a resource that corresponds to a RDF List should not cause ambiguity with the current syntax (in a functional sense, at least).

An alternative or additional option would be to use the above AS_LIST first (i.e. the variable passed in was bound to a list). I don't know if this would be more or less performant depending on how many times the list was accessed as well as opportunities for short-circuiting. However, this would also enable the ability to use IN with a list that was previously COLLECTed, which could be very useful. If this was done, the above two queries would become the following.

SELECT ( (ex:Timmy IN AS_LIST(ex:BobChildList)) AS ?result ) {}
SELECT ?result
WHERE {
  ex:Bob ex:children ?childList .
  BIND ( (ex:Timmy IN AS_LIST(?childList)) AS ?result )
}

Creating RDF Lists in CONSTRUCT/UPDATE Queries

It is not convenient to create RDF Lists in CONSTRUCT or UPDATE Queries, especially if the length is variable. If a variable is bound to a list (such as by using COLLECT or AS_LIST above), then it should be able to simply be added to the graph as an RDF List directly. For example:

Query:

CONSTRUCT { ?person ex:friends ?friendList }
WHERE {
  {
    SELECT ?person ( COLLECT(DISTINCT ?friend) AS ?friendList )
    WHERE {
      ?person foaf:knows ?friend
    }
    GROUP BY ?person
  }
}

Result:

ex:Bob
  ex:friends ( ex:Jane ex:Sue ex:Jim ) ;
.

Replacing RDF Lists in UPDATE Queries

The same capabilities could be used to more easily replace lists in INSERT/DELETE queries. For example, suppose that in the example data the friends were accidentally swapped with the children.

Running this query:

DELETE {
  ?person
    foaf:knows ?friend_delete ;
    ex:children ?childList  ;
.
INSERT {
  ?person
    foaf:knows ?child ;
    ex:children ?friendList ;
.
}
WHERE {
  {
    SELECT ?person ( COLLECT(DISTINCT ?friend) AS ?friendList )
    WHERE {
      ?person foaf:knows ?friend
    }
    GROUP BY ?person
  }
  UNION
  {
    ?person foaf:knows ?friend_delete
  }
  UNION
  {
    ?person ex:children ?childListIdentifier
    UNNEST(AS_LIST(?childListIdentifier) AS ?child)
  }
}

Should result in the following data in the store:

ex:Bob
  a foaf:Person ;
  foaf:name "Bob" ;
  foaf:knows ex:Sally , ex:Timmy , ex:Lucy ;
  ex:children ( ex:Jane , ex:Sue, ex:Jim ) ;
.

# This triple was not referenced in the query, so it would remain
ex:BobChildList
  a rdf:List ;
.

Binding lists using BIND and VALUES

If a variable can have a list for a binding, then BIND and VALUES should be extended to be able to supply a list as a binding directly. Consequently, libraries that make SPARQL requests would be able to pre-bind lists to variables. Furthermore, it would be possible to create a list where each element is an already bound variable.

Advanced: List "Mutations" and Access

Some less critical but potentially useful features would be the ability to create new lists by modifying the contents of an existing list (such as by appending an item, inserting an item at a specified position, etc.) or accessing values in a list by index. These kind of features are (in my personal opinion) much lower priority features than AS_LIST/COLLECT/UNNEST.

Others?

I'm sure there are other useful list operations that I'm not thinking of at the moment that could be useful to include.

Considerations for backward compatibility

Aside possibly for the syntax for IN expressions, nothing jumps out at me as affecting backwards compatibility. Perhaps someone who is more familiar with the inner workings of SPARQL algebra and query evaluation would find additional issues.

@gkellogg
Copy link
Member

gkellogg commented Sep 8, 2021

Notation3 has a number of useful List builtins that might be useful as the basis for SPARQL builtins.

@mgberg
Copy link
Author

mgberg commented Sep 8, 2021

Notation3 has a number of useful List builtins that might be useful as the basis for SPARQL builtins.

Good point, I forgot about those.

@namedgraph
Copy link

Looks quite a bit like XPath 3.1 array functions.

@ericprud
Copy link
Member

SWObjects has a notion of generators which can be used in place of a Collection. Instead of writing e.g.

SELECT ?flag { ?flag :colors (:blue :white :red) } # France, maybe a few others. US is in other order.

you can ask

SELECT ?flag { ?flag :colors UNORDERED(:blue :white :red) } # lots and lots of countries.

The MEMBERS function is like List to Rows above. Also includes:

  • STARTS
  • ENDS
  • ANY (includes one or more of)

and should probably include AT_LEAST.

Bindings produced by a List to Rows function marked the result set as ordered (which prohibits certain optimizations).

@kurtcagle
Copy link

Some less critical but potentially useful features would be the ability to create new lists by modifying the contents of an existing list (such as by appending an item, inserting an item at a specified position, etc.) or accessing values in a list by index. These kind of features are (in my personal opinion) much lower priority features than AS_LIST/COLLECT/UNNEST.

Access is a no-brainer. Several that come to mind:

AT(?list,?index) - given a list pointer, retrieves the item at the ?index'd position, or NULL if nothing is found at that index.
MEMBER(?list,?pointer) - retrieves the value of the member associated with the given pointer. (the rdf:first object)
POINTER(?list,?index) - similar to AT, but retrieves the blank node at the indexed position.
LENGTH(?list) - Given a list pointer, retrieve the number of items from that pointer to the rdf:nil pointer.
TAIL(?pointer) - Retrieves the last pointer in a list with a given pointer. ​
HEAD(?pointer) - Retrieves the first pointer in a list with a given pointer. ​
FIND(?list,?value) - returns the first pointer for which the rdf:first value is ?value.
PREVIOUS(?pointer) - Given a pointer into the list, returns the previous pointer in the chain
NEXT(?pointer) - Given a pointer into the list, returns the next pointer in the chain.

I'm inclined to see mutations as imperatives:

MUTATION {
INSERT_AFTER(TAIL(?chapterList),?value)
​}
WHERE {
​?myBook ?myBook:hasChapterList ?chapterList.
bind("The Important Chapter" as ?value)​
​}
Appends "The Important Chapter to the end of the list".

Mutations obviously open up a larger can of worms.

@afs
Copy link
Collaborator

afs commented Oct 5, 2021

@gkellogg
Copy link
Member

gkellogg commented Oct 5, 2021

Many of th N3 list builtins could be considered as a basis for SPARQL methods.

@namedgraph
Copy link

Wouldn't it make sense to continue to be aligned with XPath?

@ericprud
Copy link
Member

ericprud commented Oct 6, 2021

Noting that @namedgraph also said

Looks quite a bit like XPath 3.1 array functions.

, it looks like it's time to do some homework.

An apparent, but unstated semantics of the SPARQL grammar is that you can only bind new variables in a few places:

This excludes FILTER so I think the only ways we can use functions named by URLs (i.e. FunctionCall) is with BIND or maybe some (ab?)use of SERVICE. Additionally, the results of BGP and join operations are unordered; ORDER is imposed at the end. This means that anything we do to bind list members in order is going to impose join semantics beyond those of SPARQL 1.1 (and possibly break some optimizations).

XPath operations produce sequences (maybe was node-set back in 1.0?). The core path expressions like /* map the DOM children to a sequence. The XPath array functions operate over array(*) (I don't know why that's not a sequence, but there's a constructor to map sequences to arrays. Whatever) and come int scalar an array(-generating) flavors:

In order to implement any functions or operators to flatten list members, I think we first have to postulate an ordered flag on results (at least, that's what I did when adding list operators to SWObjects). This constrains join, though probably in a way that naive implementations already conform to.

As far as I can see, the use of e.g. /* to bind list members, plus XPAth array functions, doesn't give us the ops we'd want for query, e.g. set and list (ordered and unordered) variants of starts, ends, includes, excludes, but I'm keen to see formulations that prove me wrong. Hopefully, this leg work will lead to a strawman.

🔴 This is the best of my recollection right now but @afs and others are likely to correct some mistakes. edits welcome.

@namedgraph
Copy link

@ericprud XSLT/XPath 1.0 had node-sets, 2.0 introduced sequences, 3.0 introduced arrays and maps.
So sequences should not be confused with arrays. In 3.0 arrays and maps are standalone constructs with their own constructors.

@namedgraph
Copy link

Looks like this issue overlaps with #46 at least partially.

@mgberg
Copy link
Author

mgberg commented Oct 11, 2021

SPARQL Nested Tables is another but not completely overlapping proposal that could solve some of the issues discussed above which could definitely help with coercing graph query results into a tabular format.

@rvesse
Copy link

rvesse commented Dec 14, 2023

Also seems like some relationship to #6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants