Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GROUP_CONCAT sorting #9

Open
kasei opened this issue Apr 3, 2019 · 15 comments
Open

GROUP_CONCAT sorting #9

kasei opened this issue Apr 3, 2019 · 15 comments
Labels
query Extends the Query spec

Comments

@kasei
Copy link
Collaborator

kasei commented Apr 3, 2019

There should be a way to sort group rows before aggregation, particularly for GROUP_CONCAT.

Why

The use of GROUP_CONCAT is limited by the output not being deterministic with respect to the ordering of values in the group. For example, a GROUP_CONCAT(?name) aggregate might produce "Alice Bob Eve" in one implementation, and "Bob Alice Eve" in another (or even within the same implementation from different query evaluations). Allowing users to specify an implicit/explicit ordering for rows in an aggregation group would improve interoperability.

Previous work

This was raised as ISSUE-66 and subsequently postponed during work on SPARQL 1.1.

Jindřich Mynarz discusses this as a potential SPARQL 1.2 feature.

Considerations for backward compatibility

At the user-level, this is purely additive as SPARQL 1.1 aggregate groups do not have any explicit ordering. However, it would require updates to the existing definition of SPARQL Algebra Set Functions which are defined in terms of multisets, not ordered sequences.

@JervenBolleman JervenBolleman added the query Extends the Query spec label Apr 3, 2019
@jaw111
Copy link
Contributor

jaw111 commented Apr 8, 2019

Looking at the grammar for Aggregate I never realized one could jam any Expression into an aggregate.

Would it be possible to extend that grammar rule to something like:

'GROUP_CONCAT' '(' 'DISTINCT'? Expression ( 'ORDER' ( 'ASC' | 'DESC' ) )? ( ';' 'SEPARATOR' '=' String )? ')'

To allow things like GROUP_CONCAT(?name ORDER ASC) and GROUP_CONCAT(DISTINCT SHA1(?name) ORDER ASC).

@kasei
Copy link
Collaborator Author

kasei commented Apr 8, 2019

Yes, I think that would be a logical approach to implementing this feature. More complex options may exist for allowing sorting by a value not being used in the CONCAT expression.

@jaw111
Copy link
Contributor

jaw111 commented Apr 8, 2019

Indeed, sorting within a group would be great, I think that could be in scope for #23

@cygri
Copy link

cygri commented May 5, 2019

A limit=x argument for group_concat would also be useful, for use cases such as: “Give me locations and up to six photos from each location.”

@kasei
Copy link
Collaborator Author

kasei commented May 5, 2019

@cygri can you expand on that a bit? How would you use group_concat to do that? I’m interested because that’s exactly the sort of query that motivated me to implement window functions (#47) in SPARQL.

@kasei
Copy link
Collaborator Author

kasei commented May 6, 2019

If we wanted to pursue the more complex option of allowing sorting of GROUP_CONCAT inputs by other values in the input result set, a big question would be how sorting would interact with the DISTINCT flag. Some of the options might include:

  1. Don't allow the use of both DISTINCT and ORDER BY at the same time
  2. Allow both modifiers, and:
    a. Implement the DISTINCT by only including the first occurrence of each value
    b. Implement the DISTINCT as producing distinct values per unique combination of ORDER BY expression values
    c. Implement the DISTINCT as a (perhaps arbitrary) choice between values per unique combination of ORDER BY expression values

Any thoughts? I'm not sure if all of these would be valuable in practice, but I think allowing arbitrary ordering in general is a valuable feature.

@cygri
Copy link

cygri commented May 6, 2019

@kasei I suppose group_concat is kind of the poor man's window functions support. The query for “Give me locations and up to six photos from each location” would be:

SELECT ?location (IRI(?photoStr) AS ?photo) {
    {
        SELELCT ?location (group_concat(?photo; limit=6; separator=" ") AS ?photos) {
            ?location a :Location.
            ?photo :depicts ?location.
        }
        GROUP BY ?location
    }
    ?photoStr apf:strSplit (?photos " ")
}

With VALUES OF (see #6), the apf:strSplit call would be replaced with something like:

VALUES ?photoStr OF split(?photos, " ")

In vanilla SPARQL 1.1, if neither apf:strSplit nor limit on group_concat is available, this actually still can be done, but it's ugly.

group_concat plus some form of split plus some clever string manipulations can go a surprisingly long way (if you can overcome the gag reflex).

P.S.: I have not yet had a chance to really look into window functions, so am open to explore them as a probably superior alternative to this kind of hackery.

@jaw111
Copy link
Contributor

jaw111 commented May 6, 2019

@cygri if the SERVICE Variables from SPARQL 1.1 Federated Query were available, it may work to federate to itself, whereby a series of separate invocations of SPARQL query services would happen where the results of each invocation are combined using union.

SELECT ?location ?photo {
    {
        SELECT ?location (<http://example.com/sparql> AS ?endpoint) {
            ?location a :Location.
        }
    }
    
    SERVICE ?endpoint {
        SELECT ?location ?photo {
            ?photo :depicts ?location.
        }
        LIMIT 6
    }
}

@cygri
Copy link

cygri commented May 6, 2019

@jaw111 Subqueries are evaluated inside-out. So ?location is unbound inside the subquery inside SERVICE. So the LIMIT applies to all photos, and not to photos of a particular location.

@jaw111
Copy link
Contributor

jaw111 commented May 6, 2019

@cygri yes, that's true. I was being optimistic that the query engine would use a VALUES clause to constrain the results by passing the ?location binding for each invocation of the service all the way into the sub-query.

@jaw111
Copy link
Contributor

jaw111 commented May 6, 2019

Another (off the cuff) suggestion borrowing PARTITION from @kasei to serve to specify an order and limit within each 'slice'. Usage:

SELECT ?location ?photo {
    ?location a :Location .
    ?photo :depicts ?location ;
        :date ?date .
}
PARTITION BY ?location
PARTITION ORDER BY DESC(?date)
PARTITION LIMIT 6
ORDER BY ?location ?date
LIMIT 60

So this should return (at most) the 6 most recent photos per location.

@kasei
Copy link
Collaborator Author

kasei commented May 6, 2019

@jaw111 That is almost exactly one of the examples I give in #47 as something that window functions can do (with a more general syntax).

kasei added a commit to kasei/sparql-12 that referenced this issue Jun 21, 2019
@kasei
Copy link
Collaborator Author

kasei commented Jun 21, 2019

I've added some prototype tests for ordering by arbitrary expression here using a strawman syntax:

SELECT (GROUP_CONCAT(?name; ORDER BY ?number) AS ?names)
WHERE {
	VALUES (?name ?number) {
		("four" 4)
		("one" 1)
		("three" 3)
		("two" 2)
	}
}

The ; ORDER BY expr syntax is a bit asymmetric with the existing SEPARATOR="," (but exactly the same as the ORDER BY solution modifier).

@jindrichmynarz
Copy link

Speaking of syntax, it would be good to align it with named arguments for functions (#64).

@kasei
Copy link
Collaborator Author

kasei commented Jun 23, 2019

@jindrichmynarz I'm torn about this. Absent something like #64, I think it would be obvious that we'd want to align the syntax with the existing ORDER BY solution modifier. With #64, we're likely to have two incompatible things we'd like to align with. And unlike SEPARATOR, ORDER BY doesn't have a single scalar value that can use assignment-like syntax. It can have multiple comparators, which can be variables or complex expressions, with optional ordering (ascending/descending). And the current syntax for expressing multiple comparators doesn't even group them syntactically (e.g. with parenthesis), so I'm not sure it's obvious what form a function-argument-like syntax for this would even look like. Do you have ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
query Extends the Query spec
Projects
None yet
Development

No branches or pull requests

5 participants