resultSet wrapper object for all responses #68

Tom-Shorter · 2021-05-07T14:08:46Z

There are a few problems with the current implementation of results, for g_variants there is a problem with pagination #26, for biosamples there is a problem with aggregation/dataset identification #67. There are a few other problems but instead of going over them in detail I would like to propose a new resultSet wrapper class and all response.results are always a list of resultSet objects.

@mbaudis has also mentioned this concept in #26, #67, #63 (and possibly elsewhere that I am missing) and his description of the concept from #26 (quoted below) is almost exactly what I was envisioning.

As in the linked discussions: IMO it would be good to have always one or more resultSet objects in the response. So basically one would introduce another layer instead of now

- results => array of responseResults (e.g. VariantInSampleResponseResults)

it becomes

- results => array of resultSet objects (with info about the resultSet, e.g. dataset...) => array of responseResults (e.g. VariantInSampleResponseResults)

The only change I would make is that results would not be an array of resultSet objects but instead an object containing resultSet objects where each resultSet object is identified by a unique key. This change is purely for making life easier for UI implementers as each resultSet can now be identified without having to look at a property of the resultSet. The unique key will likely be the datasetID or other such ID. I dont think this is possible to describe within openAPI however so I am more than happy to have an array of resultSets for simplicity as it makes little difference to the final outcome

A resultSet object can be formed from any responseObject, such as a genomicVariantResponse or a cohortResponse as well as any additional response objects which will be added in the future.

The below is an idea of what properties a resultSet object might have:

Identifier(required) - only if an array of resultSets is chosen over an object of resultSets, each given a unique key
source (required) - An array of unique ID's for each source where data was gathered to create this resultSet, such as a datasetID
responseType (required) - This would detail if the resultSet is a dataset, a cohort, etc. etc. This could likely be implied from the unique key given to the resultSet object/ identifier property or from the request that was made to the beacon but I do not want to assume that is always the case
exists (required) - boolean
count (optional) - no. of results
selectionCriteria (optional) - I am unsure about many of the details around cohorts but a property within the resultSet which details the applied selection parameters would likely be useful
results (optional - depends on level of access) - this will be an array of results, much like what is currently returned by response.results

The inclusion of this wrapper will also be of benefit when incorporating AAI into beacon v2, @jrambla has proposed a 4 stage access model as described below:

overall exists/count
dataset exists/count
minimal data response
expanded data response

The current response model does not allow for the above as there is no field for dataset exists/count other than within the datasetAlleleResponse object as far as I am aware. Adding in the resultSet wrapper will provide a logical place for these fields to be inserted and stage 2 - "dataset exists/count" will become "resultSet exists/count". If the user is able to access stages 3 and 4 of data then the resultSet object just needs to be expanded with the results property. Each resultSet object can also provide different information for stages 3 and 4 when compared to other resultSet objects of the same type as these can be tailored to each dataset etc.

This wrapper also solves a possible implementation problem with the aggregation of data from different datasets which should not be aggregated, such as individuals and biosamples. A beacon serving 4 datasets submitted from the same research group could come across a problem where the groups standard naming convention for individuals is "ind{X}" where x is an auto-incrementing integer, therefore there would be four version of "ind1" within the beacon, if the beacon applies aggregation to these individuals then a single result could be returned that is present within the 4 datasets but upon further inspection you realise that "ind1" is actually a different individual within each of the datasets and as such shouldn't have been aggregated. This issue should not arise with variants as there exist standardised naming conventions for variants, such as rsid's and HGVS.

This situation is also possible without aggregation due to the lack of a datasetID within /biosample responses, if a request was sent to /individuals/ind1/biosamples then all responses would look like they come from the same individual whereas they actually come from 4 individuals.

A lot of thought will need to be put into how this would actually be implemented and the properties to be included in the new object.

The text was updated successfully, but these errors were encountered:

jrambla · 2021-05-12T20:49:00Z

@Tom-Shorter @mbaudis @sdelatorrep as mentioned in the mail thread, I would like to see an example of the above suggestion.
The legacy DatasetAlleleResponse model is no longer applicable since we have cohorts as another collection type, therefore my assumption that applying the DatasetAlleleResponse could solve the need was incorrect/incomplete.
Having said so, the issue #62 should be combined with the suggested resultset.

mbaudis · 2021-05-12T20:59:44Z

@Tom-Shorter

The only change I would make is that results would not be an array of resultSet objects but instead an object containing resultSet objects where each resultSet object is identified by a unique key.

I understand this - we're doing something similar for cases where the objects are in a defined scope (e.g. retrieving dataset statistics, where the identifier then is the dataset_id). However, this may be problematic if having a response with a list/set ... where the scope of the identifier isnt't clear (dataset, cohort ...).

colinveal · 2021-05-13T15:32:56Z

@jrambla @mbaudis @Tom-Shorter @sdelatorrep So for us something like below would be ok to give us the splits of information per dataset whilst retaining a single section for results, however I'm still not sure whether dividing results into resultSets would better, what do you think @Tom-Shorter ?

"response": {
    "exists": true,
    "error": {
      "errorCode": "same as HTTP status code",
      "errorMessage": "string"
    },
    "numTotalResults": 100,
    "results": [
      {
"response": {
    "exists": true,
    "error": {
      "errorCode": "same as HTTP status code",
      "errorMessage": "string"
    },
    "numTotalResults": 0,
    "datasets":[
      {
           "datasetA": {
                 "exists":true,
                 "count":50,
                 "link":"",
                 "contact":"",
                 "etc":""
               }
       },{
           "datasetB": {
                 "exists":false,
                 "count":0,
                 "link":"",
                 "contact":"",
                 "etc":""
                }
        }
],
    "results": [
      {}]

mbaudis · 2021-05-13T15:51:03Z

Yes, this is one of the options which IMO work; having the per-resultSet aggregation data separate.

Notes:

It shouldn't be datasets then, but resultSets
I wouldn't use the current structure (I know this is just a doodle, but FYI) but keep it more like:

response:
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      biosampleCount: 116'232
      ...
    - id: datasetB
  ...
  results:
  ...

Still, the tagging of the individual response elements for their parent resultSet remains ...

Tom-Shorter · 2021-05-14T07:38:47Z

Still, the tagging of the individual response elements for their parent resultSet remains ...

I was going to say that this depends entirely on how the unique ID for a result is created but if the unique ID is created from a concatenation of a dataset ID and the result ID then if querying a cohort endpoint the unique ID would then not suffice as it would not contain the cohort ID so tagging would still be needed.

I think many users wont use beacon as a way of obtaining data, at least not through the results array, but the data handover may be more popular due to the data being in a standard format and not a beacon result format which will require more work to parse and may or may not contain the information you actually want. Instead users might use beacon as a way to locate where they can find suitable data, using the the results as an indicator of what data is actually available and then use the data sources own interface to query the data in a more customised way. No matter how well the beacon spec is designed a beacon query will not out perform a custom interface for the querying of data and the beacon results response will unlikely be as detailed or as easy to parse as one of the many standards already out there

The below is what I envisioned when I first started talking about resultSets but I do agree that @colinveal's approach will also work:

response:
  exists: true
  numTotalResults: 100
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    - id: datasetB
  ...

This approach allows you to pull out results from any dataset/cohort you want without having to loop through each result and find out which dataset it comes from by looking at a tag.

As mentioned in my original comment this could be slightly modified to have an object of resultSets identified by their ID, this allows for more precise data look ups as instead of having to loop through a list of resultSets to find the one with a matching ID you can instead access the correct resultSet directly.

response:
  exists: true
  numTotalResults: 100
  resultSets:
    "datasetA":
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    "datasetB":
  ...

Another consideration is the aggregation of common data, such as is seen in g_variants. It does make sense to only display this information once but if I query two datasets, one 0 based and the other 1 based then the position of the same variant will differ and the common data will not be common.

In the /filtering_terms endpoint we use the ontologyResource object as a datastore to describe ontologies, maybe the same could be true for other endpoints and we have, for example, a variantResource object which holds information about all variants seen in the results and then the results are linked to a variantResource entry though the variant ID. A caveat will have to be placed on this aggregation, only where beacons know with 100% certainty that an entity in two datasets is the same should they both point to the same resource. 100% certainty in this instance would be where it has been explicitly stated by those who did the analysis that they are the same or where each individual/biosample/... is assigned a unique ID which is used to identify them throughout all studies carried out, datasets created from those studies and additional cohorts the individual is part of. Aggregation without this level of certainty will lead to errors.

mbaudis · 2021-05-14T07:43:32Z

Ahem:

one 0 based and the other 1 based ... the common data will not be common

This case should not be supported.

Tom-Shorter · 2021-05-14T07:54:51Z

I agree, it shouldn't be supported in an ideal world but the beacon owner might not have control over the data in the beacon so they can't standardise it. Does the beacon owner then return all results using the same 0 or 1 base (I forget which is the beacon standard) knowing that the positions provided for variants would be misleading if the user was to use the data sources own query interface or do they return results that remain true to the data, eventhough this means possibly making more than one variantResource for the same variant. Either situation is not ideal and will both be of annoyance to a beacon user so it might be a case of choosing the lesser of two evils in how to handle this situation.

colinveal · 2021-05-14T08:37:26Z

So in summary for me any of these three would be fine, with a preference for the latter two:

response:
  ...
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      biosampleCount: 116'232
      ...
    - id: datasetB
  ...
  results:
  ...

response:
  exists: true
  numTotalResults: 100
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    - id: datasetB
  ...

response:
  exists: true
  numTotalResults: 100
  resultSets:
    "datasetA":
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    "datasetB":
  ...

mbaudis · 2021-05-14T08:40:51Z

@colinveal @Tom-Shorter @jrambla Can live with any of those, too, with 1 ~ 2 > 3, after some thinking :-)

Tom-Shorter · 2021-05-14T08:42:40Z

^ ditto

jrambla · 2021-05-14T10:49:37Z

Lovely discussion! Let’s refine it, or not ;-) , in the meeting. Jordi From: Tom Shorter ***@***.*** Sent: viernes, 14 de mayo de 2021 10:43 To: ga4gh-beacon/specification-v2 ***@***.***> Cc: Jordi Rambla ***@***.***>; Mention ***@***.***> Subject: Re: [ga4gh-beacon/specification-v2] resultSet wrapper object for all responses (#68) ^ ditto — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#68 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AB5SEOQ252EVT7426KARQQDTNTPBFANCNFSM44KFH5YA>.

sdelatorrep · 2021-05-17T09:19:34Z

Hi @Tom-Shorter! Could you create a PR introducing this change? Many thanks!

Tom-Shorter · 2021-05-17T09:23:51Z

hi @sdelatorrep, I made a pull request last week : #72 I probably haven't done everything correctly if you haven't been notified of it though so let me know.

sdelatorrep · 2021-05-17T10:11:04Z

Ah, sorry, I didn't see it. I'll review it asap. Thanks!

sdelatorrep · 2021-05-17T10:18:44Z

But @Tom-Shorter , this PR is only about filters, isn't it? I don't see anything about resultSet in it.

Tom-Shorter · 2021-05-17T10:47:43Z

Oh sorry @sdelatorrep , I'm an idiot.I saw pull request and immediately thought of the filters scout group pull request I made.

In the beacon call it was agreed that it would be best for someone from your group to add the resultsets in to the spec, I'm familiar with the g_variants endpoint responses but I haven't had to implement the other endpoints yet so I thought it best that someone who was familiar with the endpoints should update the spec.

We don't mind which properties are added to the resultSet wrapper object as long as it has the exists/count so feel free to set the object up however you wish to make sure it fits in with the rest of the beacon specs.

sdelatorrep · 2021-05-19T13:39:24Z

AFAIK, the winner was option 2 (see all here):

response:
  exists: true
  numTotalResults: 100
  resultSets:
    - id: "datasetA"
      type: "dataset"
      exists: true
      resultsCount: 116'232
      dataHandover: 
      results: []
      ...
    - id: datasetB
  ...

I'll add this to the spec.

sdelatorrep · 2021-05-21T06:42:56Z

Hi @Tom-Shorter! I've just created the PR implementing this. Would you mind taking a look at it? I couldn't add you as a reviewer, but I've sent you an invitation to the repo, if you want to join :) Thanks!

mbaudis · 2021-05-21T07:01:51Z

@Tom-Shorter @sdelatorrep I've just approved, but with a comment, especially regarding additional parameters for ResultSet (counts!):

in principle extend it to a subset of the BeaconDataset parameters, w/o making it explicitly a dataset

Use result set wrapper (#68)

jrambla · 2021-07-01T18:21:41Z

This is solved.

mbaudis mentioned this issue May 9, 2021

/biosamples response is missing a datasetID field #67

Closed

sdelatorrep added a commit that referenced this issue May 21, 2021

Use resultSet wrapper (#68)

d3a8d4e

sdelatorrep added a commit that referenced this issue May 21, 2021

Must be an array. Rename to plural (#68)

c3b3340

sdelatorrep added this to the version 2.0 milestone May 21, 2021

mbaudis added the enhancement New feature or request label May 26, 2021

sdelatorrep added a commit that referenced this issue Jun 7, 2021

Merge pull request #73 from ga4gh-beacon/use-result-set-wrapper

51f3e1d

Use result set wrapper (#68)

jrambla closed this as completed Jul 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resultSet wrapper object for all responses #68

resultSet wrapper object for all responses #68

Tom-Shorter commented May 7, 2021

jrambla commented May 12, 2021

mbaudis commented May 12, 2021

colinveal commented May 13, 2021

mbaudis commented May 13, 2021

Tom-Shorter commented May 14, 2021

mbaudis commented May 14, 2021

Tom-Shorter commented May 14, 2021

colinveal commented May 14, 2021

mbaudis commented May 14, 2021

Tom-Shorter commented May 14, 2021

jrambla commented May 14, 2021 via email

sdelatorrep commented May 17, 2021

Tom-Shorter commented May 17, 2021

sdelatorrep commented May 17, 2021

sdelatorrep commented May 17, 2021

Tom-Shorter commented May 17, 2021

sdelatorrep commented May 19, 2021 •

edited

Loading

sdelatorrep commented May 21, 2021

mbaudis commented May 21, 2021

jrambla commented Jul 1, 2021

resultSet wrapper object for all responses #68

resultSet wrapper object for all responses #68

Comments

Tom-Shorter commented May 7, 2021

jrambla commented May 12, 2021

mbaudis commented May 12, 2021

colinveal commented May 13, 2021

mbaudis commented May 13, 2021

Tom-Shorter commented May 14, 2021

mbaudis commented May 14, 2021

Tom-Shorter commented May 14, 2021

colinveal commented May 14, 2021

mbaudis commented May 14, 2021

Tom-Shorter commented May 14, 2021

jrambla commented May 14, 2021 via email

sdelatorrep commented May 17, 2021

Tom-Shorter commented May 17, 2021

sdelatorrep commented May 17, 2021

sdelatorrep commented May 17, 2021

Tom-Shorter commented May 17, 2021

sdelatorrep commented May 19, 2021 • edited Loading

sdelatorrep commented May 21, 2021

mbaudis commented May 21, 2021

jrambla commented Jul 1, 2021

sdelatorrep commented May 19, 2021 •

edited

Loading