@jrambla commented on Tue Feb 07 2017
@david4096 commented on Tue Feb 07 2017
Suggest decentralized peer-to-peer over hub-and-spoke.
Each beacon can choose to know about some other set of beacons, and each beacon maintains its list of known peers however it likes. With this simple decentralized network architecture in place, we can begin to discuss how one might federate a query to a beacon's peers.
One can still create hubs, or network bottlenecks, by pointing all of a beacon's peers at some central node; in a peer-to-peer protocol that remains an option. Instead of developing an over-the-top protocol on the beacon API, we might consider adding to the spec the ability for beacons to list their peers. This would lead to a lighter stack with clear modes of participation.
We might not need to write new software, just present a simple interface for managing peers at the beacon level. The number of peers could be very small and still support a healthy network.
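To make "beacons list peers" concrete, here is a minimal sketch of what such an addition could look like. The `/peers` route, its JSON shape, and the example peer names are assumptions for illustration only, not part of the current Beacon spec.

```python
# Hypothetical sketch: a beacon that also serves a locally curated peer list.
# The "/peers" route and its JSON shape are assumptions, not the spec.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# This beacon's own peer list; each beacon manages its entries however it likes.
KNOWN_PEERS = [
    {"name": "example-backbone-beacon", "url": "https://beacon.example.org"},
    {"name": "example-hospital-beacon", "url": "https://beacon.hospital.example"},
]

class BeaconHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/peers":
            body = json.dumps({"peers": KNOWN_PEERS}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, "unknown path")

if __name__ == "__main__":
    # A handful of peers per beacon is enough to keep the network connected.
    HTTPServer(("localhost", 8080), BeaconHandler).serve_forever()
```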
@antbro commented on Tue Feb 07 2017
+1
It's exactly how Cafe Variome works.
Parallel discussion needed regarding user AuthN/AuthZ
Tony
@david4096 commented on Thu Feb 23 2017
Really enjoyed the conversation this morning!
Wanted to give y'all something to chew on. This first diagram attempts to capture a little about organizing a p2p network architecture. In the center you see the backbone, with Elixir, DNAStack, and UCSC listing each other as peers. From there the individual subnets are shown, including one for a private healthcare institution. Thick lines show peers that accept any connection; dashed lines show cases where a peer only accepts requests from a whitelisted entry.
Note that all actual data flows over HTTP methods; there is no BitTorrent or anything hiding here.
I hope that this shows how the various genomics APIs work together to create a service architecture that lets clients "upgrade" their connection from an oracle response to raw genomics data. ga4gh-beacon/specification#68
** NOTE ** I meant to say the logical OR of results for the aggregator, not AND.
I also took a moment to gather the thoughts around how a GA4GH Network Explorer might work. This is similar in spirit to Marc's beacon registry, except it pulls from the known good nodes to construct its network representation.
@mfiume Could you post the open concerns that were raised in the call?
Sybil attack (rogue nodes) - One node might claim to be another node. Proper use of DNS and SSL ameliorates this problem: a node claiming to be .nih.gov would have had to have hacked their DNS.
Bustamante attack - With enough prior knowledge of a sample, you can reason about its presence via a series of oracle responses. Allowing nodes to confer about what requests they have received would let us mitigate this at the network level; I hope we can constrain it to the individual beacon.
Federated queries - I should not have to know all the nodes of the network to ask a network-wide question. Setting up a peer-to-peer protocol where a clever client can federate a query itself is the first step to implementing it at the interservice level.
Traffic shaping - In a peer-to-peer network, super nodes, proxies, caches, and peer registries all provide ways that network participants can influence activity. However, the protocol allows one to make ad-hoc relationships, which by definition will only be observable by those clients (and their internet providers).
Crawling - As mentioned above, starting from a single good peer in a well-connected network, we will eventually be able to observe the network at some point in time.
Privacy - I hope that there will be one global genomics network; however, the protocol allows individual institutions to set up a network topology that suits their goals.
Of course, if you have any questions, just let me know!
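A rough sketch of the "clever client" federating a query itself, tying together the crawling and federated-queries points above: starting from a single good peer, walk a hypothetical /peers endpoint to observe the network, then OR the oracle responses. Endpoint paths, query parameter names, and response fields here are illustrative assumptions, not the settled Beacon wire format.

```python
# Client-side federation sketch: crawl peers from one seed beacon, then OR
# the oracle responses. All URLs and field names are placeholders.
import requests

def crawl(seed_url, max_nodes=100):
    """Breadth-first crawl of peer lists starting from a single good node."""
    seen, queue = set(), [seed_url]
    while queue and len(seen) < max_nodes:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            peers = requests.get(f"{url}/peers", timeout=5).json().get("peers", [])
        except requests.RequestException:
            continue  # unreachable peers are simply skipped
        queue.extend(p["url"] for p in peers)
    return seen

def federated_query(beacon_urls, chrom, pos, alt):
    """Ask every discovered beacon the same question; any 'yes' makes the answer yes."""
    params = {"chromosome": chrom, "position": pos, "alternateBases": alt}
    exists = False
    for url in beacon_urls:
        try:
            resp = requests.get(f"{url}/query", params=params, timeout=5).json()
            exists = exists or bool(resp.get("exists"))
        except requests.RequestException:
            continue
    return exists

if __name__ == "__main__":
    nodes = crawl("https://beacon.example.org")
    print(federated_query(nodes, "1", 10000, "A"))
```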
@kozbo commented on Thu Feb 23 2017
Thanks David,
I enjoyed the discussion today too. There are a few different organizations within GA4GH that need this capability. We are not the first ones to try to build a distributed network or to create an architecture compatible with federated queries, so it makes sense to me to collaborate on this effort.
I agree with Jordi: it will help if we can list out the specific use cases we would like to meet via this design. I think Beacon came to this problem already having a centralized solution implemented at DNAStack, so they have some Beacon-related use cases to get us started. The Genomics API also has some objectives in creating a directory. Let's see if all the use cases are harmonious.
Let me make a first cut at the use cases that I know of (please add to these!)
Discovery
Be able to discover all nodes in the network of a particular type (e.g., list all the Genomics API nodes that support the Variant and Read APIs). It is expected that a first-time user would use the GA4GH website to get started, but the search could be started from any node connected via some route to GA4GH.
Be able to discover all nodes in the network of a particular type running a version >= x.y.z (e.g., list all the Genomics API nodes that support the Variant APIs running SW >= 0.6.0); see the sketch after this list. It is expected that a first-time user would use the GA4GH website to get started, but the search could be started from any node connected via some route to GA4GH.
Generate a list of all nodes connected to the network (regardless of type or version). This would list out all the servers along with their info (what characteristics they have, what version they are running, plus whatever else is in the info field, the URL for instance).
Delegation (the ability to delegate work to nodes in the graph)
Issue a job to a node and have that node compute a result across all of its peers of a compatible type. This could be a search across all Genomics API servers for any Biosamples with a disease field set to a particular type of carcinoma. Note that the search would obviously be made only against nodes that support Biosamples and are running a version of the API compatible with the starting node. One could make a similar Beacon request or a variant search request. One could also do a search for servers that have RNA expression data for a particular gene. There are many examples of the work this sort of general job descriptor could handle.
?
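As a small illustration of the discovery-by-type-and-version use case above, here is a sketch of filtering a node registry. The node records and their field names ("type", "version") are assumptions, not a settled schema; presumably they would come from each server's info endpoint.

```python
# Discovery sketch: pick nodes of a given API type at or above a minimum version.
def parse_version(text):
    """Naively turn 'x.y.z' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split(".")[:3])

def discover(nodes, api_type, min_version="0.0.0"):
    """Return nodes of the given API type running at least min_version."""
    floor = parse_version(min_version)
    return [n for n in nodes
            if n["type"] == api_type and parse_version(n["version"]) >= floor]

nodes = [
    {"url": "https://a.example", "type": "variants", "version": "0.6.0"},
    {"url": "https://b.example", "type": "variants", "version": "0.5.1"},
    {"url": "https://c.example", "type": "reads", "version": "0.6.1"},
]
print(discover(nodes, "variants", "0.6.0"))  # only https://a.example qualifies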
Implementation notes
(Sorry, I can’t help myself!)
I admit that I have already started to state the use cases with a particular implementation in mind. We have been working on a peer service (as we mentioned in the meeting). We wrote up a little document that describes the protocol, which you can see here: https://docs.google.com/document/d/1hc-l7P0S0G8j19n0dKV9e0AeF8zgE3I88fNveSC178w/edit#heading=h.1j4vc4ls6v7v (you may need me to give you permission to see it; if so, just let me know).
I think that we can separate the problem space here a bit: the communication protocol required to build a distributed peer network should be separated from the delegation functionality. Once we build the network, we can construct some form of map-reduce on top of it to distribute work and collect results. We will need to provide some logic to help implementers join and manage the network of interconnections in a way that provides a good basis for distributed workloads. There has been some interesting work done in this area: https://gnunet.org/sites/default/files/gossip-podc05.pdf.
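To show what that map-reduce layer might look like once the peer network exists, here is a sketch of a node fanning a job out to compatible peers and reducing the partial results. The job shape, the peer-query callable, and the biosample reduction are hypothetical; only the control flow is the point.

```python
# Delegation sketch: map a job over peers in parallel, then reduce the results.
from concurrent.futures import ThreadPoolExecutor

def run_job_locally(job):
    # Placeholder for "whatever it takes to answer over my data".
    return {"matches": []}

def delegate(job, peers, query_peer, reduce_results):
    """Map the job over compatible peers concurrently, then fold in the local result."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(lambda peer: query_peer(peer, job), peers))
    partials.append(run_job_locally(job))
    return reduce_results(partials)

def merge_biosample_results(partials):
    """Example reduction for a biosample search: concatenate and de-duplicate matches."""
    merged = {}
    for part in partials:
        for match in part.get("matches", []):
            merged[match["id"]] = match
    return {"matches": list(merged.values())}

if __name__ == "__main__":
    # Stand-in for an HTTP call to a peer's search endpoint.
    fake_query = lambda peer, job: {"matches": [{"id": f"{peer}-sample-1"}]}
    print(delegate({"disease": "carcinoma"}, ["peer-a", "peer-b"],
                   fake_query, merge_biosample_results))
```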
Thanks for reading this far, and please help with the use case list.
-Kevin
@ljdursi commented on Thu Feb 23 2017
So I'm part of a team (connecting institutes in Toronto, Vancouver, and Montreal) that's starting to build something like that network of hospitals in the bottom corner of David's first diagram, to perform national-scale analysis over private, locally-controlled health research data using GA4GH APIs. We're just getting up to speed now.
I can see the advantages of dynamic peer discovery in many use cases, even if not ours - our network of participating sites isn't going to change very often. I do have a question about the delegation part, though.
What does the working group see as the advantages of having delegation occur on the server/API side as opposed to having a server list (well-known servers in our case, or obtained from the peer API) and having the client perform multiple individual queries against those servers?
For our use cases, there are at least two important potential advantages of pushing delegation into the API, but they would involve features that, as far as I know, aren't on any roadmap and so currently require us to build a layer on top of the GA4GH APIs. On the other hand, if there were appetite for having those be part of the GA4GH API, that would change how we approach some development tasks.
@kozbo commented on Thu Feb 23 2017
Hi Jonathan,
The main advantage of delegation is when you have a large number of servers to control. Having a single client machine collect all the data from 100 or more systems is not very efficient; that solution follows the star network pattern and is fine for small numbers of nodes. The planning for the delegation feature is a way for us to look forward and plan for an efficient information-gathering network in the future, when we have 1000's of nodes (or more).
I am interested to know your use case for delegation and what capability you are building on top of the GA4GH APIs. Have you proposed any enhancements to support your needs?
Thanks,
-Kevin
@ljdursi commented on Thu Feb 23 2017
Hi, Kevin:
The delegated queries can certainly be more efficient (in elapsed time, if not necessarily in communications) when the operation is done in parallel through the tree rather than with a linear scan; that can be done if there's a well-defined "combine results" operator. Those combination operations are of interest to us - having some simple forms of normalization occur on the server side. Say, for instance, the object the user eventually wants to work with is a genotype matrix over a region (which itself raises a bunch of API/schema issues currently). There will be rows corresponding to variants present in some data sets and not others, and having the resulting matrix be correctly combined on the server side rather than on the client would be of interest.
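A sketch of what such a server-side "combine results" operator could look like for the genotype-matrix case: take the union of variant rows across sites and mark samples from sites that did not report a row as missing. The matrix encoding (a dict of variant rows per site) is an assumption chosen for brevity, not any GA4GH schema.

```python
# Combine per-site genotype matrices: union of variant rows, '.' for missing calls.
def combine_genotype_matrices(per_site):
    """per_site: {site: {"samples": [...], "rows": {variant_id: [genotypes]}}}"""
    all_samples, columns = [], {}
    for site, data in per_site.items():
        columns[site] = (len(all_samples), len(data["samples"]))
        all_samples.extend(f"{site}:{s}" for s in data["samples"])

    all_variants = sorted({v for d in per_site.values() for v in d["rows"]})
    merged = {}
    for variant in all_variants:
        row = ["."] * len(all_samples)              # missing by default
        for site, data in per_site.items():
            start, width = columns[site]
            genotypes = data["rows"].get(variant)
            if genotypes is not None:
                row[start:start + width] = genotypes
        merged[variant] = row
    return all_samples, merged

site_a = {"samples": ["s1", "s2"], "rows": {"1:1000:A>T": ["0/1", "0/0"]}}
site_b = {"samples": ["s3"], "rows": {"1:2000:C>G": ["1/1"]}}
print(combine_genotype_matrices({"A": site_a, "B": site_b}))
```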
The other obvious potential for us concerns data privacy of various sorts. We're going to have cases where we don't want more individual data leaked than is necessary when performing some higher-level queries - if that can be prevented, then there are data sets that we could expose to our researchers at remote sites which otherwise would have to be completely private. Many of these sorts of queries are easily performed when there is some trusted third party - which could be the federated API if delegation exists, or a layer on top if not.
We have several work-in-progress proofs of concept, but will wait until those are more mature and tested to make any concrete proposals.
@david4096 commented on Thu Feb 23 2017
I would consider federation to be an additional feature some nodes will support in a later version. Please add your use cases here ga4gh/ga4gh-schemas#788
If we can construct a client query using this simple p2p protocol that satisfies the use case (although perhaps not the perceived performance), I am fairly confident federation will fall out nicely.
Designing a beacon aggregation query (in parallel) should be an implementation detail. To a client the operations are the same, and the aggregation beacon implementor can choose the extent of metadata provided about the oracle response. A deidentifying aggregator beacon that exposes the same protocol as a beacon-over-ga4gh might fit your case well. Imagine an API layer where you can specify which strings will be removed or keys occluded, similar to what Google does when you try to search for a credit card number. Code that separates out the deidentification concerns would greatly benefit the community.
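One way such a de-identifying layer might look, as a sketch: the aggregator operator declares which keys to occlude and which string patterns to redact before the combined response leaves the trust boundary. The field names in the example response and the chosen patterns are illustrative assumptions.

```python
# De-identification sketch: drop configured keys and redact matching substrings
# from an aggregated response before returning it to the client.
import re

OCCLUDED_KEYS = {"sampleId", "datasetId", "info"}
REDACTED_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # e.g. SSN-like strings

def deidentify(obj):
    """Recursively remove occluded keys and redact matching substrings."""
    if isinstance(obj, dict):
        return {k: deidentify(v) for k, v in obj.items() if k not in OCCLUDED_KEYS}
    if isinstance(obj, list):
        return [deidentify(v) for v in obj]
    if isinstance(obj, str):
        for pattern in REDACTED_PATTERNS:
            obj = pattern.sub("[REDACTED]", obj)
        return obj
    return obj

aggregated = {"exists": True,
              "responses": [{"beacon": "site-a", "exists": True,
                             "sampleId": "HG00096", "note": "contact 123-45-6789"}]}
print(deidentify(aggregated))
```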
I would like to constrain the features for this version by suggesting that delegation is an implementation detail, and that successful examples of delegation will inform how we might perform federation. In that spirit, here are three "beacon types" which differ only in implementation; they all expose the same protocol.
"Whatever it takes to respond to the query over my data," which I show as DB silos in the diagram above. These are a mixture of scripts and SQL, and will in practice not be very portable.
Beacon over the Variants API, which presents a very nice example of how to "upgrade" one's discovery connection. (@ljdursi's network, BRCA-exchange)
The third is a beacon API aggregator, which presents the exact same interface as the above two but generates its oracle response by taking the logical OR of its peers' responses (I said AND in the diagram mistakenly). It is up to the implementation to decide how much metadata is included in the aggregation. This is the notion behind https://github.com/knoxcarey/bob; however, last I checked it didn't present the beacon protocol, but a slight modification of it.
These are all different from the existing beacon-network software, which is an over-the-top application for crawling the nodes.
A fourth node type worth mentioning, which @ljdursi is alluding to, would mix the domain boundaries by implementing a beacon that queries multiple Genomics API servers to generate a single oracle response. For example, a beacon might aggregate the calls for a single gene by querying over ExAC and 1kgenomes using a SearchVariantsRequest, just for that gene. In principle we can go the other way as well, providing a variant set where each "sample" is a response from a given peer.
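A sketch of that fourth type, under loose assumptions: build one oracle response by issuing a SearchVariantsRequest against several GA4GH Variants API servers. The server URLs and variant-set IDs are placeholders, and the request body reflects only my reading of the schema (POST to /variants/search with variantSetId, referenceName, start, end), so treat it as illustrative rather than definitive.

```python
# Beacon-over-Variants-API sketch: OR together variant presence across upstreams.
import requests

UPSTREAMS = [
    {"url": "https://ga4gh.example-exac.org", "variantSetId": "exac-vs-1"},
    {"url": "https://ga4gh.example-1kg.org", "variantSetId": "1kg-phase3"},
]

def beacon_exists(reference_name, position, alternate_base):
    """Return True if any upstream Variants API server reports a matching call."""
    for upstream in UPSTREAMS:
        body = {"variantSetId": upstream["variantSetId"],
                "referenceName": reference_name,
                "start": position, "end": position + 1}
        try:
            resp = requests.post(f"{upstream['url']}/variants/search",
                                 json=body, timeout=10).json()
        except requests.RequestException:
            continue
        for variant in resp.get("variants", []):
            if alternate_base in variant.get("alternateBases", []):
                return True   # one upstream match is enough for a beacon "yes"
    return False

if __name__ == "__main__":
    print(beacon_exists("1", 861320, "A"))
```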
Also @ljdursi, your comments regarding accessing calls do not fall on deaf ears. I've taken a step towards ameliorating the problem in this PR, which tries to improve the access pattern. Please do share your concerns with us!
@juhtornr commented on Fri Mar 31 2017
Current specification is in Google Docs: https://docs.google.com/document/d/1yanuZLTZkp_rPvECQc4VWbURwhLL56bpAYDh31lq6_0/edit#heading=h.uqx149oqe44w