Document store requirements gathering

Duke Leto edited this page Sep 10, 2013 · 11 revisions

Software (any replicate) potential requirements

  1. Must be able to host at least 100,000 mutable objects, each up to 50MB in size. (We currently hold about 2,600 studies and know of about 8,000 studies out there; the current largest is ~24MB. Objects could grow considerably with annotations, and automatic annotation could cause very rapid file-size growth on large trees. It is not clear what this implies for an upper limit, but it is something to think about.)

  2. Among these objects (perhaps exclusively?) are study-objects, currently represented as NexSON files (the Badgerfish convention applied to NeXML).

  3. The study-object format should be documented and extensible, and should retain backward compatibility.

  4. Must have basic version control features: past versions of each object (kept indefinitely, or limited to a maximum age?); commits (change sets) with commit messages; committer identification and timestamps; commit tags.

  5. Three modes of access:

    1. 'raw' mode: direct access to the backend datastore via the Git protocol
    2. 'visitor' mode: view/download the raw data via HTTP/HTTPS
    3. 'developer' mode: a web API at api.opentreeoflife.org that speaks JSON
  6. Deploying api.opentreeoflife.org should be straightforward and decoupled from other services

  7. Should be attractive to developers

    1. Familiar technologies
    2. Automated test suites
    3. Low-overhead testing
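To make requirement 2 concrete, here is a minimal sketch of the Badgerfish XML-to-JSON convention that underlies NexSON: attributes become `@`-prefixed keys, element text becomes `"$"`, and repeated child elements are promoted to lists. Real NexSON applies further rules on top of this; the `otu` example below is a toy, not actual Open Tree data.

```python
# Toy Badgerfish converter: "@" for attributes, "$" for text,
# lists for repeated sibling elements.
import json
import xml.etree.ElementTree as ET

def badgerfish(elem):
    out = {}
    for name, value in elem.attrib.items():
        out["@" + name] = value
    text = (elem.text or "").strip()
    if text:
        out["$"] = text
    for child in elem:
        converted = badgerfish(child)
        if child.tag in out:
            # A second occurrence of a tag promotes the value to a list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(converted)
        else:
            out[child.tag] = converted
    return out

xml = '<otu id="t1"><label>Homo sapiens</label></otu>'
root = ET.fromstring(xml)
doc = {root.tag: badgerfish(root)}
print(json.dumps(doc))
# → {"otu": {"@id": "t1", "label": {"$": "Homo sapiens"}}}
```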
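Requirement 4's version-control features map directly onto plain git: one file per study-object, with commits supplying change sets, messages, committer identity, timestamps, and tags for free. A sketch, assuming the study-object is stored as a JSON file (the `study_9.json` name and the curator identity are made up for illustration):

```python
# Sketch: git as the versioned backend for mutable study-objects.
import os
import subprocess
import tempfile

repo = tempfile.mkdtemp()

def git(*args):
    """Run a git command inside the repo and return its stdout."""
    return subprocess.check_output(("git",) + args, cwd=repo).decode()

git("init")
git("config", "user.email", "curator@example.org")  # hypothetical identity
git("config", "user.name", "Example Curator")

# Commit a first version of a study-object, then an amended version.
path = os.path.join(repo, "study_9.json")
for body, msg in [('{"nexml": {}}', "import study 9"),
                  ('{"nexml": {"otus": {}}}', "add otus block")]:
    with open(path, "w") as f:
        f.write(body)
    git("add", "study_9.json")
    git("commit", "-m", msg)

log = git("log", "--oneline")
print(log)  # two commits, newest first
```

Past versions are then reachable with `git show <commit>:study_9.json`, and commit tags cover the release-labeling requirement.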

Hosting (production instance) potential requirements

  1. Survivability: objects must remain accessible in 'raw' mode even when no opentree-managed servers are running. (That is, everyone on the opentree project could disappear, become delinquent, or go broke, and the wider world would still be able to access or fork the content, and could even change the hosted content if a cooperating party has sufficient privileges.) People will always be able to access the raw JSON data via raw.github.com.
  2. 'Raw' mode should have high availability, high bandwidth, low latency
  3. Plan needs to be in place for contingency of destitution (e.g., 'raw' service could be no-cost)
  4. Plan needs to be in place in case of problems with the hosting service: it ends operation, changes incompatibly, raises prices until the service becomes unaffordable, or degrades to an unacceptable quality level
  • Authentication: how will users/curators identify themselves?
    • Will we always allow anonymous read access? Anonymous full-database scraping is effectively a DDoS, so we will need bandwidth and API-call throttling policies for greedy folks.
    • For write access, how will we manage the creation, storage, and expiration of API tokens?
  • How does the datastore integrate with the OTOL backup strategy?
  • How will updates to the production system be deployed?
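One standard way to implement the per-client throttling mentioned above is a token bucket: each client earns tokens at a steady rate and each API call spends one, so short bursts are allowed but sustained scraping is refused. A minimal sketch (the rate and burst limits are illustrative, not an opentree policy):

```python
# Token-bucket throttle: `rate` tokens per second, bursts up to `capacity`.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        """Spend one token if available; otherwise refuse the call."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would answer with HTTP 429

# One token per second, bursts of five: eight rapid calls let the
# first five through and throttle the rest.
bucket = TokenBucket(rate=1, capacity=5)
results = [bucket.allow() for _ in range(8)]
print(results)
```

A production API would keep one bucket per authenticated token or client IP, which also ties the throttling question back to the authentication question above.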