Document store requirements gathering

Software (any replica): potential requirements

  1. Must be able to host at least 20,000 (?) mutable objects ranging in size up to 100 MB uncompressed (? we talked about carving up big objects). (Currently about 3,000; we know of about 8,000 studies out there. The current largest is ~32 MB? Objects could grow considerably with annotations: automatic annotation could cause very rapid file-size growth on large trees. Not sure what this implies as far as an upper limit, but something to think about.)
  2. Among these objects (perhaps exclusively?) are study objects, currently represented as NexSON files, i.e., Badgerfish applied to NeXML (see the sketch after this list).
  3. The study-object format should be documented and extensible, and should retain backward compatibility.
  4. Must have basic version control features: past versions of each object (kept indefinitely, or limited to a maximum age?); commits (change sets) and commit messages; committer identification and timestamps; commit tags. (See the git sketch after this list.)
  5. Two modes of access: (a) 'raw' mode: read or write an object using platform-specific HTTP calls (how do we deal with concerns about formatting changes causing file rewrites even for minor edits?); (b) 'layered' mode: read or write using opentree-specific(?), platform-independent HTTP calls. (Both modes are sketched after this list.)
  6. Deploying 'layered' mode should be straightforward and decoupled from other services.
  7. Should be attractive to developers
    1. Familiar technologies
    2. Automated test suites
    3. Low-overhead testing
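
Requirement 2 refers to NexSON, i.e., the Badgerfish convention applied to NeXML. As a rough illustration of that convention (not the full NexSON spec, which also handles namespaces and ordering rules), here is a minimal Python sketch: attributes become "@"-prefixed keys, element text becomes "$", and repeated child elements collapse into lists.

```python
# Minimal Badgerfish-style XML-to-JSON sketch; illustrative only.
import json
import xml.etree.ElementTree as ET

def badgerfish(element):
    """Convert one XML element to a Badgerfish-style dict (namespaces ignored)."""
    node = {}
    for name, value in element.attrib.items():
        node["@" + name] = value              # attributes -> "@name"
    text = (element.text or "").strip()
    if text:
        node["$"] = text                      # element text -> "$"
    for child in element:
        converted = badgerfish(child)
        if child.tag in node:                 # repeated elements -> list
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(converted)
        else:
            node[child.tag] = converted
    return node

fragment = '<otus id="otus1"><otu id="otu1" label="Homo sapiens"/></otus>'
root = ET.fromstring(fragment)
print(json.dumps({root.tag: badgerfish(root)}))
# {"otus": {"@id": "otus1", "otu": {"@id": "otu1", "@label": "Homo sapiens"}}}
```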
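
The version-control features in requirement 4 all come for free if the store is backed by git, which the raw.github.com idea below already assumes. A sketch, using a hypothetical local clone and study path:

```python
# Version-control features via git: commits, messages, committer identity,
# timestamps, tags, per-object history. Repo and file paths are hypothetical.
import subprocess

repo = "/path/to/treenexus"  # hypothetical local clone of the document store

# Commit a changed study; git records committer identity and timestamp.
subprocess.run(["git", "-C", repo, "commit", "-am", "Annotate study 10"], check=True)

# Tag the change set, then list past versions of a single object.
subprocess.run(["git", "-C", repo, "tag", "study10-v2"], check=True)
history = subprocess.check_output(
    ["git", "-C", repo, "log", "--format=%h %an %ad %s", "--", "study/10/10.json"],
    text=True,
)
print(history)
```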
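
To make requirement 5 concrete, here is a hedged sketch of the two access modes, assuming the raw store is a GitHub repository and 'layered' mode is an opentree-run HTTP service; all URLs, paths, and endpoint names below are hypothetical placeholders, not an existing API.

```python
# 'Raw' vs. 'layered' access sketch (hypothetical URLs and endpoints).
import requests

study_id = "10"  # hypothetical study identifier

# (a) 'Raw' mode: platform-specific call straight to the hosting service.
raw_url = ("https://raw.github.com/OpenTreeOfLife/treenexus/"
           f"master/study/{study_id}/{study_id}.json")   # hypothetical path
study = requests.get(raw_url).json()

# (b) 'Layered' mode: platform-independent call to an opentree service that
# hides the storage backend behind a stable interface.
api_url = f"https://api.opentree.example/v1/study/{study_id}"  # hypothetical
study = requests.get(api_url).json()

# A layered write could canonicalize the serialization before committing, so
# that minor edits do not rewrite the whole file (the concern in 5a).
requests.put(api_url, json=study)
```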

Hosting (production instance): potential requirements

  1. Survivability: objects must be accessible in 'raw' mode even when there are no opentree-managed servers running. (That is, everyone on the opentree project could disappear, become delinquent, or go broke, and the wider world would still be able to access or fork the content, and also change the content as hosted, given a cooperating party with sufficient privileges.) People will always be able to access the raw JSON data via raw.github.com (see the sketch after this list).
  2. 'Raw' mode should have high availability, high bandwidth, low latency
  3. A plan needs to be in place for the contingency of destitution (e.g., the 'raw' service could be no-cost).
  4. A plan needs to be in place in case of problems with the hosting service: it ends operation, changes incompatibly, changes its pricing so that the service becomes unaffordable, or degrades in quality to an unacceptable level.
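
The survivability requirement above ('raw' access with no opentree servers running) is already satisfiable via raw.github.com, as noted in item 1. A minimal sketch, with a hypothetical repository and study path:

```python
# Fetch a study's raw JSON directly from GitHub; no opentree server involved.
# The repository and file path are hypothetical placeholders.
import json
import urllib.request

url = ("https://raw.github.com/OpenTreeOfLife/treenexus/"
       "master/study/10/10.json")
with urllib.request.urlopen(url) as response:
    study = json.loads(response.read().decode("utf-8"))
print(sorted(study.keys()))

# Forking the whole store is equally independent of opentree, e.g.:
#   git clone https://github.com/OpenTreeOfLife/treenexus.git
```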
  • Authentication: how will users/curators identify themselves?
  • How does the datastore integrate with the OTOL backup strategy?
  • How will updates to the production system be deployed?