Scaling to 1M
This page presents a methodology for estimating BrowserID capacity and a plan for scaling out to handle 1M active users.
An "active user" for the purposes of these estimates is a user with two devices who uses 4 different BrowserID-enabled sites and visits those sites about 40 times a day, evenly spread out over site and device. Users are required to re-authenticate to BrowserID once a week per device, and "typical sites" remember user authentication for about 6 hours before requiring re-authentication via BrowserID. Finally, we expect about a 20% growth rate per month.
For the purposes of this discussion, there are 6 interesting types of activities that occur on browserid:
- "signup" - A user creates a new BrowserID "account" via the dialog
- "reset_pass" - A user resets their password through the dialog
- "add_email" - A user adds a new email to their BrowserID "account"
- "reauth" - A user re-authenticates to BrowserID
- "signin" - A user uses BrowserID to sign into a site
- "include_only" - An RP includes include.js
Note that these activities overlap. Specifically, a "signin" activity is the least costly of them and occurs inline as part of all of the other activities above it.
The above parameters suggest that 1M active users corresponds to:

- 2.10 signup activities per second
- 0.42 reset_pass activities per second
- 0.84 add_email activities per second
- 1.68 reauth activities per second
- 93.93 signin activities per second
First, this is guesswork and estimation. The goals here are to make conservative guesses at initial hardware requirements and to have a means of quantifying the performance of the system. Once we have significant usage of the system we can use live data as a basis for capacity planning rather than simulated load.
Second, there are many changes planned or in progress that will change the cost of the various activities. The most important upcoming change is the move to certificates, which will increase the cost of assertion verification and will generally reduce disk load and network traffic while increasing computation requirements. So we will need to sanity-check our capacity planning as development progresses; a reasonable way to do that might be to periodically run load generation against reference hardware, tracking performance over time and pre-emptively catching software changes that significantly affect system load.
At present there are three services that constitute a BrowserID node capable of serving static resources and dynamic WSAPI calls:
- mysql - persistence
- nginx - reverse proxy in front of the node.js API server; handles all static requests
- node.js server - handles all API calls
Additionally there is the verifier, which is itself a node.js server.
In the current production deployment all four of these services run on a single node, which is in turn behind a load balancer.
Static resources served from browserid servers fall into three groups:
- include.js served to RPs at the time their page with a login button is loaded
- dialog resources served to RPs at the time a user clicks on a login button
- site resources served to visitors of browserid.org (e.g. using the manage page or viewing the site)
Generally all of these resources are intended to be completely static. They can be cached and served by nginx or even higher levels of middleware, and if/when required we can offload them to an SSL CDN. The question is when to do this.
So how much bandwidth will be consumed by static resources?
- include.js is 4k minified and gzipped. It will be served to about 450 active BrowserID users per second by RPs that use BrowserID. Further, it will be served even more frequently than that, since very likely only a subset of each RP's visitors will actually authenticate with BrowserID. Assume that 10% of the users of BrowserID-enabled RPs authenticate with it. That reasoning yields:

  4k * 450 * 10 = 18M/sec
- dialog resources are 100k minified and gzipped. These will be served to about 100 users per second.

  100k * 100 = 10M/sec
- the manage page is also about 100k of resources. 4 users per second will be directed to these resources as part of the sign-in flow, and we can additionally make a conservative estimate that organic manage page usage is about 5% of the traffic to dialog.

  100k * 4 + 10M * .05 = 900k/sec
Given these estimates, with a million active users we can expect about 30M/sec of traffic in static resource requests. This estimate ignores browser caching and assumes we serve all static resources afresh.
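These numbers can be sanity-checked with a few lines of arithmetic (a sketch only; all figures come from the bullets above):

```javascript
// Sanity-check of the static bandwidth estimates (sizes in KB, rates in req/sec).
const includeJs = 4 * 450 * 10;            // 4k payload, 450 users/sec, 10x over-serve
const dialog    = 100 * 100;               // 100k payload, 100 users/sec
const manage    = 100 * 4 + dialog * 0.05; // 4 users/sec + ~5% of dialog traffic

const totalKBps = includeJs + dialog + manage;
console.log(Math.round(totalKBps / 1000) + 'M/sec'); // prints "29M/sec"
```

Which rounds up to the "about 30M/sec" figure used in the rest of this page.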
The only work to do here is to ensure all static resources are cached in the nginx layer with correct cache headers, and that they are gzipped properly.
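A minimal sketch of what that nginx configuration might look like (paths, content types, and TTLs here are illustrative assumptions, not our actual config):

```nginx
# Sketch only -- actual paths and TTLs depend on our deployment.
gzip on;
gzip_types text/css application/javascript application/json;

location /include.js {
    expires 1h;                # short TTL so fixes still propagate quickly
    add_header Cache-Control "public";
}

location /static/ {
    expires max;               # versioned resources can be cached indefinitely
}
```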
The verifier service is particularly hard to estimate at this point given how different the implementation is today, vs how it will look once we implement certificates.
The current implementation performs a network request to fetch the user's public key from the BrowserID servers, then uses that key to cryptographically verify the assertion.
Without the network call (which will not be present once certificates are implemented) verification of an assertion takes on the order of 15ms on a 2.66Ghz Core 2 processor.
Assuming that a move to certificates will triple the cost of verification, we can expect that each verification will take about 45ms.
With 1M active users we should expect about 100 verification actions per second, which will require 4.5 cores to satisfy. If we want to run the service at 20% utilization, that's about 22 cores.
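A quick sketch of that arithmetic:

```javascript
// Cores needed for assertion verification at 1M active users.
const verificationsPerSec = 100;  // estimate from this page
const msPerVerification   = 45;   // 15ms today, assumed to triple with certificates

const coresSaturated   = verificationsPerSec * msPerVerification / 1000; // 4.5
const targetUtilization = 0.20;
const coresProvisioned  = coresSaturated / targetUtilization;

console.log(coresSaturated, coresProvisioned.toFixed(1)); // prints "4.5 22.5"
```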
The other consideration for verification is that this is one part of the system that will move off of our servers and onto RPs, which is beneficial to them: it limits their reliance on external servers, probably decreases latency for users, and at smaller scale has a much better maintenance cost.
The other consideration here is that this is one area where a better native implementation can have a huge impact.
Open work for the verifier is to ensure that it can saturate all available cores. This might be as simple as running multiple instances on different ports and letting nginx or the load balancer round-robin between them.
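As a sketch of that deployment-configuration approach, nginx could round-robin across several verifier processes, one per core (the port numbers here are hypothetical):

```nginx
# One verifier instance per core; nginx round-robins by default.
upstream verifier_pool {
    server 127.0.0.1:10001;
    server 127.0.0.1:10002;
    server 127.0.0.1:10003;
    server 127.0.0.1:10004;
}

server {
    listen 443 ssl;
    location /verify {
        proxy_pass http://verifier_pool;
    }
}
```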
Not considering verifier calls, a simulated load of 1M active users corresponds to about 70% cpu utilization of a single 2.66 Ghz Core 2 processor. It corresponds to about 800k/sec average disk writes and 300k/sec disk reads.
Of this processor usage, over 90% is spent in bcrypt at 10 rounds. If we move to 12 rounds, we can't quite support 1M active users on two 2.66Ghz cores with the present implementation (which is consistent with the 70% number above).
Further, with the system under a simulated load of 1M active users on two processors, the average WSAPI call delay for an authentication request is 1.5s at 12 rounds and about 360ms at 10.
Finally, for reference, a password check takes 170ms of raw compute time on said processors at 10 rounds, and 760ms at 12 rounds.
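Those two measurements hang together: bcrypt's round count is a log2 work factor, so each additional round doubles the compute, and going from 10 to 12 rounds should roughly quadruple the check time:

```javascript
// bcrypt cost is exponential in the round count (2^rounds iterations).
const msAt10 = 170;                             // measured, 10 rounds
const predictedAt12 = msAt10 * 2 ** (12 - 10);  // doubling per round
const measuredAt12 = 760;                       // measured, 12 rounds

console.log(predictedAt12, measuredAt12); // prints "680 760" -- same ballpark
```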
The conclusion for WSAPI calls is that we're compute bound in bcrypt, and depending on our choice of rounds, the scaling requirements will change. We'll need about 3 cores to run around 20% capacity with 10 rounds and 12 cores for the same 20% target at 12 rounds.
We've agreed to outsource email to a different vendor. Given the numbers above, we can expect to send about 4 emails per second per million active users.
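The "about 4" figure follows from the activity rates above, assuming (an assumption, but a reasonable one) that each signup, password reset, and email addition triggers one verification email:

```javascript
// Email-sending activities at 1M active users (rates from this page).
const signup    = 2.10;  // activities/sec
const resetPass = 0.42;
const addEmail  = 0.84;

const emailsPerSec = signup + resetPass + addEmail;
console.log(emailsPerSec.toFixed(2)); // prints "3.36" -- call it 4/sec to be safe
```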
With 1M active users we run 33 QPS against MySQL, a completely manageable quantity. All we should do in this area is load up the database with sample data and figure out what our size requirements are. We should also go through every distinct type of query, ensure there are no table scans, and figure out a reasonable upper bound on QPS for a typical node.
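The query audit can be as simple as running EXPLAIN against each distinct query shape and checking the access type (table and column names below are hypothetical, for illustration only):

```sql
-- Hypothetical schema; the point is the workflow, not the names.
EXPLAIN SELECT u.id, e.address
  FROM users u
  JOIN emails e ON e.user_id = u.id
 WHERE e.address = 'user@example.com';
-- The "type" column should show an index lookup (ref/eq_ref),
-- never ALL, which indicates a full table scan.
```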
Outside of scaling to multiple nodes for redundancy, the mysql work to get to 1M is largely just making sure we haven't done anything stupid.
All of these actions have been entered in the issue tracker against the "scaling to 1M" milestone, and here they are broken down. We should:
- minify include.js: https://github.com/mozilla/browserid/issues/206
- serve static resources with proper cache headers: https://github.com/mozilla/browserid/issues/207
- gzip all static resources in nginx: https://github.com/mozilla/browserid/issues/207 (yeah, same issue as above)
- determine size requirements for mysql with 1M active: https://github.com/mozilla/browserid/issues/208
- audit all SQL queries performed and ensure they hit indices: https://github.com/mozilla/browserid/issues/209
- simulate a large database and establish a very conservative upper bound on QPS as a sanity check. https://github.com/mozilla/browserid/issues/210
- figure out the configuration of multiple mysql machines: https://github.com/mozilla/browserid/issues/211
- decide how many rounds we want to run with bcrypt: https://github.com/mozilla/browserid/issues/212
- support dynamic bcrypt work factor update: https://github.com/mozilla/browserid/issues/204
- ensure that the verifier can saturate multiple cores either via libraries or deployment configuration. https://github.com/mozilla/browserid/issues/213
- move to socketlabs for email delivery https://github.com/mozilla/browserid/issues/214
With 1M active users we:
- serve 30M/sec in static resources
- run 33 QPS through mysql
- saturate 4.5 cores with verification requests
- saturate 0.7 cores with WSAPI requests with 10-round bcrypt, OR saturate 2.1 cores with WSAPI requests with 12-round bcrypt
For the purposes of getting to 1M, I would suggest that we acquire at least 4 machines across at least 2 colos, having at least 24 cores between them. Three 4-core boxes in each of 2 colos, or two 8-core boxes in each of 2 colos, would both work equally well.
I suggest for simplicity's sake we continue to run multiple nodes that each run all services, and later incrementally move distinct services with unique requirements (like database or verification) onto more specialized clusters.
To repeat myself, this plan should hold us over long enough that we can make future decisions based on empirical data rather than lloyd's hokey load generation.
Finally, I have been careful to avoid proposing optimizations as a required part of the plan to get to 1M; however, I think we can invest time in several different areas and improve our cost model.