Rewrite language
joaniefromtheblock authored Aug 19, 2024
1 parent 8dfc167 commit 34538bf
Showing 1 changed file with 30 additions and 22 deletions: docs/how-to/run-at-scale.md
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Running Web3Signer at scale

When running Web3Signer at scale with hundreds or thousands of keys, several factors affect attestation performance on validators.
Horizontal scaling reduces request latency on Web3Signer. To maintain low signing latency and high safety, connect multiple Web3Signer instances to the same slashing database.

The primary performance cost occurs during startup. More keys increase Web3Signer's startup time, representing a one-time cost per restart.

![architecture-diagram](../../static/img/transparent_background_diagram.png)*Credit: Kiln. Architecture diagram of Web3Signer at scale*

When configuring your environment, consider the startup delay, the number of keys managed, and available system resources.

Balancing these factors optimizes system performance and responsiveness. Regular monitoring and tuning are necessary as the number of managed keys grows or network conditions change.
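
For example, the following is a minimal sketch of starting one of several identical Web3Signer instances against the same slashing protection database. The host names, port, credentials, key store path, and network are placeholders; adjust them for your environment.

```bash
# One of several identical Web3Signer instances behind a load balancer.
# All instances point at the same slashing protection database, so any of
# them can sign for any key without slashing risk.
web3signer \
  --key-store-path=/opt/web3signer/keys \
  --http-listen-port=9000 \
  --metrics-enabled=true \
  eth2 \
  --network=mainnet \
  --slashing-protection-db-url="jdbc:postgresql://db.internal:5432/web3signer" \
  --slashing-protection-db-username=postgres \
  --slashing-protection-db-password=changeme
```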

## Database proximity

The [slashing database](./configure-slashing-protection.md) ensures the safe management of multiple validators. Optimizing the slashing database reduces latency and overhead, improving overall system performance.

- **Reduced geographic latency**: Strategically place Web3Signer instances to ensure minimal distance to the slashing protection database.
- **Performance tuning**: Optimize database configurations for rapid access, considering factors like indexing and connection pooling (see the example below).
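
As a hedged sketch, the slashing protection database connection pool can be tuned through a standard HikariCP properties file supplied to Web3Signer. The file path, database host, credentials, and pool values below are placeholders, and the `--slashing-protection-db-pool-configuration-file` option name should be verified against the [CLI reference](../reference/cli/options.md) for your release.

```bash
# Write a small HikariCP pool configuration (standard HikariCP settings):
# keep enough connections open for peak signing load, and fail fast
# (in milliseconds) if the database cannot hand out a connection.
cat > /etc/web3signer/hikari.properties <<'EOF'
maximumPoolSize=10
minimumIdle=5
connectionTimeout=3000
EOF

# Point Web3Signer at the shared database and the pool configuration file.
web3signer eth2 \
  --slashing-protection-db-url="jdbc:postgresql://db.internal:5432/web3signer" \
  --slashing-protection-db-username=postgres \
  --slashing-protection-db-password=changeme \
  --slashing-protection-db-pool-configuration-file=/etc/web3signer/hikari.properties
```

In practice, placing the database in the same region or availability zone as the Web3Signer instances usually matters more than any individual pool setting.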

## Threading model optimization

Web3Signer uses [Vertx](https://vertx.io/docs/vertx-core/java/) as its threading framework. While powerful, Vertx requires proper configuration for optimal performance in different environments. If you encounter request latency or blocked threads, adjust the [worker pool size](../reference/cli/options.md#vertx-worker-pool-size).

To manage concurrency, tailor Web3Signer's thread pool size to your expected load. Increase the pool size if you observe decreased attestation performance during peak signing loads.
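
For example, a minimal sketch of raising the pool size on one instance; the value of 40 is illustrative rather than a recommendation, and the database details are placeholders.

```bash
# Raise the Vert.x worker pool above the default if signing requests queue
# up at peak load; tune the value against the metrics described below.
web3signer --vertx-worker-pool-size=40 \
  eth2 \
  --slashing-protection-db-url="jdbc:postgresql://db.internal:5432/web3signer" \
  --slashing-protection-db-username=postgres \
  --slashing-protection-db-password=changeme
```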

You can implement monitoring tools for dynamic thread adjustments based on current demand and workload. Measure spikes and adjust the pool accordingly; an example check follows the metrics list below.

You can use the following [metrics](./monitor/metrics.md):

- `http_vertx_worker_queue_delay`: The time requests spend in the queue before being processed.
- `http_vertx_worker_pool_completed_total`: The number of queries processed by Web3Signer.
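
To spot-check these values, you can query the Prometheus endpoint directly. This assumes metrics are enabled and exposed on the default metrics port (9001 unless you changed `--metrics-port`); adjust the host for your deployment.

```bash
# Fetch the worker pool metrics from a running instance's metrics endpoint.
curl -s http://localhost:9001/metrics | \
  grep -E 'http_vertx_worker_queue_delay|http_vertx_worker_pool_completed_total'
```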

## Load balancing

At scale, deploy multiple Web3Signer instances behind an ingress load balancer to spread requests evenly and prevent overloading a single instance. Connecting all instances to the same slashing database allows them to sign in parallel without slashing risk.
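
As an illustrative sketch only, an NGINX reverse proxy could distribute requests across two instances. NGINX is an assumption here (any HTTP load balancer works), and the instance addresses are placeholders.

```bash
# Minimal NGINX configuration spreading requests across two Web3Signer
# instances listening on their default HTTP port (9000).
cat > /etc/nginx/conf.d/web3signer.conf <<'EOF'
upstream web3signer_pool {
    least_conn;                # route each request to the least busy instance
    server 10.0.0.11:9000;
    server 10.0.0.12:9000;
}

server {
    listen 9000;
    location / {
        proxy_pass http://web3signer_pool;
    }
}
EOF

nginx -s reload
```

Note that each instance's `--http-host-allowlist` must permit the host name the load balancer forwards, otherwise proxied requests are rejected.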

For more information, see the [Kiln article](https://www.kiln.fi/post/learnings-from-running-web3signer-at-scale-on-holesky) on running Web3Signer at scale.

## Hardware recommendations

Tests run nodes managing 10,000 keys on various testnets. For example, a single Azure Standard D8as v5 VM (8 vCPUs, 32 GiB memory) can host Besu, Teku, and Web3Signer simultaneously. Your specific use case may require less powerful hardware (see the example dashboard below).

![Dashboard for Web3Signer](../../static/img/dashboard_hw.png)

Web3Signer consumes less than 2GB of JVM heap while managing 10,000 keys in this setup.
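
If you need to cap or raise the heap explicitly, one hedged approach is to pass standard JVM flags through the `JAVA_OPTS` environment variable, which the bundled start scripts typically honor; check the script shipped with your release. The heap value and database details below are illustrative placeholders.

```bash
# Cap the JVM heap at 3 GB, leaving headroom above the ~2GB observed with
# 10,000 keys in this test setup.
export JAVA_OPTS="-Xmx3g"
web3signer eth2 \
  --slashing-protection-db-url="jdbc:postgresql://db.internal:5432/web3signer" \
  --slashing-protection-db-username=postgres \
  --slashing-protection-db-password=changeme
```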

The test configuration connects one validator client to Web3Signer. Using multiple validator clients may change resource requirements, although distributing the same 10,000 keys across multiple clients keeps the total number of requests to Web3Signer the same.
