Information useful for running a TdS stage.
Buildkite jobs are used to set up and tear down the Solana nodes on the TdS cluster, and to configure some genesis block settings. Most of the relevant settings can be set as environment variables within a Buildkite job, so we can change and re-deploy rapidly without needing to touch CI script code.
Run a job of the TdS-enable pipeline: click 'New Build' and run with the default settings.
https://buildkite.com/solana-labs/tds-enable
This only needs to be run once per CI instance, or after the TdS cluster has been disabled with the TdS-delete-disable pipeline.
To create a new cluster, use the TdS-create-and-start buildkite job:
https://buildkite.com/solana-labs/tds-create-and-start
The pipeline will pull the tip of the v0.16 branch of code for scripts and binaries by default. All of the default configuration settings for the cluster can be found in the pipeline settings:
https://buildkite.com/solana-labs/tds-create-and-start/settings
Any of the above values can be overridden for a particular build by using [key]=[value] syntax (do not use double quotes for the value here) under Environment Variables when you click 'New Build'.
TDS_ZONES, TDS_NODE_COUNT and TDS_CLIENT_COUNT must have a valid value if the default is not used. Any other variable may be set to skip to disable the given configuration, in which case the system default behavior is used.
NOTE: Using the STAKE_INTERNAL_NODES setting (including the default value in this pipeline) disables airdrops for the cluster.
Example: Enter the following in the Environment Variables box for a New Build to have only CPU-only validator nodes in a single region, with no clients and no GPUs.
TDS_ZONES=us-west1-a
TDS_CLIENT_COUNT=0
ENABLE_GPU=skip
Example: If you want to run the cluster on the tip of master instead of the v0.16 branch, in the New Build window, set the Branch field to master and add the following to the Environment Variables:
TESTNET_TAG=edge
Example: To enable airdrops, add the following in Environment Variables:
STAKE_INTERNAL_NODES=skip
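Before pasting overrides into the Environment Variables box, it can help to sanity-check them locally. The following is a minimal sketch and not part of the repository; the function name check_tds_env is hypothetical. It assumes one KEY=VALUE pair per line and enforces only the rules stated above: no double quotes, and no empty or skip value for TDS_ZONES, TDS_NODE_COUNT or TDS_CLIENT_COUNT.

```shell
# Hypothetical helper (not in the repository): validate Environment
# Variables lines read from stdin. Prints one message per problem and
# returns non-zero if any problem was found.
check_tds_env() {
  local line key value status=0
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue            # ignore blank lines
    case "$line" in
      *\"*) echo "double quotes not allowed: $line"; status=1; continue ;;
    esac
    case "$line" in
      *=*) ;;
      *) echo "not in KEY=VALUE form: $line"; status=1; continue ;;
    esac
    key=${line%%=*}                       # text before the first '='
    value=${line#*=}                      # text after the first '='
    case "$key" in
      TDS_ZONES|TDS_NODE_COUNT|TDS_CLIENT_COUNT)
        # These three may not be skipped or left empty.
        if [ -z "$value" ] || [ "$value" = "skip" ]; then
          echo "$key needs a valid value, got '$value'"
          status=1
        fi
        ;;
    esac
  done
  return "$status"
}
```

For example, printf '%s\n' 'TDS_ZONES=us-west1-a' 'TDS_CLIENT_COUNT=0' 'ENABLE_GPU=skip' | check_tds_env prints nothing and succeeds. The check does not know which values Buildkite or the CI scripts actually accept, so a clean result only means the syntax rules were followed.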
To restart the binary software on the nodes without deleting and re-creating the instances, use the TdS-restart pipeline:
https://buildkite.com/solana-labs/tds-restart
The number of nodes, GPUs, and the zones cannot be changed from this pipeline but new settings for the genesis block can be provided. The following settings can be changed when restarting the cluster/ledger:
HASHES_PER_TICK: "auto"
STAKE_INTERNAL_NODES: "1000000000000"
EXTERNAL_ACCOUNTS_FILE_URL: "https://raw.githubusercontent.com/solana-labs/tour-de-sol/master/stage1/validator.yml"
LAMPORTS: "8589934592000000000"
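The settings above are shown in KEY: "value" form, while the Environment Variables box expects KEY=VALUE without double quotes. A small conversion sketch follows; the function name settings_to_env is hypothetical, and it assumes each setting sits on its own line in exactly the quoted form shown above.

```shell
# Hypothetical helper (not in the repository): convert lines shaped like
#   KEY: "value"
# into
#   KEY=value
# Lines that do not match that shape are dropped.
settings_to_env() {
  sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\):[[:space:]]*"\(.*\)"[[:space:]]*$/\1=\2/p'
}
```

For example, echo 'HASHES_PER_TICK: "auto"' | settings_to_env prints HASHES_PER_TICK=auto, which can be pasted into the Environment Variables box as-is.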
As new participants are registered for a given stage, their keybase username should be added to one of the keybase-username files, one keybase username per line:
- validators/keybase-usernames.internal - Solana internal
- validators/keybase-usernames.us - US-based validators
- validators/keybase-usernames.earth - earth-based validators, excluding the US
Then, prior to the start of the stage, run ./import-keybase-usernames.sh to import all public keys each validator has published, and commit the modifications to validators/*.yml
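Since the username files hold one keybase username per line, a quick duplicate check before running the import may catch copy-paste mistakes. This is only a sketch; check_usernames is a hypothetical name and is not part of import-keybase-usernames.sh.

```shell
# Hypothetical helper (not in the repository): report usernames that
# appear more than once in a keybase-usernames file.
check_usernames() {
  local file=$1 dups
  # Drop blank lines, then list every username that occurs more than once.
  dups=$(grep -v '^[[:space:]]*$' "$file" | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate usernames in $file:"
    echo "$dups"
    return 1
  fi
  echo "$file: OK"
}
```

It could be run as check_usernames validators/keybase-usernames.us for each of the three files. It only detects exact duplicates within a single file, not the same username listed in two different files.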
$ ./bench-tps.sh
Fetching the TdS cluster configuration can be accomplished with:
$ export CLOUDSDK_CORE_PROJECT=tour-de-sol
$ net/gce.sh config -p tds-solana-com -z us-west1-a -z us-central1-a -z europe-west4-a
at which point all the normal net/ functionality becomes available (such as net/ssh.sh). net/net.sh logs can also be used to collect logs off the nodes.
Work in progress
The following steps can be used to perform a ledger rollback if needed:
- Identify the desired slot height to roll back to
- Announce to all participants that a rollback is occurring, and request that everybody shut down their validators
- Stop the Solana TdS nodes:
./net stop
- On the tds.solana.com bootstrap-leader node, run the following steps to generate a rollback list
$ solana-ledger-tool --ledger ${path_to_ledger} list-roots --max-height ${rollback_slot_height} --slot-list ./rollback.txt
$ solana-ledger-tool --ledger ${path_to_ledger} prune --slot-list rollback.txt
# The output should look something like this
Prune at slot 5000 hash "HRQnaDnSoaeM5xQKxjKYbU53ZFhTYtjBS7HWyG3Q1JUq"
- Bring the Solana TdS nodes back up with
./net start --no-deploy --no-snapshot --skip-ledger-verify -r
- Announce to all participants that the rollback has been completed; they should now delete their ledger and restart their validator from a new snapshot
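If the slot and hash reported by the prune step need to be recorded or shared, they can be pulled out of the output line. The sketch below assumes the line has exactly the shape of the sample output shown in the steps above; the real tool output may differ, and parse_prune_line is a hypothetical name.

```shell
# Hypothetical helper (not in the repository): read prune output on stdin
# and print "<slot> <hash>" for a line shaped like
#   Prune at slot 5000 hash "HRQnaDnSoaeM5xQKxjKYbU53ZFhTYtjBS7HWyG3Q1JUq"
# Lines that do not match are ignored.
parse_prune_line() {
  sed -n 's/^Prune at slot \([0-9][0-9]*\) hash "\([^"]*\)"$/\1 \2/p'
}
```

Fed the sample line, this prints 5000 HRQnaDnSoaeM5xQKxjKYbU53ZFhTYtjBS7HWyG3Q1JUq. An empty result means the output did not match the assumed format and should be read by hand.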
- Fetch the TdS cluster configuration
- Set bash vars for the network
$ eval $(net/gce.sh info --eval)
- Snag the faucet keypair from the bootstrap leader
$ net/scp.sh solana@"$NET_VALIDATOR0_IP":solana/config/faucet-keypair.json .
- Optionally set Slack and Discord webhook env vars to be notified of progress
export SLACK_WEBHOOK=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
export DISCORD_WEBHOOK=https://discordapp.com/api/webhooks/<ID>/<TOKEN>
- Wait for all validators to connect to the cluster
- Run destake-net-nodes.sh to remove the large initial stake from the Solana TdS nodes
- Start the ramp-up TPS tool
$ cargo run -p solana-ramp-tps -- -n $NET_VALIDATOR0_IP \
--net-dir <solana/net> \
--initial-balance 1 \
--round-minutes 20 \
--tx-count-baseline 1000 \
--tx-count-increment 2000 \
--stake-activation-epoch 9 \
--faucet-keypair-path <faucet_keypair.json>
If the tool fails, it may be possible to recover and pick up where it left off. The only unsupported scenario is when the tool fails in the middle of awarding stake to the surviving validators.
- If the tool failed during bench-tps, recovery is simple: start the tool at the round number that failed.
- If the tool failed during stake warmup, specify both the TPS round number and the epoch when the stake started activating (stake-activation-epoch).
$ cargo run -p solana-ramp-tps -- -n $NET_VALIDATOR0_IP \
--net-dir <solana/net> \
--initial-balance 1 \
--round <START ROUND> \
--round-minutes 15 \
--tx-count-baseline 5000 \
--tx-count-increment 5000 \
--stake-activation-epoch <LAST STAKE ACTIVATION EPOCH> \
--faucet-keypair-path <faucet_keypair.json>
The ramp-up tool follows this process:
- Download the genesis block
- Wait for warm up epochs to pass
- Start ramp up cycle
- Wait for validator stakes to warm up
- Run solana-bench-tps on clients
- Sleep until the round is finished
- Stop solana-bench-tps
- Fetch top performing validators
- Gift stake to the top validators
- Double gift and increment TPS
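To get a feel for how the load might grow across rounds, the sketch below prints a target transaction count per round. It rests on an assumption that this document does not confirm: that round 1 uses --tx-count-baseline and each later round adds --tx-count-increment once. The function name tx_count_schedule is hypothetical and the code is not taken from solana-ramp-tps.

```shell
# Hypothetical helper (not in the repository): print an assumed target
# transaction count for each round.
# ASSUMPTION: round 1 uses the baseline and every later round adds the
# increment once; check the tool's source before relying on this.
# Usage: tx_count_schedule <baseline> <increment> <rounds>
tx_count_schedule() {
  local baseline=$1 increment=$2 rounds=$3 round=1
  while [ "$round" -le "$rounds" ]; do
    echo "round $round: $((baseline + (round - 1) * increment))"
    round=$((round + 1))
  done
}
```

With the values from the first command above, tx_count_schedule 1000 2000 3 prints 1000, 3000 and 5000 for rounds 1 to 3.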