Failover and migrate Validator Nodes for less downtime
Why we need to migrate
Even when the service provider's policy does not force it, we need to migrate our validator nodes at some point for one of these reasons:
- Validator node failover, e.g. hardware failure or network outage
- Necessary maintenance reboot, e.g. applying security fixes for the kernel
- Scaling the node, e.g. larger storage or more CPU to meet new hardware requirements
The most feasible method
As community node operators, we run a validator node and a VFN (validator full node).
The data on the two nodes is almost the same: the VFN lacks consensus_db and secure-data.json, but this causes no issue when converting it into a validator node.
The file structure of the docker volumes is as follows:
ubuntu@validator /var/lib/docker/volumes # tree -L 4
.
├── aptos-validator
│   └── _data
│       ├── db
│       │   ├── consensus_db
│       │   ├── ledger_db
│       │   ├── state_merkle_db
│       │   └── state_sync_db
│       └── secure-data.json
├── backingFsBlockDev
└── metadata.db
ubuntu@VFN /var/lib/docker/volumes # tree -L 4
.
├── aptos-fullnode
│   └── _data
│       └── db
│           ├── ledger_db
│           ├── state_merkle_db
│           └── state_sync_db
├── backingFsBlockDev
└── metadata.db
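Before converting a VFN, it can help to confirm which validator-only entries a data directory is missing. The following is a minimal sketch based only on the two listings above; the function name missing_validator_entries is hypothetical, and the path in the usage note assumes the default docker volume location shown in the listings.

```shell
# Sketch: report which validator-only entries are absent from a node data directory.
# The expected layout (db/consensus_db, secure-data.json) is taken from the tree listings.
# missing_validator_entries is a hypothetical helper name, not part of any Aptos tooling.
missing_validator_entries() {
    local data_dir="$1"
    # consensus_db lives under db/, secure-data.json sits next to db/.
    [ -d "$data_dir/db/consensus_db" ] || echo "db/consensus_db"
    [ -e "$data_dir/secure-data.json" ] || echo "secure-data.json"
    return 0
}
```

For example, running missing_validator_entries /var/lib/docker/volumes/aptos-fullnode/_data as root on the VFN should print both entries, while the same check against the validator's volume should print nothing.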
Practical Implementation
To migrate quickly, we keep the validator node's files on the VFN and vice versa. The configuration files on my validator node and VFN are shown below; the files marked #extra file are the ones that belong to the other role.
ubuntu@validator ~/mainnet-vn $ tree
.
├── blocked.ips
├── docker-compose-fullnode.yaml #extra file
├── docker-compose.yaml
├── fullnode.yaml #extra file
├── genesis.blob
├── haproxy.cfg
├── haproxy-fullnode.cfg #extra file
├── keys
│   ├── validator-full-node-identity.yaml #extra file
│   └── validator-identity.yaml
├── validator.yaml
└── waypoint.txt
ubuntu@VFN ~/mainnet-fn $ tree
.
├── blocked.ips
├── docker-compose-validator.yaml #extra file
├── docker-compose.yaml
├── fullnode.yaml
├── genesis.blob
├── haproxy.cfg #extra file
├── haproxy-fullnode.cfg
├── keys
│   ├── validator-full-node-identity.yaml
│   └── validator-identity.yaml #extra file
├── validator.yaml #extra file
└── waypoint.txt
The docker-compose-validator.yaml file can be copied from the validator node; then change the volume name from aptos-validator to aptos-fullnode in 3 places. Do the reverse for docker-compose-fullnode.yaml on the validator node.
    volumes:
      - type: volume
        source: aptos-fullnode
        target: /opt/aptos/data
volumes:
  aptos-fullnode:
    name: aptos-fullnode
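The volume rename can also be scripted rather than edited by hand. This is a rough sketch, assuming the volume name appears only in the places shown above; rename_compose_volume is a hypothetical helper, and the printed count is only a sanity check against the "3 places" mentioned in the text.

```shell
# Sketch: write a copy of a compose file with one volume name replaced by another.
# rename_compose_volume is a hypothetical helper; it never modifies the source file.
rename_compose_volume() {
    local src="$1" dst="$2" from="$3" to="$4"
    # Replace every occurrence of the old volume name, writing to a new file.
    sed "s/${from}/${to}/g" "$src" > "$dst"
    # Print how many occurrences of the new name the result contains.
    grep -o -- "$to" "$dst" | wc -l | tr -d ' '
}
```

On the VFN, with the validator's compose file copied over as validator-original.yaml (an illustrative name), rename_compose_volume validator-original.yaml docker-compose-validator.yaml aptos-validator aptos-fullnode should print 3 if the file matches the layout described here. Any other number is a reason to inspect the result by hand.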
Once the files are ready, we can update the DNS record to point at the new host. People who use IP addresses instead can run aptos node update-validator-network-addresses shortly before the epoch changes. Then open two terminals and ssh to the two nodes separately.
Stop the validator node with docker compose down, then on the VFN host stop the VFN and start docker with the validator compose file. That is all; it is just as fast as updating the docker image.
docker compose down && mv docker-compose.yaml docker-compose-fullnode.yaml && mv docker-compose-validator.yaml docker-compose.yaml && docker compose up -d
Even if the new DNS record has not taken effect yet, the node can still take part in consensus and proposals through its outbound connections. Alternatively, you can use the following command, but I prefer the longer one above so that I don't forget the -f in subsequent operations.
docker compose down && docker compose -f docker-compose-validator.yaml up -d
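The rename step of the longer one-liner can be wrapped with a guard, so that a missing file cannot leave the directory without a docker-compose.yaml. This is only a sketch; swap_compose_files is a hypothetical helper that does the two mv steps and nothing else, and it does not call docker itself.

```shell
# Sketch: swap the active compose file with the one for the other role.
# swap_compose_files is a hypothetical helper; run it from the config directory.
swap_compose_files() {
    local active="docker-compose.yaml"
    local incoming="$1"   # e.g. docker-compose-validator.yaml
    local parked="$2"     # e.g. docker-compose-fullnode.yaml
    # Refuse to start if either file involved in the swap is missing.
    if [ ! -f "$active" ] || [ ! -f "$incoming" ]; then
        echo "missing $active or $incoming" >&2
        return 1
    fi
    # Refuse to overwrite a previously parked compose file.
    if [ -e "$parked" ]; then
        echo "$parked already exists, refusing to overwrite" >&2
        return 1
    fi
    mv "$active" "$parked" && mv "$incoming" "$active"
}
```

On the VFN host it would be used as docker compose down && swap_compose_files docker-compose-validator.yaml docker-compose-fullnode.yaml && docker compose up -d. Swapping back later uses the same call with the two file names exchanged.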
For community nodes, we can even open an extra terminal to watch the proposal count and migrate immediately after the node completes a proposal, since a community node makes only a few proposals in each epoch. If the interval between two proposals happens to be only a few seconds, you are out of luck.
watch -n 1 "curl 127.0.0.1:9101/metrics 2> /dev/null | grep 'aptos_consensus_proposals_count'"
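Instead of watching the counter by eye, a small loop can block until it changes. The sketch below assumes the metric is exposed in the usual Prometheus text form, either "name value" or "name{labels} value"; that format is an assumption, not something confirmed here. Both function names are hypothetical, and the endpoint is the one used in the watch command.

```shell
# Sketch: extract the proposal counter from Prometheus-style metrics text on stdin.
# Assumes a sample line of the form "<name> <value>" or "<name>{labels} <value>".
proposal_count() {
    awk -v name="aptos_consensus_proposals_count" '
        $1 == name || index($1, name "{") == 1 { print $NF; exit }
    '
}

# Block until the counter differs from the value it had when the function was called.
# The default URL is the metrics endpoint from the watch command above.
wait_for_next_proposal() {
    local url="${1:-127.0.0.1:9101/metrics}" start now
    start=$(curl -s "$url" | proposal_count)
    # Give up early if the metric could not be read at all.
    [ -n "$start" ] || { echo "metric not found at $url" >&2; return 1; }
    while :; do
        now=$(curl -s "$url" | proposal_count)
        [ -n "$now" ] && [ "$now" != "$start" ] && break
        sleep 1
    done
    echo "proposals: $start -> $now"
}
```

Running wait_for_next_proposal && docker compose down on the validator would stop the node right after its next proposal is counted, which is the timing described above.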
Creating a new full node instead of using the current VFN
Of course, creating a new fullnode and converting it to a validator node is also an option. You can simply copy all the files from the VFN, as long as you remember to modify the validator-full-node-identity.yaml file.
Jing: Hi, for folks running more than one vfn, please make sure each vfn has a unique identity or it causes issues with our telemetry. If it's hard for you to set things up that way, please turn off the additional vfns until you are able to sort it out
And since we don't need to register this node on-chain, we can just change the last few characters of account_address and network_private_key to random hex digits.
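One way to produce the changed values is to replace the tail of the existing hex string with digits read from /dev/urandom. This is a minimal sketch; randomize_hex_tail is a hypothetical helper, the default of 8 characters is an arbitrary choice, and the result still has to be pasted into validator-full-node-identity.yaml by hand.

```shell
# Sketch: replace the last N characters of a hex string with random hex digits.
# randomize_hex_tail is a hypothetical helper; N defaults to 8 (an arbitrary choice).
randomize_hex_tail() {
    local value="$1" n="${2:-8}" keep rand
    keep=$(( ${#value} - n ))
    # Require at least one original character to remain in front of the random tail.
    [ "$keep" -ge 1 ] || { echo "value too short" >&2; return 1; }
    # Read N random bytes, render them as lowercase hex, keep the first N digits.
    rand=$(od -An -N"$n" -tx1 /dev/urandom | tr -d ' \n' | cut -c1-"$n")
    printf '%s%s\n' "$(printf '%s' "$value" | cut -c1-"$keep")" "$rand"
}
```

Calling randomize_hex_tail with the current account_address, and again with the current network_private_key, gives values of the same length that differ only in their last 8 characters, including any leading 0x unchanged.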