Failing over and migrating validator nodes for less downtime
Why we need to migrate
Even setting aside the service provider's policy, we sometimes need to migrate our validator nodes for:
- Validator node failover, e.g. hardware failure or network outage
- Necessary maintenance reboots, e.g. applying kernel security fixes
- Scaling the node, e.g. larger storage or more CPU to meet new hardware requirements
The most feasible method
As community node operators, we run a validator node and a VFN (validator full node).
The data on the two nodes is almost identical: the VFN is only missing the consensus_db and secure-data.json, and that causes no issue when converting it into a validator node.
The file structure of the Docker volumes is as follows:
ubuntu@validator /var/lib/docker/volumes # tree -L 4
.
├── aptos-validator
│   └── _data
│       ├── db
│       │   ├── consensus_db
│       │   ├── ledger_db
│       │   ├── state_merkle_db
│       │   └── state_sync_db
│       └── secure-data.json
├── backingFsBlockDev
└── metadata.db
ubuntu@VFN /var/lib/docker/volumes # tree -L 4
.
├── aptos-fullnode
│   └── _data
│       └── db
│           ├── ledger_db
│           ├── state_merkle_db
│           └── state_sync_db
├── backingFsBlockDev
└── metadata.db
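The difference between the two layouts can be checked mechanically. A small local sketch (recreating the two trees above as empty placeholders, not touching the real volumes) confirms that only consensus_db and secure-data.json are validator-specific:

```shell
# Recreate the two volume layouts above as empty placeholders.
mkdir -p validator/db/consensus_db validator/db/ledger_db \
         validator/db/state_merkle_db validator/db/state_sync_db
touch validator/secure-data.json
mkdir -p vfn/db/ledger_db vfn/db/state_merkle_db vfn/db/state_sync_db
# Entries prefixed with "<" exist only on the validator side.
diff <(cd validator && find . | sort) <(cd vfn && find . | sort) || true
```

Only `./db/consensus_db` and `./secure-data.json` show up as validator-only entries.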
To achieve a fast migration, we naturally have to put the validator node's files on the VFN and vice versa. The configuration files on my validator node and VFN are shown below:
ubuntu@validator ~/mainnet-vn $ tree
.
├── blocked.ips
├── docker-compose-fullnode.yaml  # extra file
├── docker-compose.yaml
├── fullnode.yaml  # extra file
├── genesis.blob
├── haproxy.cfg
├── haproxy-fullnode.cfg  # extra file
├── keys
│   ├── validator-full-node-identity.yaml  # extra file
│   └── validator-identity.yaml
├── validator.yaml
└── waypoint.txt
ubuntu@VFN ~/mainnet-fn $ tree
.
├── blocked.ips
├── docker-compose-validator.yaml  # extra file
├── docker-compose.yaml
├── fullnode.yaml
├── genesis.blob
├── haproxy.cfg  # extra file
├── haproxy-fullnode.cfg
├── keys
│   ├── validator-full-node-identity.yaml
│   └── validator-identity.yaml  # extra file
├── validator.yaml  # extra file
└── waypoint.txt
The docker-compose-validator.yaml file can be copied from the validator node and then modified, changing the volume name from aptos-validator to aptos-fullnode in three places (and vice versa for docker-compose-fullnode.yaml):
    volumes:
      - type: volume
        source: aptos-fullnode
        target: /opt/aptos/data

volumes:
  aptos-fullnode:
    name: aptos-fullnode
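The three occurrences can be renamed in one pass with sed. A sketch, assuming the compose file has already been copied over from the validator node (the inline sample here stands in for the real file):

```shell
# Sample standing in for docker-compose.yaml copied from the validator node.
cat > docker-compose-validator.yaml <<'EOF'
services:
  validator:
    volumes:
      - type: volume
        source: aptos-validator
        target: /opt/aptos/data
volumes:
  aptos-validator:
    name: aptos-validator
EOF
# Rename the volume in all three places: the service mount's source,
# the top-level volumes key, and its explicit name.
sed -i 's/aptos-validator/aptos-fullnode/g' docker-compose-validator.yaml
grep -c aptos-fullnode docker-compose-validator.yaml  # prints 3
```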
Once the files are ready, we can update the DNS record; operators using IP addresses can run
aptos node update-validator-network-addresses shortly before the epoch changes. Then open two terminals and SSH into the two nodes separately.
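For IP-based operators, the on-chain update looks roughly like this. This is a sketch based on the Aptos staking pool operation docs; the pool address, operator config path, and host values are placeholders, and the flag names should be double-checked against `aptos node update-validator-network-addresses --help` for your CLI version:

```shell
# Placeholders: <pool-address>, the operator config path, and the new hosts.
# 6180/6182 are the conventional validator / VFN network ports.
aptos node update-validator-network-addresses \
  --pool-address <pool-address> \
  --operator-config-file operator.yaml \
  --validator-host <new-validator-ip>:6180 \
  --full-node-host <new-vfn-ip>:6182
```

The new addresses only take effect at the next epoch, which is why the command should be run before the epoch change.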
Simply stop the validator node with
docker compose down, then stop the VFN and bring it back up with the validator node yaml. That's it; just as fast as updating the Docker image.
docker compose down && mv docker-compose.yaml docker-compose-fullnode.yaml && mv docker-compose-validator.yaml docker-compose.yaml && docker compose up -d
Even if the new DNS record has not taken effect yet, outbound connections still let the node participate in consensus and proposals. Alternatively, you can use the following command, but I prefer the longer form above to avoid forgetting the -f flag in subsequent operations:
docker compose down && docker compose -f docker-compose-validator.yaml up -d
For community nodes, we can even open an extra terminal to watch the proposal count and migrate immediately after a new proposal completes, since the number of proposals per epoch is small. If the interval between two proposals is only a few seconds, you are out of luck.
watch -n 1 'curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_consensus_proposals_count'
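The metrics endpoint serves Prometheus text format, i.e. `name value` lines. A small awk filter (run here on made-up sample input rather than a live node) pulls out just the counter value:

```shell
# Sample metrics output; on a live node this would come from
# curl 127.0.0.1:9101/metrics instead of printf. Values are made up.
printf 'aptos_consensus_proposals_count 7\naptos_connections 42\n' \
  | awk '$1 == "aptos_consensus_proposals_count" {print $2}'  # prints 7
```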
Creating a new full node instead of using the current VFN
Of course, creating a new full node and converting it into a validator node is also an option. You can simply copy all the files from the VFN; just remember to modify the validator-full-node-identity.yaml file, and this will work.
Jing: Hi, for folks running more than one vfn, please make sure each vfn has a unique identity or it causes issues with our telemetry. If it’s hard for you to set things up that way, please turn off the additional vfns until you are able to sort it out
And since we don’t need to register this node on-chain, we can just change the last few hex digits of account_address and network_private_key to random values.
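For example, the last four hex digits can be randomized with openssl. A sketch; the address shown is a made-up value standing in for the real line in validator-full-node-identity.yaml:

```shell
# Made-up 64-hex-char account_address line, not a real identity.
line='account_address: 4c97f1f6b85b3aa4e9a76ac31ab0196f5c4e1a18b1a16e0f077cdfa4a2f1d9b1'
# Replace the last 4 hex digits with random ones; the length stays the same.
echo "$line" | sed -E "s/.{4}$/$(openssl rand -hex 2)/"
```

The same substitution works for the network_private_key line.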