Failover and migrate Validator Nodes for less downtime

Why we need to migrate

Even when the service provider’s policy does not force it, we sometimes need to migrate our validator nodes, for example for:

  1. Validator node failover, e.g. hardware failure or network outage

  2. Necessary maintenance reboots, e.g. applying security fixes for the kernel

  3. Scaling the node, e.g. larger storage or more CPU to meet new hardware requirements

The most feasible method

As community node operators, we run a validator node and a VFN (validator full node).
The data on the two nodes is almost the same. The VFN lacks only consensus_db and secure-data.json, and this causes no issue when converting it into a validator node.
The file structure of the Docker volumes is as follows:

ubuntu@validator /var/lib/docker/volumes # tree -L 4
.
├── aptos-validator
│   └── _data
│       ├── db
│       │   ├── consensus_db
│       │   ├── ledger_db
│       │   ├── state_merkle_db
│       │   └── state_sync_db
│       └── secure-data.json
├── backingFsBlockDev
└── metadata.db
ubuntu@VFN /var/lib/docker/volumes # tree -L 4
.
├── aptos-fullnode
│   └── _data
│       └── db
│           ├── ledger_db
│           ├── state_merkle_db
│           └── state_sync_db
├── backingFsBlockDev
└── metadata.db

Practical Implementation

To migrate quickly, we keep the validator node's files on the VFN and vice versa. The configuration files on my validator node and VFN are shown below; the files marked #extra file are the ones added for the other role.

ubuntu@validator ~/mainnet-vn $ tree
.
├── blocked.ips
├── docker-compose-fullnode.yaml                     #extra file
├── docker-compose.yaml
├── fullnode.yaml                                    #extra file
├── genesis.blob
├── haproxy.cfg
├── haproxy-fullnode.cfg                             #extra file
├── keys
│   ├── validator-full-node-identity.yaml            #extra file
│   └── validator-identity.yaml
├── validator.yaml
└── waypoint.txt
ubuntu@VFN ~/mainnet-fn $ tree
.
├── blocked.ips
├── docker-compose-validator.yaml                    #extra file
├── docker-compose.yaml
├── fullnode.yaml
├── genesis.blob
├── haproxy.cfg                                      #extra file
├── haproxy-fullnode.cfg
├── keys
│   ├── validator-full-node-identity.yaml
│   └── validator-identity.yaml                      #extra file
├── validator.yaml                                   #extra file
└── waypoint.txt

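The docker-compose-validator.yaml file on the VFN is a copy of the validator node's docker-compose.yaml with the volume name changed from aptos-validator to aptos-fullnode in 3 places. The docker-compose-fullnode.yaml file on the validator node is made the same way in the opposite direction.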

    volumes:
      - type: volume
        source: aptos-fullnode
        target: /opt/aptos/data
volumes:
  aptos-fullnode:
    name: aptos-fullnode
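
The three replacements can also be scripted. This is only a minimal sketch: it assumes that the string aptos-validator appears in the compose file only as the volume name, and the function name make_validator_compose is made up for illustration. Compare the result against the original with diff before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: derive a compose file that mounts the local VFN volume
# from a compose file copied over from the validator node.
# Assumption: "aptos-validator" occurs only as the volume name.
make_validator_compose() {
    src="$1"   # compose file copied from the validator node
    dst="$2"   # e.g. docker-compose-validator.yaml on the VFN
    sed 's/aptos-validator/aptos-fullnode/g' "$src" > "$dst"
    # Print how many lines now mention the new volume name (3 for the layout above)
    grep -c 'aptos-fullnode' "$dst"
}
```

For example, make_validator_compose validator-copy.yaml docker-compose-validator.yaml, where validator-copy.yaml is a placeholder name for the file copied from the validator node.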

Once the files are ready, update the DNS record to point to the new node. Operators who registered IP addresses instead can run aptos node update-validator-network-addresses shortly before the epoch changes. Then open two terminals and ssh into each node.

Stop the validator node with docker compose down, then on the VFN stop the full node and bring it back up with the validator compose file using the command below. That is the whole migration, and it is about as fast as updating the Docker image.
docker compose down && mv docker-compose.yaml docker-compose-fullnode.yaml && mv docker-compose-validator.yaml docker-compose.yaml && docker compose up -d

Even if the new DNS record has not taken effect yet, the node can still participate in consensus and make proposals through its outbound connections. Alternatively, you can use the following command, but I prefer the longer one above so that I do not forget the -f flag in subsequent operations.
docker compose down && docker compose -f docker-compose-validator.yaml up -d
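
Because the file renames happen while the node is down, it can help to check them beforehand. The sketch below is not part of the original procedure; the function names ready_to_swap and swap_compose_files are hypothetical, and the file names are the ones from the VFN listing above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the rename can be done safely.
ready_to_swap() {
    dir="$1"
    [ -f "$dir/docker-compose.yaml" ] || { echo "missing docker-compose.yaml" >&2; return 1; }
    [ -f "$dir/docker-compose-validator.yaml" ] || { echo "missing docker-compose-validator.yaml" >&2; return 1; }
    # Do not overwrite a full node compose file saved by an earlier swap
    [ ! -e "$dir/docker-compose-fullnode.yaml" ] || { echo "docker-compose-fullnode.yaml already exists" >&2; return 1; }
}

# Hypothetical helper: make the validator compose file the default one.
swap_compose_files() {
    dir="$1"
    mv "$dir/docker-compose.yaml" "$dir/docker-compose-fullnode.yaml" &&
    mv "$dir/docker-compose-validator.yaml" "$dir/docker-compose.yaml"
}
```

On the VFN this could be used as: ready_to_swap . && docker compose down && swap_compose_files . && docker compose up -d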

For community nodes, the number of proposals in each epoch is small, so we can even open an extra terminal to watch the proposal count and migrate immediately after the node completes a proposal. If the interval between two proposals is only a few seconds, then you are out of luck :joy:

watch -n 1 'curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_consensus_proposals_count"'
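
Instead of watching the counter by eye, a small loop can wait for it to change. This is a rough sketch under two assumptions: the metric is exposed as a plain line of the form "aptos_consensus_proposals_count <value>" without labels, and the endpoint is the 127.0.0.1:9101 one used above. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read metrics text on stdin and print the value of
# aptos_consensus_proposals_count (comment lines starting with # are skipped
# because their first field is "#").
proposals_count() {
    awk '$1 == "aptos_consensus_proposals_count" { print $2; exit }'
}

# Hypothetical helper: return once the counter differs from its first reading.
wait_for_next_proposal() {
    start=$(curl -s 127.0.0.1:9101/metrics | proposals_count)
    while [ "$(curl -s 127.0.0.1:9101/metrics | proposals_count)" = "$start" ]; do
        sleep 1
    done
}
```

Running wait_for_next_proposal on the validator node blocks until the count changes, which is the moment to start the migration.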

Creating a new full node instead of using the current VFN

Of course, creating a new full node and converting it to a validator node is also an option. You can simply copy all the files from the VFN, as long as you remember to modify the validator-full-node-identity.yaml file.

Jing: Hi, for folks running more than one vfn, please make sure each vfn has a unique identity or it causes issues with our telemetry. If it’s hard for you to set things up that way, please turn off the additional vfns until you are able to sort it out

And since we don’t need to register this node on-chain, we can just change the last few characters of account_address and network_private_key to random hex.
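
One way to produce such a value is sketched below. It only follows the approach described above and has not been checked against how the node validates identity files; the function name randomize_tail is hypothetical, and it assumes the number of replaced characters is smaller than the value's length and at most 32.

```shell
#!/bin/sh
# Hypothetical helper: replace the last N characters of a hex value with
# random hex, keeping the total length unchanged.
randomize_tail() {
    value="$1"
    n="$2"     # number of trailing characters to replace
    keep=$(( ${#value} - n ))
    # 16 random bytes give 32 hex characters; keep the first n of them
    rand=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n' | cut -c1-"$n")
    printf '%s%s\n' "$(printf '%s' "$value" | cut -c1-"$keep")" "$rand"
}
```

Call it with the existing value and the number of characters to change, e.g. randomize_tail "$old_value" 6, and paste the output into the copied validator-full-node-identity.yaml.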

You have a valid point mate.:+1:

Thanks mate!

Yes, he made a great and interesting comment

Thanks for the good post
