Blockchain applications (especially when running validator nodes) are atypical compared to "traditional" web server applications because their performance characteristics tend to differ in the ways described below:
Tend to be more disk I/O heavy: Traditional web apps typically offload data storage to persistent stores such as a database. In the case of a blockchain/validator node, the database is on the machine itself, rather than offloaded to a separate machine with a standalone engine. Many blockchains use LevelDB for their local data copies. (In Cosmos SDK apps, such as cheqd, this is the Golang implementation of LevelDB, but it can also be the C implementation of LevelDB, RocksDB, etc.) The net result is the same as if you were trying to run a database engine on a machine: the system needs to have fast read/write performance characteristics.
Validator nodes cannot easily be auto-scaled: Many traditional applications can be scaled horizontally (i.e., add more machines) or vertically (i.e., make the current machine beefier). While this is possible for validator nodes, it must be done with extreme caution to ensure there aren't two instances of the same validator active simultaneously. This can be perceived by network consensus as a sign of compromised validator keys and lead to the node being jailed for double-signing blocks. These concerns are less relevant for non-validating nodes, since they have a greater tolerance for missed blocks and can be scaled horizontally/vertically.
Docker/Kubernetes setups are not recommended for validators (unless you really know what you're doing): Primarily due to the double-signing risk, it's [not recommended to run validators using Docker](../setup-and-configure/docker.md) unless you have a strong DevOps practice. The other reason is related to the first point: a Docker setup adds an abstraction layer between the actual underlying file storage and the Docker volume engine. Depending on the Docker (or similar abstraction) storage drivers used, you may need to tune the storage/volume engine options for optimal performance.
⚠️ Please ensure you are running the latest stable release of cheqd-node, since newer releases may contain fixes/patches that improve node performance.
If you've got monitoring built in for your machine, a memory (RAM) leak would look like a graph where memory usage grows to 100%, falls off a cliff, grows to 100% again (the process repeats itself).
Normal memory usage may grow over time, but will not max out the available memory up to 100%. The graph below is taken from a server run by the cheqd team, over a 14-day period:
Figure 1: Graph showing normal memory usage on a cheqd-node server
A "CPU leak", i.e., where one or more process(es) consume increasing amounts of CPU is rarer, but could also happen if your machine has too few vCPUs and/or underpowered CPUs.
Figure 2: Graph showing normal CPU usage on a cheqd-node server
There's a catch here: depending on your monitoring tool, "100% CPU" could be measured differently! The graph above is from DigitalOcean's monitoring tools, which count the sum of all CPU capacity as "100%".
Other monitoring tools, such as Hetzner Cloud's, count each CPU as "100%", thus making the overall figure displayed in the graph (shown below) add up to number of CPUs x 100%.
Figure 4: Graph showing CPU usage on Hetzner cloud, adding up to more than 100%
Check what accounting metric your monitoring tool uses to get a realistic idea of whether your CPU is overloaded or not.
Load average is another useful measure of the responsiveness of a machine, regardless of the CPU usage.
If you don't have a monitoring application installed, you could use the built-in `top` or `htop` command.
Figure 2: Output of `htop` showing CPU and memory usage
`htop` is visually easier to understand than `top`, since it breaks down usage per CPU, as well as memory usage.
Unfortunately, this only provides real-time usage, rather than historical usage over time. Historical usage typically requires an external application, which many cloud providers offer, or a third-party monitoring tool such as Datadog.
Tendermint / Cosmos SDK also provides a Prometheus metrics interface, in case you already have a Prometheus instance or are comfortable using the software. This allows alerting based on actual metrics emitted by the node, rather than just top-level system metrics, which are a blunt instrument and don't go into detail.
If your system clock is out of synchronisation, this could cause Tendermint peer-to-peer connections to be rejected. This is similar to how SSL/TLS connections can get rejected with a "handshake error" in a normal browser when accessing secure (HTTPS) sites.
The net result of your system clock being out of sync is that your node:
Constantly tries to dial peers to try and fetch new blocks
Connection gets rejected by some/all of them
Keeps retrying the above until CPU/memory get exhausted, or the node process crashes
To check if your system clock is synchronised, use the following command (note: only copy the command, not the sample output):
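On systemd-based distributions such as Ubuntu, the `timedatectl` command reports this information:

```bash
timedatectl
```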
The timezone your machine is based in doesn't matter. You should check whether it reports `System clock synchronized: yes` and `NTP service: active`.
If either of these is not true, chances are that your system clock has fallen out of sync, and this may be the root cause of CPU/memory leaks. Follow this guide on setting up time synchronisation in Ubuntu to resolve the issue, and then monitor whether it fixes the high utilisation.
You may also need to allow outbound UDP traffic on port 123 explicitly, depending on your firewall settings. This port is used by the Network Time Protocol (NTP) service.
Properly-configured nodes should have bidirectional connectivity for network traffic. To check whether this is the case, open `<node-ip-address-or-dns-name>:<rpc-port>/net_info` in your browser, for example, rpc.cheqd.net/net_info.
Accessing this endpoint via your browser would only work if traffic to your RPC port is allowed through your firewall and/or you're accessing from an allowed origin. If this is not the case, you can also view the results for this endpoint from the same machine where your node service is running through the command line:
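For example, assuming the default Tendermint RPC port of 26657 (adjust if you've customised it), you could query the endpoint locally with curl:

```bash
curl -s http://localhost:26657/net_info
```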
The JSON output should be similar to below:
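An abridged, illustrative sketch of the structure (actual values and fields will differ):

```json
{
  "result": {
    "n_peers": "23",
    "peers": [
      {
        "node_info": { "moniker": "some-peer" },
        "is_outbound": true
      }
    ]
  }
}
```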
Look for the `n_peers` value at the beginning: this shows the number of peers your node is connected to. A healthy node would typically be connected to anywhere between 5-50 peers.
Next, search the results for the term `is_outbound`. The number of matches for this term should be exactly the same as the value of `n_peers`, since this is printed once per peer. The value of `is_outbound` may either be `true` or `false`.
A healthy node should have a mix of `is_outbound: true` as well as `is_outbound: false`. If your node reports only one of these values, it's a strong indication that your node is unidirectionally connected/reachable, rather than bidirectionally reachable.
Unidirectional connectivity may cause your node to work overtime to stay synchronised with the latest blocks on the network. You may fly by just fine, until there's a loss of connectivity to a critical mass of peers and your node goes offline.
Furthermore, your node might fetch the address book from seed nodes, and then try to resolve/contact them (and fail) due to connectivity issues.
Ideally, the IP address or DNS name set in the `external_address` property in your `config.toml` file should be externally reachable.
To determine whether this is true, install `tcptraceroute` on a machine other than your node. Unlike `ping`, which uses ICMP packets, `tcptraceroute` uses TCP, i.e., the actual protocol used for Tendermint P2P, to see if the destination is reachable. Success or failure in connectivity using `ping` doesn't prove whether your node is reachable, since firewalls along the path may have different rules for ICMP vs TCP.
Once you have `tcptraceroute` installed, from this external machine you can execute the following command in `tcptraceroute <hostname> <port>` format (note: only copy the actual command, not sample output):
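For example (hypothetical hostname; substitute your node's DNS name or IP address and P2P port):

```bash
tcptraceroute your-node.example.com 26656
```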
A successful run would result in `tcptraceroute` reaching the destination server on the required port (e.g., 26656) and then hanging up. If the connection times out consistently at any of the hops, this could indicate there's a firewall or router in the path dropping or blocking connections.
Your firewall rules on the machine and/or infrastructure (cloud) provider could cause connectivity issues. Ideally, your firewall rules should allow:
Inbound TCP traffic on at least port 26656 (or custom P2P port)
Optionally, inbound TCP traffic on other ports (RPC, gRPC, gRPC Web)
Outbound TCP traffic on all ports
Besides firewalls, depending on your network infrastructure, your connectivity issue instead might lie in a router or Network Address Translation (NAT) gateway.
Outbound TCP traffic is the default mode on many systems, since the port through which traffic gets routed out is dynamically determined during TCP connection establishment. In some cases, e.g., when using a NAT gateway in AWS, you may require more complex configuration (outside the scope of this document).
In addition to infrastructure-level firewalls, Ubuntu machines also come with a firewall on the machine itself. Typically, this is either disabled or set to allow all traffic by default.
Configuring OS-level firewalls is outside the scope of this document, but they can generally be checked/configured using the `ufw` utility:
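For example, to check whether `ufw` is active:

```bash
sudo ufw status
```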
If `ufw status` reports active, follow this guide on configuring firewall rules using `ufw` to allow traffic on the required ports (customise to the ports you need).
Another common reason for unidirectional node connectivity occurs when the correct P2P inbound/outbound traffic is allowed in firewalls, but DNS traffic is blocked by a firewall.
Your node needs the ability to perform DNS lookups to resolve peers that use a DNS name in their `external_address` property to IP addresses, since other peers may advertise their addresses as a DNS name. Seed nodes set in `config.toml` are a common example of this, since these are advertised as DNS names.
Your node may still scrape by if DNS resolution is blocked, for example, by obtaining an address book from a peer that has already done the DNS -> IP resolution. However, this approach is liable to break down if the resolution is incorrect or the entries are outdated.
To enable DNS lookups, your infrastructure/OS-level firewalls should allow:
Outbound UDP traffic on port 53: This is the most commonly-used port/protocol.
Outbound TCP traffic on port 853 (explicit rule not needed if you already allow TCP outbound on all ports): Modern DNS servers also allow DNS-over-TLS, which secures the connection using TLS to the DNS server. This can prevent malicious DNS servers from intercepting queries and giving spurious responses.
Outbound TCP traffic on port 443 (explicit rule not needed if you already allow TCP outbound on all ports): Similar to above, this enables DNS-over-HTTPS, if supported by your DNS resolver.
To check that DNS resolution works, try to run a DNS query and see if it returns a response. The following command will use the `dig` utility to look up and report your node's externally resolvable IP address via Cloudflare's 1.1.1.1 DNS resolver (note: only copy the command, not the sample output):
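One common approach (an assumption, since your exact command may differ) is to query Cloudflare's `whoami.cloudflare` TXT record, which returns the public IP address the query originated from:

```bash
dig +short txt ch whoami.cloudflare @1.1.1.1
```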
If the lookup fails, that could indicate DNS queries are blocked, or that there is no externally-resolvable IP where the node can be reached.
If your machine is provisioned with the bare minimum of CPU and RAM, you might find that the node struggles during times of high load, or slowly degrades over time. The minimum figures are recommended for a developer setup, rather than a production-grade node.
Typically, this problem is seen if you (non-exhaustive list):
Have only one CPU (bump to at least two CPUs)
Have only 1-2 GB of RAM (bump to at least 4 GB)
Most cloud providers should allow dynamically scaling these two factors without downtime. Monitor - especially over a period of days/weeks - whether this improves the situation or not. If the CPU/memory load behaviour remains similar, that likely indicates the issue is different.
Scaling CPU/memory without downtime may be different if you're running a physical machine, or if your cloud provider doesn't support it. Please follow the guidance of those hosting platforms.
Cosmos SDK and Tendermint have a concept of pruning, which allows reducing the disk utilisation and storage required by a node over time.
There are two kinds of pruning controls available on a node:
Tendermint pruning: This impacts the `~/.cheqdnode/data/blockstore.db/` folder by only retaining the last n specified blocks. Controlled by the `min-retain-blocks` parameter in `~/.cheqdnode/config/app.toml`.
Cosmos SDK pruning: This impacts the `~/.cheqdnode/data/application.db/` folder and prunes Cosmos SDK app-level state (a logical layer higher than Tendermint, which is just peer-to-peer). These are set by the `pruning` parameters in the `~/.cheqdnode/config/app.toml` file.
This can be done by modifying the pruning parameters inside the `/home/cheqd/.cheqdnode/config/app.toml` file.
⚠️ In order for either type of pruning to work, your node should be running the latest stable release of cheqd-node (at least v1.3.0+).
You can check which version of `cheqd-noded` you're running using:
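```bash
cheqd-noded version
```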
The output should be a version higher than v1.3.0. If you're on a lower version, you will need to upgrade the node binary (manually or via the installer) while retaining your settings.
The instructions below assume that the home directory for the `cheqd` user is set to the default value of `/home/cheqd`. If this is not the case for your node, please modify the commands below to the correct path.
Check the systemd service status:
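For example, if running with Cosmovisor (the default service name; substitute as noted below):

```bash
systemctl status cheqd-cosmovisor.service
```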
(Substitute with `cheqd-noded.service` if you're running a standalone node rather than with Cosmovisor.)
Switch to cheqd user and configuration directory

Switch to the `cheqd` user and then to the `.cheqdnode/config/` directory:
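A sketch of these two steps, assuming the default home directory:

```bash
sudo su cheqd
cd ~/.cheqdnode/config/
```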
Before you make changes to the pruning configuration, you might want to capture the existing disk usage first (note: only copy the command, not the sample output):
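For example, assuming the default data directory path:

```bash
du -h -d 1 /home/cheqd/.cheqdnode/data/
```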
The `du -h -d 1 ...` command above prints the disk usage for the specified folder down to one folder level of depth (`-d 1` parameter) and prints the output in GB/MB (`-h` parameter, which prints human-readable values).
Open app.toml file for editing

Open the `app.toml` file once you've switched to the `~/.cheqdnode/config/` folder, using your preferred text editor, such as `nano`:
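```bash
nano app.toml
```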
⚠️ If your node was configured to work with release version v1.2.2 or earlier, you may have been advised to run in `pruning="nothing"` mode due to a bug in Cosmos SDK.
The file should already be populated with values. Edit the `pruning` parameter value to one of the following:
pruning="nothing"
(highest disk usage): This will disable Cosmos SDK pruning and set your node to behave like an "archive" node. This mode consumes the highest disk usage.
pruning="default"
(recommended, moderate disk usage): This keeps the last 100 states in addition to every 500th state, and prunes on 10-block intervals. This configuration is safe to use on all types of nodes, especially validator nodes.
pruning="everything"
(lowest disk usage): This mode is not recommended when running validator nodes. This will keep the current state and also prune on 10 blocks intervals. This settings is useful for nodes such as seed/sentry nodes, as long as they are not used to query RPC/REST API requests.
pruning="custom"
(custom disk usage): If you set the pruning
parameter to custom
, you will have to modify two additional parameters:
pruning-keep-recent
: This will define how many recent states are kept, e.g., 250
(contrast this against default
).
pruning-interval
: This will define how often state pruning happens, e.g., 50
(contrast against default
, which does it every 10 blocks)
pruning-keep-every
: This parameter is deprecated in newer versions of Cosmos SDK. You can delete this line if it's present in your app.toml
file.
Although the parameters named `pruning-*` are only supposed to take effect if the pruning strategy is `custom`, in practice it seems that in Cosmos SDK v0.46.10 these settings still impact pruning. Therefore, you're advised to comment out these lines when using `default` pruning.
Example configuration file with recommended settings:
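A sketch of the relevant pruning section, based on the recommendations above (your `app.toml` will contain many other settings; only the pruning-related lines are shown):

```toml
pruning = "default"

# Only used when pruning = "custom"; commented out as advised above
# pruning-keep-recent = "0"
# pruning-interval = "0"
```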
Configuring the `min-retain-blocks` parameter to a non-zero value activates Tendermint pruning, which specifies the minimum block height to retain. By default, this parameter is set to `0`, which disables this feature.
Enabling this feature can reduce disk usage significantly. Be careful in setting a value, as it must be at least higher than approximately 250,000 blocks, calculated as follows:
Unbonding period (14 days) converted to seconds = approx. 1,210,000 seconds
...divided by an average block time of approx. 6 seconds/block = approx. 210,000 blocks
Adding a safety margin (in case average block time goes down) = approx. 250,000 blocks
Therefore, this setting must always be carefully updated to a valid value if the unbonding time on the network you're running on is different. (E.g., this value is different on mainnet vs testnet due to the different unbonding periods.)
Using the recommended values, on the current cheqd mainnet this section would look like the following:
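A sketch of the relevant line, using the safety-margin value calculated above:

```toml
min-retain-blocks = 250000
```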
Save and exit from the `app.toml` file. Working with text editors is outside the scope of this document, but in general under `nano` this would be Ctrl+X, "yes" to `Save modified buffer`, then Enter.
ℹ️ NOTE: You need root, or at least a user with super-user privileges (using the `sudo` prefix), for the commands below when interacting with systemd.
If you switched to the `cheqd` user, exit back out to a root/super-user:
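```bash
exit
```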
Usually, this will switch you back to `root` or another super-user (e.g., `ubuntu`).
Restart systemd service:
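For example, if running with Cosmovisor:

```bash
sudo systemctl restart cheqd-cosmovisor.service
```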
(Substitute with `cheqd-noded.service` above if you're running without Cosmovisor.)
Check the systemd service status and confirm that it's running:
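For example:

```bash
sudo systemctl status cheqd-cosmovisor.service
```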
If you activate/modify any pruning configuration above, the changes to disk usage are NOT immediate. Typically, it may take 1-2 days for the disk usage reduction to be progressively applied.
If you've gone from a higher disk usage setting to a lower disk usage setting, re-run the disk usage command to compare the breakdown of disk usage in the node data directory:
The output should show a difference in disk usage from the previous run (before the settings were changed) for the `application.db` folder (if the `pruning` parameters were changed) and/or the `blockstore.db` folder (if `min-retain-blocks` was changed).
This document provides guidance on how to configure and promote a cheqd node to validator status. Having a validator node is necessary to participate in staking rewards, block creation, and governance.
You must already have a running `cheqd-node` instance installed using one of the supported methods. Please also ensure the node is fully caught up with the latest ledger updates.
Create a new account key (recommended method)
Follow the guidance on creating a new account key.
When you create a new key, a new account address and mnemonic backup phrase will be printed. Keep the mnemonic phrase safe, as this is the only way to restore access to the account if the keyring cannot be recovered.
P.S. If you're using a Ledger Nano device, it may be helpful to follow the Ledger Nano guidance further below.
Get your node ID
Follow the guidance on fetching your node ID.
Get your validator account address
The validator account address is generated in Step 1 above when a new key is added. To show the validator account address, follow the guidance on listing account keys.
(The assumption above is that only one account/key has been added on the node. If you have multiple addresses, please jot down the preferred account address.)
Ensure your account has a positive balance
Get your node's validator public key
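A sketch, assuming the standard Cosmos SDK CLI command for displaying the validator public key:

```bash
cheqd-noded tendermint show-validator
```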
Promote your node to validator status by staking your token balance
You can decide how many tokens you would like to stake from your account balance. For instance, you may want to leave a portion of the balance for paying transaction fees (now and in the future).
To promote your node to validator status, submit a `create-validator` transaction to the network:
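A hedged, illustrative sketch of such a transaction (placeholder values shown in angle brackets; exact flags and the pubkey format may vary by CLI version):

```bash
cheqd-noded tx staking create-validator \
  --amount 1000000000ncheq \
  --from <your-key-alias> \
  --moniker <your-moniker> \
  --min-self-delegation 1 \
  --pubkey "$(cheqd-noded tendermint show-validator)" \
  --commission-rate 0.05 \
  --commission-max-rate 0.20 \
  --commission-max-change-rate 0.01 \
  --chain-id cheqd-mainnet-1 \
  --gas auto \
  --gas-adjustment 1.4 \
  --gas-prices 50ncheq
```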
Parameters required in the transaction above are:
`amount`: Amount of tokens to stake. You should stake at least 1 CHEQ (= 1,000,000,000 ncheq) to successfully complete a staking transaction.
`from`: Key alias of the node operator account that makes the initial stake.
`min-self-delegation`: Minimum amount of tokens that the node operator promises to keep bonded.
`pubkey`: Node's `bech32`-encoded validator public key from the previous step.
`commission-rate`: Validator's commission rate. The minimum is set to `0.05`.
`commission-max-rate`: Validator's maximum commission rate, expressed as a number with up to two decimal points. The value for this cannot be changed later.
`commission-max-change-rate`: Maximum rate of change of a validator's commission rate per day, expressed as a number with up to two decimal points. The value for this cannot be changed later.
`chain-id`: Unique identifier for the chain.
For cheqd's current mainnet, this is `cheqd-mainnet-1`.
For cheqd's current testnet, this is `cheqd-testnet-6`.
`gas`: Maximum gas to use for this specific transaction. Using `auto` uses Cosmos's auto-calculation mechanism, but it can also be specified manually as an integer value.
`gas-adjustment` (optional): If you're using `auto` gas calculation, this parameter multiplies the auto-calculated amount by the specified factor, e.g., `1.4`. This is recommended so that it leaves enough margin of error to add a bit more gas to the transaction and ensure it successfully goes through.
`gas-prices`: Maximum gas price set by the validator. Default value is `50ncheq`.
Please note the parameters below are just an example. You will see the commission they set, the maximum rate they set, and the rate of change. Please use this as a guide when thinking of your own commission configuration. This is important to get right, because the `commission-max-rate` and `commission-max-change-rate` cannot be changed after they are initially set.
Check that your validator node is bonded
You can check that the validator is correctly bonded via any node:
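For example, using the public RPC endpoint mentioned earlier (an illustrative choice; any reachable node works):

```bash
cheqd-noded query staking validators --node https://rpc.cheqd.net:443 --output json
```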
Find your node by `moniker` and make sure that `status` is `BOND_STATUS_BONDED`.
Check that your validator node is signing blocks and taking part in consensus
Query the latest block. Open `<node-address>:<rpc-port>/block` in a web browser. Make sure that there is a signature with your validator address in the signature list.
To use your Ledger Nano you will need to complete the following steps:
Set up your wallet by creating a PIN and passphrase, which must be stored securely to enable recovery if the device is lost or damaged.
Connect your device to your PC and update the firmware to the latest version using the Ledger Live application.
Install the Cosmos application using the software manager (Manager > Cosmos > Install).
Adding a new key

In order to use the hardware wallet address with the CLI, the user must first add it via `cheqd-noded`. This process only records the public information about the key.
To import the key, first plug in the device and enter the device PIN. Once you have unlocked the device, navigate to the Cosmos app on the device and open it.
To add the key, use the following command:
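A sketch, using the flags described in the note below (the key alias and index are placeholders):

```bash
cheqd-noded keys add <key-alias> --ledger --index 0
```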
Note: The `--ledger` flag tells the command line tool to talk to the Ledger device, and the `--index` flag selects which HD index should be used.
When running this command, the Ledger device will prompt you to verify the generated address. Once you have done this, you will get output in the following form:
On completion of the steps above, you will have successfully bonded a node as a validator to the cheqd testnet and be participating in staking/consensus.
Validator nodes can get "jailed" along with a penalty imposed (through its stake getting slashed). Unlike a proof-of-work (PoW) network (such as Ethereum or Bitcoin), proof-of-stake (PoS) networks (such as the cheqd network, built using ) use from validators.
There are two scenarios in which a validator could be jailed, one of which has more serious consequences than the other.
When a validator "misses" blocks or doesn't participate in consensus, it can get temporarily jailed. By enforcing this check, PoS networks like ours ensure that validators are actively participating in the operation of the network, ensuring that their nodes remain secure and up-to-date with the latest software releases, etc.
How this duration is calculated is defined in the network's genesis parameters. Jailing occurs based on a sliding time window (called the signed blocks window), calculated as follows.
The `signed_blocks_window` (set to 25,920 blocks on mainnet) defines the time window that is used to calculate downtime.
Within this window of 25,920 blocks, at least 50% of the blocks must be signed by a validator. This is defined in the genesis parameter `min_signed_per_window` (set to `0.5` for mainnet).
Therefore, if a validator misses 12,960 blocks within the last 25,920 blocks it meets the criteria for getting jailed.
To convert this block window to a time period, consider the block time of the network, i.e., at what frequency a new block is created. The current block time can be checked on the cheqd block explorer or any other explorer configured for the cheqd network.
Let's assume the block time was 6 seconds. This equates to 12,960 * 6 = 77,760 seconds = ~21.6 hours. This means if the validator is not participating in consensus for more than ~21.6 hours (in this example), it will get temporarily jailed.
Since the block time of the network varies with the number of nodes participating, network congestion, etc., it's always important to calculate the time period based on the latest block time figures.
1% of all of the stake delegated to the node is slashed, i.e., burned and disappears forever. This includes any stake delegated to the node by external parties. (If a validator gets jailed, delegators may decide to switch whom they delegate to.) The percentage of stake to be slashed is defined in the `slash_fraction_downtime` genesis parameter.
During the downtime of a validator node, it is common for the node to miss important software upgrades, since it is no longer in the active set of nodes on the main ledger.
Therefore, the first step is checking that your node is up to date. You can execute the following command:
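```bash
cheqd-noded version
```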
The expected response will be the latest cheqd-noded software release. At the time of writing, the expected response would be
Once again, check if your node is up to date, following Step 1.
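A sketch of the status query (run on the node itself):

```bash
cheqd-noded status
```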
Expected response: In the output, look for the text `latest_block_height` and note the value. Execute the status command above a few times and make sure the value of `latest_block_height` has increased each time.
The node is fully caught up when the parameter `catching_up` returns the output `false`.
Additionally, you can check this has worked:
It should show you a page with the field `"version": "0.6.0"`.
If everything is up to date and the node has fully caught up, you can now unjail your node using this command in the cheqd CLI:
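A hedged sketch (placeholder key alias; adjust the chain ID and gas settings to your network):

```bash
cheqd-noded tx slashing unjail \
  --from <your-key-alias> \
  --chain-id cheqd-mainnet-1 \
  --gas auto --gas-adjustment 1.4 --gas-prices 50ncheq
```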
This document offers guidance for validators looking to move their node to another instance, for example when changing VPS/hosting provider.
The main tool required for this is cheqd's interactive installer.
Before completing the move, ensure the following checks are completed:
Copy config directory and data/priv_validator_state.json to a safe place

Check that your `config` directory and `data/priv_validator_state.json` are copied to a safe place where they cannot be affected by the migration.
If you are using Cosmovisor, use `systemctl stop cheqd-cosmovisor`. For all other cases, use `systemctl stop cheqd-noded`.
This step is of the utmost importance. If your node is not stopped correctly and two nodes are running with the same private keys, this will lead to a double-signing infraction, which results in your node being permanently jailed (tombstoned) and a 5% slash of staked tokens.
You will also be required to complete a fresh setup of your node.
Only after you have completed the preparation steps to shut down the previous node should the installation begin.
Once this has been completed, you will be able to move your existing keys and settings back.
The answers to the installer questions could be:
Here you can pick the version that you want.
`Set path for cheqd user's home directory [default: /home/cheqd]:`
This is essentially a question about where the home directory, `.cheqdnode`, is located or will be. It is up to the operator where they want to store the `data`, `config` and `log` directories.
`Do you want to setup a new cheqd-node? (yes/no) [default: yes]:`
Here the expected answer is `No`. The main idea is that our old `config` directory will be used and `data` will be restored from the snapshot, so we don't need to set up a new one.
`Select cheqd network to join (testnet/mainnet) [default: mainnet]:`
For now, we have two networks, `testnet` and `mainnet`. Type whichever chain you want to use, or just keep the default by pressing Enter.
`Install cheqd-noded using Cosmovisor? (yes/no) [default: yes]:`
This is also up to the operator.
`CAUTION: Downloading a snapshot replaces your existing copy of chain data. Usually safe to use this option when doing a fresh installation. Do you want to download a snapshot of the existing chain to speed up node synchronisation? (yes/no) [default: yes]`
On this question we recommend answering `Yes`, because it will help you catch up with the other nodes in the network. That is the main feature of this installer.
Copy the `config` directory to `CHEQD_HOME_DIRECTORY/.cheqdnode/`
Copy `data/priv_validator_state.json` to `CHEQD_HOME_DIRECTORY/.cheqdnode/data`
Make sure that permissions are set to `cheqd:cheqd` for the `CHEQD_HOME_DIRECTORY/.cheqdnode` directory. The following command can set this: `sudo chown -R cheqd:cheqd CHEQD_HOME_DIRECTORY/.cheqdnode`
Here, `CHEQD_HOME_DIRECTORY` is the home directory for the `cheqd` user. By default it's `/home/cheqd`, or whatever you answered during installation for the second question.
You need to specify the new external address here by running the next command as the `cheqd` user:
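As a sketch, the property can also be set directly in `config.toml`, assuming the default home directory (substitute your public IP or DNS name):

```bash
sed -i 's|^external_address = .*|external_address = "<public-ip-or-dns-name>:26656"|' /home/cheqd/.cheqdnode/config/config.toml
```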
The last thing in this doc is to run the service and check that everything works fine:
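For example (see the note on `<service-name>` below):

```bash
sudo systemctl start <service-name>
```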
where `<service-name>` is the name of the service, depending on whether `Install Cosmovisor` was selected or not:
`cheqd-cosmovisor` if Cosmovisor was installed.
`cheqd-noded` if you kept `cheqd-noded` as-is with the debian package approach.
To check that the service works, please run the next command:
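```bash
sudo systemctl status <service-name>
```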
where `<service-name>` has the same meaning as above. The status should be `Active (running)`.
If you're running a validator node, it's important to back up your validator's keys and state, especially before attempting any updates or moving nodes.
Each validator node has three files/secrets that must be backed up in case you want to restore or move a node. Anything not listed in scope below can easily be restored from a snapshot or otherwise replaced with fresh copies; as such, this list is the bare minimum that needs to be backed up.
`$CHEQD_HOME` is the data directory for your node, which defaults to `/home/cheqd/.cheqdnode`.
The validator private key is one of the most important secrets that uniquely identifies your validator, and it is what this node uses to sign blocks, participate in consensus, etc. This file is stored under `$CHEQD_HOME/config/priv_validator_key.json`.
In the same folder as your validator private key, there's another key called `$CHEQD_HOME/config/node_key.json`. This key is used to derive the node ID for your validator.
Backing up this key means if you move or restore your node, you don't have to change the node ID in the configuration files any peers have. This is only relevant (usually) if you're running multiple nodes, e.g., a sentry or seed node.
For most node operators who run a single validator node, this node key is NOT important and can be refreshed/created as new. It is only used for Tendermint peer-to-peer communication. Hypothetically, if you created a new node key (say, when moving a node from one machine to another) and then restored the `priv_validator_key.json`, this is absolutely fine.
The validator private state is stored in the `data` folder, not the `config` folder where most other configuration files are kept, and therefore often gets missed by validator operators during backup. This file is stored at `$CHEQD_HOME/data/priv_validator_state.json`.
This file stores the last block height signed by the validator and is updated every time a new block is created. Therefore, it should only be backed up after stopping the node service; otherwise, the data stored within this file will be in an inconsistent state. An example validator state file is shown below:
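An illustrative sketch of the file's structure (values shown are placeholders):

```json
{
  "height": "9123456",
  "round": 0,
  "step": 3,
  "signature": "<base64-signature>",
  "signbytes": "<hex-signbytes>"
}
```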
If you forget to restore to the validator state file when restoring a node, or when restoring a node from snapshot, your validator will double-sign blocks it has already signed, and get jailed permanently ("tombstoned") with no way to re-establish the validator.
The software upgrades, and the block heights they were applied at, are stored in `$CHEQD_HOME/data/upgrade-info.json`. This file is used by Cosmovisor to track automated updates, and informs it whether it should attempt an upgrade/migration or not.
The simplest way to back up the validator secrets listed above is to display them in your terminal:
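A sketch, assuming the default paths listed above:

```bash
cat /home/cheqd/.cheqdnode/config/priv_validator_key.json
cat /home/cheqd/.cheqdnode/config/node_key.json
cat /home/cheqd/.cheqdnode/data/priv_validator_state.json
```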
You can copy the contents of the file displayed in terminal off the server and store it in a secure location.
To restore the files, open the equivalent file on the machine where you want to restore them, using a text editor (e.g., `nano`), and paste in the contents:
You also need a running HashiCorp Vault server cluster you can use to proceed with this guide.
Once you have Vault CLI set up on the validator, you need to set up environment variables in your terminal to configure which Vault server the secrets need to be backed up to.
Add the following variables to your terminal environment. Depending on which terminal you use (e.g., bash, shell, zsh, fish etc), you may need to modify the statements accordingly. You'll also need to modify the values according to your validator and Vault server configuration.
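A minimal sketch using the standard Vault CLI environment variables (the server URL and token are placeholders; the backup script may expect additional variables):

```bash
export VAULT_ADDR="https://vault.example.com:8200"  # your Vault server URL (placeholder)
export VAULT_TOKEN="<your-vault-token>"             # token with access to the 'cheqd' secrets path
```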
Make the script executable:
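For example (the script filename is a placeholder; use the name of the file you downloaded):

```bash
chmod +x <vault-backup-script>.sh
```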
We recommend that you open the script using an editor such as `nano` and confirm that you're happy with the environment variables and settings in it.
Before backing up your secrets, it's important to stop the cheqd node service or Cosmovisor service; otherwise, the validator private state will be left in an inconsistent state and result in an incorrect backup.
If you're running via Cosmovisor (the default option), this can be stopped using:
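```bash
sudo systemctl stop cheqd-cosmovisor.service
```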
Or, if running as a standalone service:
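```bash
sudo systemctl stop cheqd-noded.service
```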
Once you've confirmed the cheqd service is stopped, execute the Vault backup script:
We use the HashiCorp Vault KV v2 secrets engine. Please make sure that it's enabled and mounted under the `cheqd` path.
To restore backed-up secrets from a Vault server, you can use the same script with the `-r` ("restore") flag:
If you're restoring to a different machine than the original machine the backup was done from, you'll need to go through the pre-requisites, CLI setup step, and download the Vault backup script to the new machine as well.
In this scenario, you're also recommended to disable the service (e.g., `cheqd-cosmovisor`) on the original machine. This ensures that if the (original) machine gets restarted, `systemd` does not try to start the node service, as this could potentially result in two validators running with the same validator keys (which will result in tombstoning).
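For example, on the original machine:

```bash
sudo systemctl disable cheqd-cosmovisor.service
```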
Once you've successfully restored, you can enable the service (e.g., `cheqd-cosmovisor`) on the new machine:
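For example, on the new machine:

```bash
sudo systemctl enable cheqd-cosmovisor.service
sudo systemctl start cheqd-cosmovisor.service
```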
When you set up your validator node, it is recommended that you only stake a very small amount from the actual validator node. This is to minimise the tokens that could be locked in an unbonding period, were your node to experience significant downtime.
You should delegate the rest of your tokens to your Validator node from a different key alias.
How do I do this?
You can add as many additional keys as you want using the following function:
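A sketch (the key alias is a placeholder):

```bash
cheqd-noded keys add <new-key-alias>
```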
When you create a new key, a mnemonic phrase and account address will be printed. Keep the mnemonic phrase safe, as this is the only way to restore access to the account if the keyring cannot be recovered.
You can view all created keys using the following function:
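```bash
cheqd-noded keys list
```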
You are able to transfer tokens between key accounts using the following function:
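A hedged sketch of a token transfer (all values are placeholders; amounts are in ncheq):

```bash
cheqd-noded tx bank send <from-key-alias> <destination-account-address> 1000000000ncheq \
  --chain-id cheqd-mainnet-1 --gas auto --gas-adjustment 1.4 --gas-prices 50ncheq
```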
You can then delegate to your validator node using the following function:
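A hedged sketch of a delegation (the validator operator address and amounts are placeholders):

```bash
cheqd-noded tx staking delegate <validator-operator-address> 1000000000ncheq \
  --from <delegator-key-alias> \
  --chain-id cheqd-mainnet-1 --gas auto --gas-adjustment 1.4 --gas-prices 50ncheq
```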
We use a second/different virtual machine to create these new accounts/wallets. In this instance, you only need to install cheqd-noded as a binary; you don't need to run it as a full node.
And then, since this VM is not running a node, you can append the `--node` parameter to any request and target the RPC port of the VM running the actual node (see the example after the list below).
That way:
The second node doesn't need to sync the full blockchain; and
You can separate out the keys/wallets, since the IP address of your actual node will be public by definition and people can attack it or try to break in
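For example, a query run from the second VM targeting a node's RPC endpoint (the account address is a placeholder; the endpoint is illustrative):

```bash
cheqd-noded query bank balances <account-address> --node https://rpc.cheqd.net:443
```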
I’d recommend at least 250 GB at the current chain size. You can choose to go higher, so that you don’t need to revisit this. Within our team, we set alerts on our cloud providers/Datadog to raise alerts when nodes reach 85-90% storage used which allows us to grow the disk storage as and when needed, as opposed to over-provisioning.
Here’s the relevant section in the file:
Green: 90-100% blocks signed
Amber: 70-90% blocks signed
Red: 1-70% blocks signed
Please join the channel 'mainnet-alerts' on the cheqd community slack.
Yes! Here are a few other suggestions:
You can check the current status of disk storage used on all mount points manually through the output of df -hT
The default storage path for cheqd-node is under `/home/cheqd`. By default, most hosting/cloud providers will set this up on a single disk volume under the `/` (root) path. If you move and mount `/home` on a separate disk volume, this will allow you to expand the storage independently of the main volume. This can sometimes make a difference, because if you leave the `/home` tree mounted on the `/` mount path, many cloud providers will force you to bump the whole virtual machine category - including the CPU and RAM - to a more expensive tier in order to get additional disk storage on `/`. This can also result in over-provisioning since the additional CPU/RAM is likely not required.
You can also optimise the amount of logs stored, in case the logs are taking up too much space. There’s a few techniques here:
In `config.toml` you can set the logging level to `error` for less logging than the default, which is `info`. (The other possible value for this is `debug`.)
Set the log rotation configuration to use different/custom parameters, such as what file size to rotate at, the number of days to retain, etc.
As a Validator Node, you should be familiar with the concept of commission. This is the percentage of tokens that you take as a fee for running the infrastructure on the network. Token holders are able to delegate tokens to you, with an understanding that they can earn staking rewards, but as consideration, you are also able to earn a flat percentage fee of the rewards on the delegated stake they supply.
There are three commission values you should be familiar with:
The first is the maximum rate of commission that you will be able to move upwards to.
Please note that this value cannot be changed once your Validator Node is set up, so be careful and do your research.
The second parameter is the maximum amount of commission you will be able to increase by within a 24 hour period. For example if you set this as 0.01, you will be able to increase your commission by 1% a day.
The third value is your current commission rate.
Points to note: a lower commission rate = higher likelihood of more token holders delegating tokens to you, because they will earn more rewards. However, with a very low commission rate, you might find in future that the gas fees on the network outweigh the rewards made through commission.
A higher commission rate = you earn more tokens from the existing stake + delegated tokens. The tradeoff is that it may appear less desirable for new delegators when compared to other validators.
When setting up the Validator, the Gas parameter is the amount of tokens you are willing to spend on gas.
For simplicity, we suggest setting:
AND setting:
These parameters, together, make it highly likely that the transaction will go through and not fail. Having the gas set at `auto` without the gas adjustment puts the transaction in danger of failing if gas prices increase.
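Based on the guidance above, those two settings might look like this on a transaction (illustrative):

```bash
--gas auto --gas-adjustment 1.4
```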
Gas prices also come into play here too: the lower your gas price, the more likely that your node will be considered in the active set for rewards.
We suggest that the gas price you set should fall within this recommended range:
Low: 25ncheq
Medium: 50ncheq
High: 100ncheq
Your public name is also known as your moniker. You are able to change this, as well as the description of your node, using the following function:
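A hedged sketch (flag names can vary between Cosmos SDK versions; the values are placeholders):

```bash
cheqd-noded tx staking edit-validator \
  --new-moniker "<new-public-name>" \
  --details "<description of your node>" \
  --from <your-key-alias> \
  --chain-id cheqd-mainnet-1 --gas auto --gas-adjustment 1.4 --gas-prices 50ncheq
```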
Yes, this is how you should do it. Since it's a public permissionless network, there's no way of pre-determining what the set of IP addresses will be, as entities may leave and join the network. We suggest using a TCP/network load balancer and keeping your VM/node in a private subnet for security reasons. The LB then becomes your network edge, which, if you're hosting on a cloud provider, they manage/patch/run for you.
Instructions on how to use text editors such as `nano` are out of the scope of this document. If you're unsure how to use one, consider following an online tutorial.
Ensure you've upgraded to the latest stable release. When running a validator node, you're recommended to change this value to `pruning="default"`.
Our documentation has a section on how to check service status.
Follow the guidance on querying account balances to check that your account is correctly showing the CHEQ testnet tokens provided to you.
The node validator public key is required as a parameter for the next step. More details on the validator public key are mentioned in the documentation.
When setting parameters such as the commission rate, a good benchmark is to consider the values set by other validators on the network.
Find out your validator address and look for `"ValidatorInfo":{"Address":"..."}`:
Learn more about what you can do with your new validator node in the rest of this documentation.
If your node is not up to date, please upgrade it first.
In general, the installer allows you to install the binary and download/extract the latest chain snapshot.
If the installation process was successful, the next step is to restore the configurations from the safe place you copied them to:
HashiCorp Vault is an open-source project that allows server admins to run a secure, access-controlled off-site backup for secrets. You can either self-host it or use a hosted cloud offering.
Before you get started with this guide, make sure you've set up the Vault CLI on the validator you want to run backups from.
Setting up a HashiCorp Vault cluster is outside the scope of this documentation, since it can vary a lot depending on your setup. If you don't already have this set up, HashiCorp's own tutorials and documentation are the best place to get started.
Download the Vault backup script from GitHub:
Yes, you can. You can do this by changing the pruning settings to more aggressive parameters in the `app.toml` file.
Please also see this thread on the trade-offs involved. This will help to some extent, but please note that it is a general property of all blockchains that the chain size will grow. We recommend using alerting policies to grow the disk storage as needed, which is less likely to require higher spend due to over-provisioning.
One of the simplest ways to do this is to use a block explorer, with a more detailed view available on the per-validator page. The condition is scored based on the percentage of blocks signed:
We have also built a tool internally that takes the condition score output from the block explorer GraphQL API and makes it available as a simple REST API, which can be used to send alerts on Slack, Discord, etc.; we have this set up on our own Slack/Discord.
In addition to that, there is a Prometheus metrics interface (for those who already use Prometheus for monitoring or want to set one up) that has metrics for monitoring node status (and a lot more).
You can have a look at other projects on Cosmos to get an idea of the percentages that nodes set as commission.