Troubleshooting consistently high CPU/memory loads
Blockchain applications (especially when running validator nodes) differ from "traditional" web server applications because their performance characteristics tend to be different in the ways specified below:
Tend to be more disk I/O heavy: Traditional web apps will typically offload data storage to persistent stores such as a database. In the case of a blockchain/validator node, the database is on the machine itself, rather than offloaded to a separate machine with a standalone engine. Many blockchains keep their local data copies in an embedded database engine. (In Cosmos SDK apps, such as cheqd, the database backend is configurable.) The net result is the same as if you were trying to run a database engine on a machine: the system needs to have fast read/write performance characteristics.
Validator nodes cannot easily be auto-scaled: Many traditional applications can be scaled horizontally (i.e., add more machines) or vertically (i.e., make the current machine beefier). While this is possible for validator nodes, it must be done with extreme caution to ensure there aren't two instances of the same validator active simultaneously. This can be perceived by network consensus as a sign of compromised validator keys and lead to the validator being penalised for double-signing. These concerns are less relevant for non-validating nodes, since they have a greater tolerance for missed blocks and can be scaled horizontally/vertically.
Docker/Kubernetes setups are not recommended for validators (unless you really know what you're doing): Primarily due to the double-signing risk, running validators in Docker is not recommended (see ../setup-and-configure/docker.md) unless you have a strong DevOps practice. The other reason is related to the first point: a Docker setup adds an abstraction layer between the underlying file storage and the Docker volume engine. Depending on the Docker (or similar abstraction) storage drivers used, you may need to tune the storage configuration for optimal performance.
⚠️ Please ensure you are running the latest stable release, since newer releases may contain fixes/patches that improve node performance.
If you've got monitoring built in for your machine, a memory (RAM) leak would look like a graph where memory usage grows to 100%, falls off a cliff, grows to 100% again (the process repeats itself).
Normal memory usage may grow over time, but will not max out at 100% of the available memory. The graph below is taken from a server run by the cheqd team, over a 14-day period:
Figure 1: Graph showing normal memory usage on a cheqd-node server
A "CPU leak", i.e., where one or more process(es) consume increasing amounts of CPU is rarer, but could also happen if your machine has too few vCPUs and/or underpowered CPUs.
Figure 2: Graph showing normal CPU usage on a cheqd-node server
Figure 4: Graph showing CPU usage on Hetzner cloud, adding up to more than 100%
Check what accounting metric your monitoring tool uses to get a realistic idea of whether your CPU is overloaded or not.
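As a rough illustration of the two accounting styles, the sketch below converts a reading from a tool that counts each CPU as 100% into a whole-machine figure between 0 and 100%. The function name and the example numbers are hypothetical; check your own monitoring tool's documentation for how it actually reports usage.

```python
def normalised_cpu_percent(summed_percent: float, cpu_count: int) -> float:
    """Convert a per-CPU-summed reading into a whole-machine percentage.

    summed_percent: value from a tool where each CPU counts as 100%
                    (so a 4-CPU machine can report up to 400%).
    cpu_count:      number of CPUs/vCPUs on the machine.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return summed_percent / cpu_count
```

For example, a reading of 250% on a 4-CPU machine corresponds to 62.5% of total capacity, which is busy but not overloaded.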
If you don't have a monitoring application installed, you could use the built-in top or htop command.
Figure 2: Output of htop showing CPU and memory usage
htop is visually easier to understand than top since it breaks down usage per-CPU, as well as memory usage.
The net result of your system clock being out of sync is that your node:
Constantly tries to dial peers to try and fetch new blocks
Connection gets rejected by some/all of them
Keeps retrying the above until CPU/memory get exhausted, or the node process crashes
To check if your system clock is synchronised, use the following command (note: only copy the command, not the sample output):
The timezone your machine is based in doesn't matter. You should check whether it reports System clock synchronized: yes and NTP service: active.
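If you want to automate this check, the sketch below parses status text for the two fields named above. It assumes the text comes from the timedatectl status command on a systemd-based machine (an assumption, since the command itself is not reproduced here); the function names are hypothetical.

```python
def parse_status_fields(output: str) -> dict:
    """Split 'Key: value' lines of the status output into a dictionary."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def clock_looks_healthy(output: str) -> bool:
    """True only if both fields mentioned in this guide report healthy values."""
    fields = parse_status_fields(output)
    return (
        fields.get("System clock synchronized") == "yes"
        and fields.get("NTP service") == "active"
    )
```

One possible usage is to capture the command output with subprocess.run(["timedatectl", "status"], capture_output=True, text=True) and pass its stdout to clock_looks_healthy.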
You may also need to allow outbound UDP traffic on port 123 explicitly, depending on your firewall settings. This port is used by the Network Time Protocol (NTP) service.
The JSON output should be similar to below:
Look for the n_peers value at the beginning: this shows the number of peers your node is connected to. A healthy node would typically be connected to anywhere between 5 and 50 nodes.
Next, search the results for the term is_outbound. The number of matches for this term should be exactly the same as the value of n_peers, since this is printed once per peer. The value of is_outbound may either be true or false.
A healthy node should have a mix of is_outbound: true as well as is_outbound: false. If your node reports only one of these values, it's a strong indication that your node is unidirectionally connected/reachable, rather than bidirectionally reachable.
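Counting the matches by hand gets tedious with many peers. The sketch below does the same tally from a saved copy of the net_info response. It assumes the response wraps its payload in a "result" object containing a "peers" list, which is the usual Tendermint RPC layout but is not shown in this guide; the function name is hypothetical.

```python
import json


def summarise_peers(net_info_json: str) -> dict:
    """Tally inbound and outbound peers from a net_info JSON response."""
    data = json.loads(net_info_json)
    # Assumption: RPC responses wrap the payload in "result"; fall back to
    # the top level if the wrapper is absent.
    result = data.get("result", data)
    peers = result.get("peers", [])
    outbound = sum(1 for peer in peers if peer.get("is_outbound") is True)
    inbound = sum(1 for peer in peers if peer.get("is_outbound") is False)
    return {
        # n_peers may be encoded as a string, so convert it explicitly.
        "n_peers": int(result.get("n_peers", len(peers))),
        "outbound": outbound,
        "inbound": inbound,
        "bidirectional": outbound > 0 and inbound > 0,
    }
```

If "bidirectional" comes back false, or the inbound and outbound counts do not add up to n_peers, that is a prompt to look at the connectivity checks described in this guide.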
Unidirectional connectivity may cause your node to work overtime to stay synchronised with the latest blocks on the network. You may get by just fine, until there's a loss of connectivity to a critical mass of peers and your node goes offline.
Furthermore, your node might fetch the address book from seed nodes, and then try to resolve/contact them (and fail) due to connectivity issues.
Ideally, the IP address or DNS name set in the external_address property in your config.toml file should be externally reachable.
Once you have tcptraceroute installed, from this external machine you can execute the following command in tcptraceroute <hostname> <port> format (note: only copy the actual command, not sample output):
A successful run would result in tcptraceroute reaching the destination server on the required port (e.g., 26656) and then hanging up. If the connection times out consistently at any of the hops, this could indicate there's a firewall / router in the path dropping or blocking connections.
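If tcptraceroute is not available on the external machine, a plain TCP connection attempt gives a coarser answer to the same question. The sketch below is not a traceroute: it only reports whether the port accepts a connection, not which hop drops the traffic. The function name and default timeout are hypothetical choices.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run it from a machine other than your node, passing the node's external address and P2P port (e.g., 26656). A False result does not say where the block is, so follow up with tcptraceroute where possible.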
Inbound TCP traffic on at least port 26656 (or custom P2P port)
Optionally, inbound TCP traffic on other ports (RPC, gRPC, gRPC Web)
Outbound TCP traffic on all ports
Besides firewalls, depending on your network infrastructure, your connectivity issue instead might lie in a router or Network Address Translation (NAT) gateway.
In addition to infrastructure-level firewalls, Ubuntu machines also come with a firewall on the machine itself. Typically, this is either disabled or set to allow all traffic by default.
Another common reason for unidirectional node connectivity occurs when the correct P2P inbound/outbound traffic is allowed in firewalls, but DNS traffic is blocked by a firewall.
Your node needs the ability to perform DNS lookups to resolve nodes with DNS names as their external_address property to IP addresses, since other peers may advertise their addresses as a DNS name. Seed nodes set in config.toml are a common example of this, since these are advertised as DNS names.
Your node may still scrape by if DNS resolution is blocked, for example, by obtaining an address book from a peer that has already done DNS -> IP resolution. However, this approach is liable to break down if the resolution is incorrect or the entries are outdated.
To enable DNS lookups, your infrastructure/OS-level firewalls should allow:
Outbound UDP traffic on port 53: This is the most commonly-used port/protocol.
If the lookup fails, that could indicate DNS queries are blocked, or there are no externally-resolvable IPs where the node can be reached.
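A lookup can also be scripted. The sketch below uses the operating system's resolver through Python's socket.getaddrinfo rather than the dig utility, so it tests the same resolution path your node process would rely on. The function name, the default port, and the injectable resolver parameter are hypothetical conveniences.

```python
import socket


def resolve_addresses(hostname, port=26656, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname.

    An empty list means the lookup failed, e.g., because DNS traffic is
    blocked or the name has no externally resolvable address.
    """
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling resolve_addresses with the DNS name of one of your seed nodes and getting an empty list back is a hint to review the DNS-related firewall rules.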
Typically, this problem is seen if you (non-exhaustive list):
Have only one CPU (bump to at least two CPUs)
Have only 1-2 GB of RAM (bump to at least 4 GB)
Most cloud providers should allow dynamically scaling these two factors without downtime. Monitor - especially over a period of days/weeks - whether this improves the situation or not. If the CPU/memory load behaviour remains similar, that likely indicates the issue is different.
Scaling CPU/memory without downtime may be different if you're running a physical machine, or if your cloud provider doesn't support it. Please follow the guidance of those hosting platforms.
There's a catch here: depending on your monitoring tool, "100% CPU" could be measured differently! The graph above is from .
Other monitoring tools, such as Hetzner cloud, count each CPU as "100%", thus making the overall figure displayed in the graph (shown below) add up to the number of CPUs x 100%.
, regardless of the CPU usage.
Unfortunately, this only provides the real-time usage, rather than historical usage over time. Historical usage typically requires an external application, which many cloud providers offer, or third-party monitoring tools.
If you already have a Prometheus instance you can use, or are comfortable with the software, you can also collect metrics from the node itself. This allows alerting based on actual metrics emitted by the node, rather than just top-level system metrics, which are a blunt instrument and don't go into detail.
If your system clock is out of sync, this could cause Tendermint peer-to-peer connections to be rejected. This is similar to the errors a normal browser shows when accessing secure (HTTPS) sites with an incorrect system clock.
If either of these is not true, chances are that your system clock has fallen out of sync, and this may be the root cause of the CPU/memory leaks. Restore time synchronisation on your machine to resolve the issue, and then monitor whether this fixes the high utilisation.
Properly-configured nodes should have bidirectional connectivity for network traffic. To check whether this is the case, open <node-ip-address-or-dns-name:rpc-port>/net_info in your browser.
Accessing this endpoint via your browser will only work if the RPC port is reachable from your machine and/or you're accessing from an allowed origin. If this is not the case, you can also view the results for this endpoint through the command line, from the same machine where your node service is running:
To determine whether this is true, install tcptraceroute on a machine other than your node. Unlike ping, which uses ICMP packets, tcptraceroute uses TCP, i.e., the actual protocol used for Tendermint P2P, to see if the destination is reachable. Success or failure in connectivity using ping doesn't prove whether your node is reachable, since firewalls along the path may have different rules for ICMP vs TCP.
Your firewall rules on the machine and/or infrastructure (cloud) provider could cause connectivity issues. Ideally, your firewall rules should allow:
Outbound TCP traffic is allowed by default on many systems, since the port through which traffic gets routed out is dynamically determined during TCP connection establishment. In some cases, e.g., when traffic is routed through a NAT gateway, you may require more complex configuration (outside the scope of this document).
Configuring OS-level firewalls is outside the scope of this document, but their state can generally be checked with the ufw status command. If ufw status reports active, allow traffic on the required ports (customise the ports to match your setup).
Outbound TCP traffic on port 853 (explicit rule not needed if you already allow TCP outbound on all ports): Modern DNS servers also allow DNS-over-TLS, which secures the connection to the DNS server using TLS. This can prevent malicious DNS servers from intercepting queries and giving spurious responses.
Outbound TCP traffic on port 443 (explicit rule not needed if you already allow TCP outbound on all ports): Similar to above, this enables DNS-over-HTTPS, if supported by your DNS resolver.
To check that DNS resolution works, try to run a DNS query and see if it returns a response. The following command will use the dig utility to look up and report your node's externally resolvable IP address (note: only copy the command, not the sample output):
If your machine is provisioned with only the minimum recommended CPU and RAM, you might find that the node struggles during times of high load, or slowly degrades over time. The minimum figures are recommended for a developer setup, rather than a production-grade node.