Product Docs
Product DocsTechnical DocsLearning & GovernanceUseful Links
  • Product Docs
  • Node Docs
  • Learning Docs
  • Node documentation for cheqd network
  • 👉Getting Started
    • Setup a new cheqd node
      • Pre-Requisites & Requirements
      • (Alternative) Install with Docker
      • Configure cosmovisor
    • Command Line usage
      • Manage keys
      • Manage accounts
      • Manage a node
      • Make transactions
      • Use fee abstraction
  • 🏦Validator Guides
    • Guide for validators
      • FAQs for validators
      • Optimising disk storage with pruning
      • Troubleshooting consistently high CPU/memory loads
      • Unjailing a jailed validator
      • Move validator to a different machine
      • Backup and restore node keys with Hashicorp Vault
    • Network-wide Software Upgrades
      • Upgrade Guides
        • Upgrade to v0.6.x
        • Upgrade to v2.x (with Ubuntu 24.04 LTS upgrade)
        • Upgrade to v3.x
        • Upgrade to v3.1.x
      • Creating a software upgrade proposal
      • Upgrade Process
  • 🧑‍💻Developing on cheqd
    • Building from source
      • Build with Docker
      • Run a localnet with Docker
    • Indexers
      • SubQuery
  • Architecture
    • Architecture Decision Record (ADR) Process
    • Identity ADRs
    • List of ADRs
      • ADR 001: Payment mechanism for issuing credentials
      • ADR 002: Importing/exporting mnemonic keys from Cosmos
      • ADR 003: Command Line Interface (CLI) tools
      • ADR 004: Token fractions
      • ADR 005: Genesis parameters
      • ADR 006: Community tax
      • ADR 007: Revocation registry
      • ADR 011: AnonCreds
      • ADR Template
  • 🤝Contributing to cheqd
    • Contributor Guide
    • License
    • Code of Conduct
    • Security Policy
  • 🆘Support
    • System Status
    • Slack Channel
    • Discord
Powered by GitBook
LogoLogo

General

  • Website
  • Blog
  • Get $CHEQ

Product Docs

  • Product Docs
  • cheqd Studio
  • Creds.xyz
  • Bug/Feature Requests

Technical Docs

  • Node Docs
  • GitHub
  • Block Explorer

Learning Docs

  • Learning Docs
  • Governance Docs
  • Governance Forum
  • Governance Explorer
On this page
  • Context
  • Diagnosing a CPU/memory leak
  • What does a CPU/memory leak look like?
  • What does a CPU leak look like?
  • Determining CPU/memory usage with command-line tools
  • Troubleshooting system clock synchronisation issues
  • Resolving system clock issues
  • NTP firewall rules
  • Troubleshooting node connectivity issues
  • Is your node's external address reachable?
  • Resolving connectivity issues due to blocked firewall ports
  • Connectivity issues due to blocked DNS traffic
  • Other troubleshooting steps
  • Is your machine underpowered?

Was this helpful?

Edit on GitHub
Export as PDF
  1. Validator Guides
  2. Guide for validators

Troubleshooting consistently high CPU/memory loads

Last updated 29 days ago

Was this helpful?

Context

Blockchain applications (especially when running validator nodes) are a-typical from "traditional" web server applications because their performance characteristics tend to be different in the way specified below:

  1. Tend to be more disk I/O heavy: Traditional web apps will typically offload data storage to persistent stores such as a database. In case of a blockchain/validator node, the database is on the machine itself, rather than offloaded to separate machine with a standalone engine. Many blockchains use for their local data copies. (In Cosmos SDK apps, such as cheqd, this is the , but can also be , , etc.) The net result is the same as if you were trying to run a database engine on a machine: the system needs to have fast read/write performance characteristics.

  2. Validator nodes cannot easily be auto-scaled: Many traditional applications can be horizontally (i.e., add more machines) or vertically (i.e., make current machine beefier) scaled. While this is possible for validator nodes, it must be done with extreme caution to ensure there aren't two instances of the same validator active simultaneously. This can be perceived by network consensus as a sign of compromised validator keys and lead to the . These concerns are less relevant for non-validating nodes, since they have a greater tolerance for missed blocks and can be scaled horizontally/vertically.

  3. Docker/Kubernetes setups are not recommended for validators (unless you really know what you're doing): Primarily due to the double-signing risk, it's (../setup-and-configure/docker.md) unless you have a strong DevOps practice. The other reason is related to the first point, i.e., a Docker setup adds an abstraction layer between the actual underlying file-storage vs the Docker volume engine. Depending on the Docker (or similar abstraction) storage drivers used, you may need to for optimal performance.

Diagnosing a CPU/memory leak

⚠️ Please ensure you are running the since they may contain fixes/patches that improve node performance.

What does a CPU/memory leak look like?

If you've got monitoring built in for your machine, a memory (RAM) leak would look like a graph where memory usage grows to 100%, falls off a cliff, grows to 100% again (the process repeats itself).

Normal memory usage may grow over time, but will not max out the available memory up to 100%. The graph below is taken from a server run by the cheqd team, over a 14-day period:

Figure 1: Graph showing normal memory usage on a cheqd-node server

What does a CPU leak look like?

A "CPU leak", i.e., where one or more process(es) consume increasing amounts of CPU is rarer, but could also happen if your machine has too few vCPUs and/or underpowered CPUs.

Figure 2: Graph showing normal CPU usage on a cheqd-node server

Figure 4: Graph showing CPU usage on Hetzner cloud, adding up to more than 100%

Check what accounting metric your monitoring tool uses to get a realistic idea of whether your CPU is overloaded or not.

Determining CPU/memory usage with command-line tools

If you don't have a monitoring application installed, you could use the built-in top or htop command.

Figure 2: Output of htop showing CPU and memory usage

htop is visually easier to understand than top since it breaks down usage per-CPU, as well as memory usage.

Troubleshooting system clock synchronisation issues

The net result of your system clock being out of sync is that your node:

  • Constantly tries to dial peers to try and fetch new blocks

  • Connection gets rejected by some/all of them

  • Keeps retrying the above until CPU/memory get exhausted, or the node process crashes

To check if your system clock is synchronised, use the following command (note: only copy the command, not the sample output):

root@hostname ~# timedatectl status
   Local time: Wed 2023-03-29 20:31:56 CEST
   Universal time: Wed 2023-03-29 18:31:56 UTC 
   RTC time: Wed 2023-03-29 18:31:57     
   Time zone: Europe/Berlin (CEST, +0200) 
   System clock synchronized: yes                         
   NTP service: active                      
   RTC in local TZ: no 

The timezone your machine is based in doesn't matter. You should check whether it reports System clock synchronized: yes and NTP service: active.

Resolving system clock issues

NTP firewall rules

You may also need to allow outbound UDP traffic on port 123 explicitly, depending on your firewall settings. This port is used by the Network Time Protocol (NTP) service.

Troubleshooting node connectivity issues

curl -v http://localhost:26657/net_info

The JSON output should be similar to below:

{"jsonrpc":"2.0","id":-1,"result":{"listening":true,"listeners":["Listener(@sentry1.ap.cheqd.net:26656)"],"n_peers":"47","peers":[{"node_info":{"protocol_version":{"p2p":"8","block":"11","app":"0"},"id":"c7b1c178adaf364917caaac67687051d1ed5bf53","listen_addr":"78.46.83.78:26656","network":"cheqd-mainnet-1","version":"0.34.24","channels":"40202122233038606100","moniker":"cstp-cheqd","other":{"tx_index":"on","rpc_address":"tcp://0.0.0.0:26657"}},"is_outbound":true,}]}}

Look for the n_peers value at the beginning: this shows the number of peers your node is connected. A healthy node would typically be connected to anywhere between 5-50 nodes.

Next, search the results for the term is_outbound. The number of matches for this term should exactly be the same as the value of n_peers, since this is printed once per peer. The value of is_oubound may either be true or false.

A healthy node should have a mix of is_outbound: true as well as is_outbound: false. If your node reports only one of these values, it's a strong indication that your node is unidirectionally connected/reachable, rather than bidirectionally reachable.

Unidirectional connectivity may cause your node to work overtime to stay synchronised with latest blocks on the network. You may fly by just fine - until there's a loss of connectivity to critical mass of peers and then your node goes offline.

Furthermore, your node might fetch the address book from seed nodes, and then try to resolve/contact them (and fail) due to connectivity issues.

Is your node's external address reachable?

Ideally, the IP address or DNS name set in external_address property in your config.toml file should be externally reachable.

Once you have tcptraceroute installed, from this external machine you can execute the following command in tcptraceroute <hostname> <port> format (note: only copy the actual command, not sample output):

user@hostname ~> sudo tcptraceroute seed1.eu.cheqd.net 26656
Selected device en0, address 192.168.4.42, port 53088 for outgoing packets
Tracing the path to seed1.eu.cheqd.net (116.202.176.48) on TCP port 26656, 30 hops max
1  192.168.4.1  3.049 ms  2.186 ms  5.693 ms
2  * * *
3  hari-core-2a-xe-806-0.network.virginmedia.net (94.173.50.205)  27.455 ms  16.619 ms  23.925 ms
4  * hari-core-2b-ae1-0.network.virginmedia.net (81.96.16.210) 33.225 ms  25.725 ms
5  * * *
6  * * *
7  tele-ic-7-ae2-0.network.virginmedia.net (62.253.175.34)  34.680 ms  19.670 ms  17.274 ms
8  ae15-0.lon10.core-backbone.com (80.255.14.105)  19.708 ms  26.629 ms  21.323 ms
9  ae6-2011.nbg40.core-backbone.com (80.255.14.246)  33.451 ms  30.159 ms  31.193 ms
10  core-backbone.hetzner.com (81.95.15.6)  33.430 ms  33.701 ms  31.949 ms
11  core11.nbg1.hetzner.com (213.239.229.161)  33.887 ms  34.907 ms  34.535 ms
12  spine11.cloud1.nbg1.hetzner.com (213.133.112.66)  66.511 ms  36.853 ms  32.539 ms
13  spine4.cloud1.nbg1.hetzner.com (213.133.108.150)  37.238 ms  43.259 ms  28.669 ms
14  * * *
15  15629.your-cloud.host (49.12.139.7)  27.337 ms  46.956 ms  33.213 ms
16  static.48.176.202.116.clients.your-server.de (116.202.176.48) [open]  39.811 ms  34.168 ms  1019.051 ms

A successful run would result in tcptraceroute reaching the destination server on the required port (e.g., 26656) and then hanging up. If the connection times out consistently at any of the hops, this could indicate there's a firewall / router in the path dropping or blocking connections.

Resolving connectivity issues due to blocked firewall ports

  1. Inbound TCP traffic on at least port 26656 (or custom P2P port)

  2. Optionally, inbound TCP traffic on other ports (RPC, gRPC, gRPC Web)

  3. Outbound TCP traffic on all ports

Router vs firewall issues

Besides firewalls, depending on your network infrastructure, your connectivity issue instead might lie in a router or Network Address Translation (NAT) gateway.

Operating system firewalls

In addition to infrastructure-level firewalls, Ubuntu machines also come with firewall on the machine itself. Typically, this is either disabled or set to allow all traffic by default.

root@hostname ~# ufw status
Status: inactive

Connectivity issues due to blocked DNS traffic

Another common reason for unidirectional node connectivity occurs when the correct P2P inbound/outbound traffic is allowed in firewalls, but DNS traffic is blocked by a firewall.

Your node needs the ability to lookup DNS queries to resolve nodes with DNS names as their external_address property to IP addresses, since other peers may advertise their addresses as a DNS name. Seed nodes set in config.toml are a common example of this, since these are advertised as DNS names.

Your node may still scrape by if DNS resolution is blocked, for example, by obtaining an address book from a peer that has already done DNS -> IP resolution. However, this approach can be liable to break down if the resolution is incorrect or entries outdated.

Firewall rules to allow DNS traffic

To enable DNS lookups, your infrastructure/OS-level firewalls should allow:

  1. Outbound UDP traffic on port 53: This is the most commonly-used port/protocol.

Checking whether DNS resolution works

root@hostname ~# dig +short txt ch whoami.cloudflare @1.1.1.1
"157.90.124.113"

If the lookup fails, that could indicate DNS queries or blocked, or there are no externally-resolvable IPs where the node can be reached.

Other troubleshooting steps

Is your machine underpowered?

Typically, this problem is seen if you (non-exhaustive list):

  • Have only one CPU (bump to at least two CPU)

  • Only 1-2 GB of RAM (bump to at least 4 GB)

Most cloud providers should allow dynamically scaling these two factors without downtime. Monitor - especially over a period of days/weeks - whether this improves the situation or not. If the CPU/memory load behaviour remains similar, that likely indicates the issue is different.

Scaling CPU/memory without downtime may be different you're running a physical machine, or if your cloud provider doesn't support it. Please follow the guidance of those hosting platforms.

There's a catch here: depending on your monitoring tool, "100% CPU" could be measured differently! The graph above is from .

Other monitoring tools, such as , count each CPU as "100%", thus making the overall figure displayed in the graph (shown below) add up to number of CPUs x 100%.

, regardless of the CPU usage.

Unfortunately, this only provides the real-time usage, rather than historical usage over time. Historical usage typically requires an external application, which many cloud providers provide, or through 3rd party monitoring tools such as , etc.

, in case you already have a Prometheus instance you can use or comfortable with using the software. This can allow alerting based on actual metrics emitted by the node, rather than just top-level system metrics which are a blunt instrument / don't go into detail.

If your , this could cause Tendermint peer-to-peer connections to be rejected. This is similar to in a normal browser when accessing secure (HTTPS) sites.

If either of these are not true, chances are that your system clock has fallen out of sync, and may be the root cause of CPU/memory leaks. Follow to resolve the issue, and then monitor whether it fixes high utilisation.

Properly-configured nodes should have bidirectional connectivity for network traffic. To check whether this is the case, open <node-ip-address-or-dns-name:rpc-port>/net_info in your browser, for example, .

Accessing this endpoint via your browser would only work and/or you're accessing from an allowed origin. If this is not the case, you can also view the results for this endpoint from the same machine where your node service is running through the command line:

To determine whether this is true, from a machine other than your node, . Unlike ping which uses ICMP packets, tcptraceroute uses TCP, i.e., the actual protocol used for Tendermint P2P to see if the destination is reachable. Success or failure in connectivity using ping doesn't prove whether your node is reachable, since firewalls along the path may have different rules for ICMP vs TCP.

Your firewall rules on the machine and/or infrastructure (cloud) provider could cause connectivity issues. Ideally, :

Outbound TCP traffic is the default mode on many systems, since the port through which traffic gets routed out is dynamically determined during TCP connection establishment. In some cases, e.g., when , you may require more complex configuration (outside the scope of this document).

Configuring OS-level firewalls is outside the scope of this document, but can generally be :

If ufw status reports active, follow to allow traffic on the required ports (customise the ports to the required ports).

Outbound TCP traffic on port 853 (explicit rule not needed if you already allow TCP outbound on all ports): Modern DNS servers also allow , which secures the connection using TLS to the DNS server. This can prevent malicious DNS servers from intercepting queries and giving spurious responses.

Outbound TCP traffic on port 443 (explicit rule not needed if you already allow TCP outbound on all ports): Similar to above, this enables , if supported by your DNS resolver.

To check DNS resolution work, try to run a DNS query and see if it returns a response. The following command will use the dig utility to look up and report your node's externally resolvable IP address via (note: only copy the command, not the sample output):

If your machine is provisioned with , you might find that the node struggles during times of high load, or slowly degrades over time. The minimum figures are recommended for a developer setup, rather than a production-grade node.

🏦
DigitalOcean's monitoring tools, which counts the sum of all CPU capacity as "100%"
Hetzner Cloud's
Load average is another useful measure of the responsiveness of a machine
Datadog
Tendermint / Cosmos SDK also provides a Prometheus metrics interface
system clock is out of synchronisation
how SSL/TLS connections can get rejected with a "handshake error"
this guide on setting time synchronisation in Ubuntu
rpc.cheqd.net/net_info
if traffic to your RPC port is allowed through your firewall
install tcptraceroute
your firewall rules should allow
using a NAT gateway in AWS
checked/configured using the ufw utility
this guide on configuring firewall rules using ufw
DNS-over-TLS
DNS-over-HTTPS
Cloudflare's 1.1.1.1 DNS resolver
the bare minimum of CPU and RAM
LevelDB
Golang implementation of LevelDB
C-implementation of LevelDB
RocksDB
node being jailed for double-signing blocks
tune the storage/volume engine options
latest stable release of cheqd-node
Graph showing memory/RAM usage on a cheqd-node server
Graph showing normal CPU usage on a cheqd-node server
Graph showing CPU usage on Hetzner, adding up to more than 100%
Output of htop command showing CPU and memory usage