v4.x Upgrade RCA (Testnet Partial Outage During Retrial)

A Root Cause Analysis (RCA) and lessons learnt from the v4.x cheqd Testnet network upgrade retrial.

1. Incident Summary

  • Network: cheqd testnet (cheqd-testnet-6)

  • Date: July 2, 2025, 9:30 AM UTC to 12:30 PM UTC

  • Duration: ~3 hours (outside business hours)

  • Severity:

    • SEV-2 – degraded network operations & delayed upgrade checkout

  • Impact: During the retrial of the previously failed upgrade, block production temporarily stalled, and technical checkout was delayed due to a malformed block from a third-party validator and repeated failures of the cheqd AP node. Recovery required coordinated validation and infrastructure restart before upgrade completion.

2. What Happened?

As part of a coordinated retrial of the cheqd testnet upgrade (v4.x), the upgrade and migrations were executed successfully and block production began as expected. However, technical validation failed due to the propagation of a malformed block by a non-cheqd validator.

cheqd infrastructure was firewall-isolated, which restored control, but during this period the cheqd Asia-Pacific (AP) validator node crashed unexpectedly. After restarting the node, block production resumed, but issues resolving historical identity transactions suggested lingering effects from the malformed block.

With partner approval, a second upgrade attempt was made. This succeeded in reinitiating block production, but the AP node failed again. Only after a full server reboot did the node stabilize. Final technical validation passed and the upgrade was declared successful, with the firewall isolation remaining in place.

3. Root Cause

The delay and partial outage during the upgrade retrial were caused by:

1. Corrupted state on several internal nodes and external validators

Impact: Block production stalled and identity lookups failed during technical validation.

Root Cause: Invalid genesis.json file present on affected nodes.

Technical Detail: Cosmos SDK v0.50.x and above introduces a new server architecture that shifts genesis initialisation, parsing, and validation, as well as module state initialisation, from the consensus layer to the application layer.

This also means that at each node startup, certain parameters from genesis.json are taken into account when calculating the expected App Hash.

In our case, due to an undiscovered bug in the interactive installer, the incorrect parameter was chain_id, which defaulted to cheqd-mainnet-1 because the --chain-id flag was missing from the cheqd-noded init command.

This did not affect nodes with a correct genesis.json file, i.e. all nodes that had been up and running since the network restart to cheqd-testnet-6, as well as those that manually downloaded the correct genesis.json from our GitHub repository.
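
For illustration, a minimal sketch of a correct node initialisation; the moniker, home directory, and genesis URL below are placeholders rather than the installer's actual values:

```bash
# Sketch only: NODE_HOME, the moniker, and GENESIS_URL are placeholders.
NODE_HOME="$HOME/.cheqdnode"
GENESIS_URL="<URL of the canonical genesis.json in the cheqd GitHub repository>"

# Initialise the node with an explicit chain ID so the generated genesis.json
# carries chain_id = "cheqd-testnet-6" instead of an unintended default.
cheqd-noded init my-node --chain-id cheqd-testnet-6 --home "$NODE_HOME"

# Replace the locally generated genesis with the canonical published copy, so
# every node starts from identical genesis parameters (and therefore computes
# the same expected App Hash).
curl -fsSL "$GENESIS_URL" -o "$NODE_HOME/config/genesis.json"
```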

Outcome: Firewall isolation of cheqd validators was re-applied to preserve a clean state and prevent further propagation of malformed blocks.

Remediation: To prevent this issue, and any other potential genesis file issues, from happening in the future, we modified our interactive installer to pass the correct --chain-id flag, and added a step that ensures the same genesis.json from our GitHub repository is downloaded on all nodes during installation.

We also added an installer option to check genesis.json integrity, so existing node operators can verify they have the correct version downloaded. This should also prevent the same issue from happening on mainnet.
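
As a rough sketch of what such an integrity check amounts to (the installer's actual implementation may differ; the URL and home directory are placeholders):

```bash
# Compare the checksum of the local genesis.json against the published copy.
GENESIS_URL="<URL of the canonical genesis.json in the cheqd GitHub repository>"
NODE_HOME="$HOME/.cheqdnode"

remote_sha=$(curl -fsSL "$GENESIS_URL" | sha256sum | awk '{print $1}')
local_sha=$(sha256sum "$NODE_HOME/config/genesis.json" | awk '{print $1}')

if [ "$remote_sha" = "$local_sha" ]; then
  echo "genesis.json matches the published copy"
else
  echo "genesis.json differs from the published copy; re-download it before restarting the node" >&2
  exit 1
fi
```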

2. AP Node Instability

Impact: The cheqd AP validator node got stuck twice, blocking technical checkout.

Root Cause: Pruning issue, most likely caused by the IAVL fork.

Technical Details: The AP node got stuck silently during both upgrade attempts. Restart attempts failed until the entire server was rebooted.

The root cause appears to be a pruning issue, as we managed to replicate the same behaviour on one of our seed nodes, which had the pruning = "everything" strategy applied.

Why did this happen only on the AP validator and not on the EU one? Simply because the AP node had a different pruning strategy applied (pruning = "custom"), which was also aggressive.
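
For reference, these strategies are configured in each node's app.toml. A quick way to check what a node is running (NODE_HOME is a placeholder; the keys are standard Cosmos SDK settings, and the values shown are illustrative rather than our exact configuration):

```bash
# Sketch: inspect the pruning configuration of a node.
NODE_HOME="$HOME/.cheqdnode"
grep -E '^pruning' "$NODE_HOME/config/app.toml"

# Typical output for the strategies discussed above (illustrative values):
#   pruning = "everything"       # keep only the latest state, prune aggressively
#   pruning = "custom"           # prune according to the two keys below
#   pruning-keep-recent = "100"
#   pruning-interval = "10"
#   pruning = "default"          # keep ~362880 recent states, prune every 10 blocks
```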

Here is one possible explanation we received from Vitwit's developers:

Regarding the pruning issue: whenever a major migration happens during an upgrade, the block after the upgrade is usually very heavy. It takes a lot of time and server resources to commit it into the db. Because the block is so heavy, it takes a lot of time to prune it from the db as well.

That's what happened here. Nodes with aggressive custom pruning or "everything" pruning hit this issue early. Once that block gets pruned, regular node operation resumes. Nodes with default pruning will have this issue as well, but they'll hit it later. We can't exactly reproduce this error locally, as a lot of state is required to be in the db pre-migration.

This explanation makes perfect sense: after some time and a couple of restarts, the node eventually recovered on its own. Also, during the initial recovery of seed1-ap, the node bootstrapped without any issues as soon as we disabled pruning.

On the other hand, our seed1-ap took around an hour and one or two forced restarts to resume regular node operations.
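
A minimal sketch of the recovery step described above, assuming the node runs as a systemd service (the service name and home directory are placeholders for our actual setup):

```bash
# Temporarily disable pruning so the node can get past the heavy post-upgrade
# block, then restart it. Restore the original strategy afterwards.
NODE_HOME="$HOME/.cheqdnode"

sudo systemctl stop cheqd-cosmovisor.service
sed -i.bak 's/^pruning = .*/pruning = "nothing"/' "$NODE_HOME/config/app.toml"
sudo systemctl start cheqd-cosmovisor.service

# Follow the logs until the node reports regular block processing again.
journalctl -u cheqd-cosmovisor.service -f
```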

On seed1-eu, we are still seeing the following error, which indicates a lingering pruning issue:

Jul 14 18:03:42 testnet-seed1-eu cosmovisor[1272032]: {"level":"error","module":"server","key":"KVStoreKey{0xc0023d1680, distribution}","err":"Value missing for key [17 21 213 182 6 156 108 205 61 241 54 89 206 116 158 171 187 69 158 6 99 93 113 52 168 161 67 223 127 79 183 230] corresponding to nodeKey 6e1115d5b6069c6ccd3df13659ce749eabbb459e06635d7134a8a143df7f4fb7e6","time":"2025-07-14T18:03:42+02:00","message":"failed to prune store"}

Outcome: Manual server reboot restored full node functionality and enabled upgrade completion.

4. Resolution

The cheqd engineering team implemented the following steps to recover from the degraded state:

  • Snapshots were taken in advance of the upgrade as a rollback safeguard (a sketch of this step follows this list).

  • Upgrade and migrations were performed successfully.

  • First technical checkout failed due to malformed block propagation; cheqd validators isolated.

  • AP node crash resolved via restart, but identity resolution still failed.

  • Partner consensus obtained to retry upgrade, with full awareness of rollback plan.

  • Second upgrade attempt initiated.

  • AP node failed again; resolved only after full server reboot.

  • Final technical validation succeeded.

  • Network remains firewalled to protect block propagation during post-upgrade monitoring.
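
As a rough sketch of the snapshot safeguard mentioned in the first step, assuming a systemd-managed node with placeholder paths and service name (the actual procedure uses our internal tooling):

```bash
# Stop the node shortly before the upgrade height and archive its data
# directory so the pre-upgrade state can be restored if a rollback is needed.
NODE_HOME="$HOME/.cheqdnode"
SNAPSHOT_DIR="/backups"

sudo systemctl stop cheqd-cosmovisor.service
tar -czf "$SNAPSHOT_DIR/cheqd-testnet-6_pre-upgrade_$(date -u +%Y%m%dT%H%M%SZ).tar.gz" \
  -C "$NODE_HOME" data
sudo systemctl start cheqd-cosmovisor.service
```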

5. Timeline (UTC)

| Time (UTC) | Event |
| --- | --- |
| July 2 2025, 11:00 AM UTC | Snapshot generated at upgrade height - 1 (10 minutes) |
| +10 minutes | Upgrade retrial executed; migrations run successfully |
| +30 minutes | Block production begins; technical checkout initiated |
| +50 minutes | Technical checkout fails; malformed block identified from external validator |
| +70 minutes | cheqd validators firewalled; AP node crashes unexpectedly |
| +90 minutes | AP node restarted; block production resumes; identity resolution issues detected |
| +120 minutes | Partner coordination completed (DIDx, Dock, Anonyme); decision made to retry upgrade |
| +130 minutes | Upgrade re-attempted; AP node fails again |
| +145 minutes | Full server reboot performed; AP node stabilises; technical checkout initiated |
| +170 minutes | Technical checkout passes; upgrade confirmed successful |

6. Impact Assessment

| Area | Status & Notes |
| --- | --- |
| Testnet Availability | Temporarily degraded; resolved after AP node reboot |
| Validators | Firewall-isolated cheqd validators preserved block integrity |
| DIDs & Resources | Temporary resolution issues; no state loss |
| Partner Operations | Coordination enabled retry with no rollback |
| Data Consistency | Maintained throughout; no restoration required |
| Downtime Communication | Strong coordination with key partners during retry decision |

7. Corrective Actions

1. Short-Term (Completed or In Progress)

8. Lessons Learned

  • Firewall isolation is essential to protect validator consensus from malicious or malformed peer input.

  • Unexpected infra-level failures (e.g., node crashes) can delay upgrades even after consensus is restored.

  • Partner comms and validation ahead of retry allowed faster, risk-aligned decision-making.

  • Snapshots provide strategic flexibility for deterministic rollback, even if unused.

  • Node observability and automated recovery tooling are critical for ensuring upgrade reliability in the face of partial outages.
