v4.x Upgrade RCA (Testnet)
This page offers a Root Cause Analysis (RCA) and lessons learnt from the v4.x cheqd Testnet network upgrade.
1. Incident Summary
Network: cheqd testnet (cheqd-testnet-6)
Duration: ~30 hours (non-business)
Date: May 28 2025, 11:30 UTC - May 29 2025, ~17:30 UTC
Severity:
SEV-1: full chain halt (testnet)
Note on severity: Despite being a test network, we have denoted this the highest severity based on partner feedback that they require similar availability / uptime & performance from testnet and mainnet due to clients running PoCs / pilots on testnet.
Date: May 28 2025, 11:30 UTC - May 28 2025, 20:30 UTC
Duration: ~9 hours (non-business)
SEV-2: degraded read operations (testnet)
Note on severity: Despite being a test network, we have denoted this an elevated severity based on partner feedback that they require similar availability / uptime & performance from testnet and mainnet due to clients running PoCs / pilots on testnet.
Date: May 28 2025, 20:30 UTC - May 29 2025, 15:30 UTC
Duration: ~19 hours (non-business)
Impact: Testnet halted (SEV-1); no block production. Integration and test activities were disrupted (SEV-1) or partially degraded (SEV-2). Simulated user and development traffic failed or partially degraded (SEV-2). Furthermore, upon recovery, 1.5 days of state was lost due to the recovery method required.
2. What Happened?
A planned testnet upgrade v4.x was executed to apply and test a new version of the chain software. However, after the upgrade height, validators were unable to resume consensus due to a two-pronged issue.
Firstly, the halt (SEV-1) was caused by uneven state tree heights present in the network's historic state. Cosmos SDK uses IAVL trees for state versioning and for keeping internal records of upgrade stores. Over its recent iterations, IAVL has significantly improved its performance and its handling of uneven tree heights. Historically, however, unregistered module additions to the upgrade stores were not prohibited through strict validation, resulting in chains having to keep forks of both Cosmos SDK and its IAVL implementation; cheqd has maintained patched versions of both (along with other downstream dependencies), keeping them up to date with newer upstream versions, and has successfully upgraded multiple times with these patches in place.
In this upgrade, the halt arose because cheqd also had to maintain patched sub-packages of Cosmos SDK (the cosmossdk.io namespace) in addition to the already-forked library as a whole (github.com/cosmos/cosmos-sdk). To proceed with the next iteration of the upgrade, we had to publish the patched sub-packages, incorporate them explicitly into the release, and then cut the release, which allowed us to restore consensus.
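For illustration only, the sketch below shows how patched forks of these modules can be pinned in a chain's go.mod via replace directives so that the fixes ship with the release binary. The fork paths, tags and versions shown are placeholders, not the actual modules or versions used in the v4.x release.

```
// go.mod (illustrative sketch; fork paths and versions are placeholders)
module github.com/cheqd/cheqd-node

go 1.21

require (
	cosmossdk.io/store v1.1.0
	github.com/cosmos/cosmos-sdk v0.50.7
	github.com/cosmos/iavl v1.1.2
)

// Point the upstream modules at patched forks so the fixes are built
// into the release binary rather than waiting on upstream changes.
replace (
	cosmossdk.io/store => github.com/cheqd/cosmos-sdk/store v1.1.0-patched // hypothetical sub-package fork
	github.com/cosmos/cosmos-sdk => github.com/cheqd/cosmos-sdk v0.50.7-patched // hypothetical fork tag
	github.com/cosmos/iavl => github.com/cheqd/iavl v1.1.2-patched // hypothetical fork tag
)
```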
Secondly, whilst consensus was restored, we were informed by partners that pre-upgrade Decentralized Identifiers (DIDs) and any associated DID-Linked Resources (DLRs) were not readable / resolvable. This regression prompted an internal triage effort, with the team focusing on:
investigating the root cause; and
restoring the network to a prior state (ideally, pre-submission of the software upgrade proposal) as part of the recovery plan.
We successfully traced the unwanted behaviour to inconsistencies in data store keys arising from the introduction of the new Cosmos SDK Collections API. In short, key-value pairs handled by the new Collections API were partially incompatible with certain state-interacting keys written pre-upgrade. We have been taking steps to address these incompatibilities and to carry out any state migrations necessary. These actions are one-off and we do not expect further undesired behaviour.
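To make the incompatibility concrete, the following Go snippet is a purely illustrative sketch: the prefixes, key layouts and DID value are hypothetical and do not reflect cheqd's actual store schema, but they show how a record written under a legacy hand-rolled key is simply not found when looked up with a differently encoded key, which is the class of failure observed here.

```go
package main

import (
	"bytes"
	"fmt"
)

// Hypothetical store prefixes; the real cheqd schema differs.
var (
	legacyPrefix      = []byte("did:") // legacy hand-rolled prefix
	collectionsPrefix = []byte{0x01}   // compact numeric prefix in the new scheme
)

// legacyKey mimics a pre-upgrade key: string prefix + raw identifier bytes.
func legacyKey(did string) []byte {
	return append(append([]byte{}, legacyPrefix...), []byte(did)...)
}

// newKey mimics a post-upgrade key: numeric prefix + length-prefixed identifier.
func newKey(did string) []byte {
	out := append([]byte{}, collectionsPrefix...)
	out = append(out, byte(len(did))) // length prefix (illustrative only)
	return append(out, []byte(did)...)
}

func main() {
	did := "did:cheqd:testnet:example"

	// State written before the upgrade sits under legacy keys.
	store := map[string][]byte{
		string(legacyKey(did)): []byte("did document bytes"),
	}

	// Post-upgrade reads look the same DID up under the new key layout...
	_, found := store[string(newKey(did))]
	fmt.Println("lookup with new key finds record:", found) // false

	// ...because the two byte layouts are not equal.
	fmt.Println("key layouts identical:", bytes.Equal(legacyKey(did), newKey(did))) // false
}
```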
3. Root Cause
The testnet halt and subsequent data regression stemmed from two distinct but related technical failures:
1. State Inconsistency from Uneven IAVL Tree Heights
Impact: The upgrade failed because the required patch had not been applied, resulting in state inconsistency from uneven IAVL tree heights.
Root Cause: Legacy inconsistencies in IAVL state due to unvalidated upgrade store entries and historical unregistered module additions.
Technical Detail: Cosmos SDK and IAVL maintain internal state versioning via Merkleized tree structures. While IAVL has improved its handling of unbalanced trees, earlier network states accumulated inconsistencies that were not strictly validated at runtime.
Complications:
cheqd had been maintaining forked and patched versions of both IAVL and Cosmos SDK, with additional changes required for internal sub-packages in SDK v0.50.x. Newly bootstrapped networks do not face this complication, since their state trees start anew and uneven heights are addressed going forward only.
Furthermore, cheqd testnet was the first environment where this could be tested, as fresh (clean-slate) network deployments do not have sufficient state. cheqd has had multiple upgrades which included the patched versions of IAVL, none of which previously caused issues.
Finally, when attempting to recover the network, external (non-cheqd) validators were replaying a corrupted / malformed block which poisoned the fresh cheqd network. This meant that recoveries had to be attempted multiple times, losing valuable time, before resorting to a cheqd-only, firewall-protected network.
Outcome: Validators halted post-upgrade due to incompatible or unpatched submodule logic, until patched binaries were built, tested, and released.
2. Collections API Regression Affecting Key Decoding
Impact: Post-upgrade, existing Decentralized Identifiers (DIDs) and associated DID-Linked Resources became unreadable, due to partial incompatibility in key structure.
Root Cause: Mismatch between legacy key formats and the new encoding scheme introduced with Cosmos SDK's Collections API.
Technical detail: The new Collections API internally redefines how key-value pairs are serialized and accessed. Legacy keys persisted using different encoding logic, leading to lookup failures across previously stored data.
Complications:
When attempting to recover the network, external (non-cheqd) validators were still replaying a corrupted / malformed block which poisoned the fresh cheqd network, despite repeated communications asking testnet validators to shut down / restore their nodes. This meant that recoveries had to be attempted multiple times, losing valuable time, before resorting to a cheqd-only, firewall-protected network.
Outcome: Required development of custom migration logic and partial rollback/recovery to restore pre-upgrade data visibility.
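A minimal sketch of what such one-off migration logic could look like is shown below, assuming a generic key-value store interface and a caller-supplied re-keying function; it is illustrative only and not the actual cheqd migration code.

```go
package migration

// KVStore is a minimal key-value interface assumed for this sketch.
type KVStore interface {
	// Iterate visits every entry whose key starts with prefix; returning true from fn stops early.
	Iterate(prefix []byte, fn func(key, value []byte) bool)
	Set(key, value []byte)
	Delete(key []byte)
}

// MigrateLegacyKeys rewrites every entry stored under the legacy prefix into the
// new key layout expected after the upgrade, then removes the old entry.
// Intended to run exactly once, e.g. inside an upgrade handler.
func MigrateLegacyKeys(store KVStore, legacyPrefix []byte, rekey func(legacyKey []byte) []byte) {
	type move struct{ oldKey, newKey, value []byte }
	var moves []move

	// Collect first so the store is not mutated while being iterated.
	store.Iterate(legacyPrefix, func(key, value []byte) bool {
		moves = append(moves, move{
			oldKey: append([]byte{}, key...),
			newKey: rekey(key),
			value:  append([]byte{}, value...),
		})
		return false // continue iteration
	})

	for _, m := range moves {
		store.Set(m.newKey, m.value)
		store.Delete(m.oldKey)
	}
}
```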
4. Resolution
The cheqd engineering team implemented the following steps to recover the testnet and stabilize network operations:
Patch Preparation: Patched sub-packages of the Cosmos SDK were developed to resolve IAVL compatibility issues arising from legacy state inconsistencies. These were integrated into updated binaries and deployed internally to re-enable consensus.
Release Iteration Loop: Multiple patch iterations (v4.x-develop.y) were required due to malformed or invalid blocks propagated by non-cheqd-controlled validators. These corrupted blocks caused repeated consensus halts, necessitating reissuance and refinement of the patched releases.
Firewall Isolation: To regain operational stability for partners, cheqd introduced an application-level firewall around its own validator infrastructure. This isolation blocked external validator traffic and prevented propagation of malformed blocks, enabling internally controlled block production and testnet recovery without needing immediate coordination with the wider validator set.
Testnet Restoration: Rather than attempting to patch regressions in place (which risked introducing further inconsistencies and prolonging downtime), the team made the deliberate decision to restore the testnet from a snapshot taken prior to the software upgrade proposal submission. This approach ensured a clean, pre-upgrade state, avoiding compounding effects from corrupted or partially migrated state introduced after the upgrade. As a result, however, all state changes made between the upgrade proposal submission and the restoration point were lost (~1.5 days of state). This included any transactions, resource writes, or DIDs created during that window. While this rollback came with the cost of discarding some post-upgrade data, it was the safest and most deterministic path to resuming stable operations without risking deeper protocol-level inconsistencies. The snapshot restoration also served as a reset point to allow regression fixes to be developed and validated separately (without further modifying the live testnet) before reintroducing changes in a controlled upgrade. Typically, state exports are taken one block prior to the upgrade height, allowing restoration from immediately before the upgrade. Whilst this was attempted during this upgrade, the state export failed due to a known issue with Interchain Queries module registration, which this release (v4.x) would have addressed. It was not expected that this known issue would prevent state exports; once upgraded, we do not expect to see future issues with state exports.
Regression Triage and Devnet Validation: Functional regressions, including failures to resolve DIDs and DID-Linked Resources, were triaged separately. Fixes were developed and validated in a dedicated internal devnet, using production-like state to simulate real-world conditions. Only after full validation will these patches be promoted to testnet or a designated pre-prod environment. This release will be announced for coordination among validators in the upcoming days. It has not been applied at the time of writing.
Foundation for Future Continuity: This recovery experience highlighted the need for additional resilience strategies. The team has begun planning future enhancements such as dedicated internal staging environment(s) and alternative availability patterns (e.g. read-only API failovers) to reduce downtime and operational impact during upgrades.
5. Timeline (UTC)
| Time (UTC) | Event |
| --- | --- |
| May 28 2025, 11:30 UTC | Upgrade executed; network halts at upgrade height |
| [+1 hr] | Consensus failure confirmed; initial triage begins |
| [+2 hrs] | Root cause identified: IAVL tree inconsistency and SDK patching needed |
| [+6 hrs] | First patched binary prepared and deployed internally |
| [+6.5 hrs] | Consensus fails again due to malformed blocks from external validators |
| [+7 hrs] | Firewall implemented to isolate cheqd-controlled validators |
| [+7.5 hrs] | Network resumes block production in isolated environment |
| [+16 hrs] | Regression in DID/resource resolution observed and logged |
| [+18 hrs] | Decision made to restore testnet from pre-upgrade snapshot |
| [+22 hrs] | Testnet restored and stabilized behind firewall |
| [+2 business days] | Regression fixes developed, tested on devnet, and staged for release |
6. Impact Assessment
| Area | Status & Notes |
| --- | --- |
| Testnet Availability | Halted for ~9 hours; resumed under firewall control |
| Validators | No external coordination; cheqd-controlled nodes resumed internally |
| DIDs & Resources | Regression caused unreadable records post-upgrade; 1.5 days of state lost due to inability to export state as usual |
| Partner Operations | API services and integration environments temporarily impacted |
| Data Consistency | Ensured via pre-upgrade snapshot restoration; 1.5 days of state lost due to inability to export state as usual |
| Downtime Communication | Limited to internal team; no wide partner broadcast during early recovery |
7. Corrective Actions
1. Short-Term (Completed or In Progress)
2. Long-Term (Recommended or Planned)
Establish a dedicated environment reflecting mainnet conditions for safe upgrade rehearsals.
Introduce shadow fork mechanisms or parallel upgrade environments to enable high-fidelity, rollback-free testing.
Automate export/import upgrade simulations in CI pipelines for production-like state validation.
Define firewalling and isolation protocols to protect against gossip-layer propagation issues as seen during recovery.
Update the rollback playbook to include the use of snapshots and deterministic chain-state recovery for cases where state export is not possible.
Strengthen pre/during/post-upgrade communication processes, including partner notification cadences and internal escalation bridges.
8. Lessons Learned
1. Technical lessons learned
Environment segmentation is non-negotiable. A single testnet serving as both a development playground and a pre-prod network for clients introduces unacceptable risk. Without environment isolation, upgrade execution and confidence are diluted.
External validator behaviour can compromise recovery. Consensus stability depends on the integrity of peer inputs. Malformed or stale blocks from third-party validators can cascade into recovery loops. Firewall isolation is essential for controlled restoration.
Snapshot recovery is often more efficient than live-state patching. In complex regressions, rolling back to a clean snapshot and rebuilding from known good state is faster, safer, and more deterministic than attempting live remediation.
Separate regression recovery from outage recovery. Attempting to solve regressions (e.g., Collections API key incompatibilities) during a live outage increases risk and response time. Isolated devnet testing enables safe and focused iteration.
Communication during incidents must be structured and proactive. Downtime communication cannot rely on ad hoc updates. Timely coordination bridges and transparent partner-facing updates are key to managing expectations and maintaining trust.
Protocol maturity requires operational maturity. As the chain and its usage and adoption evolve, so must the processes and team structure supporting it. More complex modules, stricter upgrade sequencing, and deeper state interactions require stronger release engineering, staging environments, and automated safety nets.
2. Non-technical lesson learned: Communications to Testnet Users
A compounding issue on top of the technical RCA detailed above was the failure to notify the relevant users of testnet of the potential downtime ahead of attempting the upgrade.
Generally, for mainnet upgrades, we take a great amount of care to communicate the potential downtime for an upgrade, as well as the changes being made (and we had prepared sufficient mainnet comms). However, what we failed to take into consideration was that now, owing to the likes of Dock, Animo, DIDx and others maintaining pre-production implementations for their customers, testnet is almost as important as mainnet, with DIDs and DLRs being tested daily by our partners and their clients.
We have never historically communicated testnet upgrades in full to partners and users, and have primarily focussed on mainnet in the past. With testnet becoming increasingly used as a pseudo pre-production network, it is vital that we communicate all potential downtime and changes well in advance of a testnet upgrade.
| Type of notification | Communicated? | Note |
| --- | --- | --- |
| Testnet Validators | Yes | Proactively and retroactively communicated well |
| Testnet Users (Partners and known users) | No | Failed to proactively communicate to key partners and users |
| Broader developer community | No | Failed to proactively notify broader user-set of potential downtime |
3. Resolution
The cheqd product team will address the following items ahead of the next testnet upgrade:
Full testnet comms: Ensure that all known users of testnet, and the broader cheqd community, are fully aware of any potential downtime and have been informed well in advance of the timelines.