v1.x upgrade RCA

This page offers a post mortem, root cause analysis, and lessons learnt from the v1.x cheqd network upgrade.

Date: 30th January 2023
Status: Resolved

Introduction

In the lead-up to the cheqd v1.x mainnet upgrade, and during the release and upgrade itself, we faced a number of challenges that resulted in a brief halt of mainnet and the need for a fast-follow patched release.

As of 8th February 2023, this patched release has been applied across both testnet and mainnet validators. The network is working as expected, and the issue identified during the upgrade is resolved.

This RCA provides a brief overview of the root cause of the issues identified, the fix released in v1.2.5, and the lessons learnt by the cheqd Product & Engineering team.

Summary of events

On Monday 30th January, the cheqd team initiated a major network upgrade, v1.x, which introduced identity transaction pricing, among other features and fixes to the network (see the v1.x changelog).

Following the passing of the upgrade proposal (proposal #12), at roughly 09:30 GMT the network halted at block height 6,427,279, with the upgrade set to begin at height 6,427,280.

Over the subsequent 40 minutes, to 10:10 GMT, validators successfully upgraded to the new version, with consensus reached at 10:19 GMT. Shortly thereafter, after just ~10 blocks had been signed (height 6,427,290), a number of validators reported errors and timeouts on their nodes, for example:

ERR Stopping peer for error err="pong timeout" module=p2p peer={"Data":{},"Logger":{}}
Jan 30 13:01:32 ip-172-31-16-181 cosmovisor[695507]: 1:01PM INF Timed out dur=3000 height=6427331 module=consensus round=0 step=3

Diagnosis

By the end of the day, we had identified three major types of issue:

1. Cosmos SDK v0.46.x upstream bug which requires “pruning = nothing”

The pruning issue was largely related to the Dragonberry patch, a fix for a high-risk security vulnerability (dubbed “Dragonberry”) related to the IBC protocol / ICS23.

Within the patch, a strict check was introduced on horizontal states in trees. As a result, uneven heights, pruning, and state-sync no longer behaved as expected. Since the Dragonberry patch was introduced, it has been common for the store height not to equal the historic height, and this mismatch is what caused our pruning issue. It is largely due to a known issue in the iavl state tree package and Cosmos SDK, which occurs on store writes during upgrades.

2. Leftover legacy cheqd-noded versions

On some validators we found a leftover v0.6.x binary in the /usr/bin folder. Under Cosmovisor, /usr/bin/cheqd-noded is instead turned into a symlink that points to the actual binary in /home/cheqd/.cheqdnode/cosmovisor.
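A pre-upgrade check for this condition could look something like the sketch below. It is only an illustration: the function name classify_binary and its return labels are hypothetical, and the only paths assumed are the two named above.

```python
from pathlib import Path


def classify_binary(path: Path, cosmovisor_root: Path) -> str:
    """Describe what sits at `path` relative to a Cosmovisor directory.

    Returns one of (labels are illustrative, not from cheqd tooling):
      "missing"           - nothing exists at `path`
      "regular-file"      - a real file, e.g. a leftover legacy binary
      "symlink-ok"        - a symlink resolving to somewhere under `cosmovisor_root`
      "symlink-elsewhere" - a symlink resolving outside `cosmovisor_root`
    """
    # exists() follows symlinks, so also test is_symlink() to catch dangling links.
    if not path.exists() and not path.is_symlink():
        return "missing"
    if not path.is_symlink():
        return "regular-file"

    # Resolve both sides so intermediate symlinks do not cause false mismatches.
    target = path.resolve()
    root = cosmovisor_root.resolve()
    if root in target.parents:
        return "symlink-ok"
    return "symlink-elsewhere"
```

On a validator, this might be called as classify_binary(Path("/usr/bin/cheqd-noded"), Path("/home/cheqd/.cheqdnode/cosmovisor")). A result of "regular-file" would suggest a leftover binary of the kind described above.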

3. Corrupted database

Some nodes failed to execute the migration, apparently because of a problem with the new group module that Cosmos SDK introduced in v0.46.x, although this seems to be an edge case.

How come we didn’t pick this up on testnet?

During the testnet upgrade we experienced a consensus error due to an issue with module versions, among other issues (fortunately this is what testnet is for). As a result we started a fresh test network from block 0 + state export (testnet-6). With this fresh network, pruning was set to default (pruning="default").

This is the default setting for every node. However, this default option also means that approximately 3.5 weeks of state is kept, which is too long a window for the pruning issue to surface early. Switching to a more aggressive setting would have caught the issue earlier, for example:

pruning="custom"
pruning-keep-recent=50
pruning-interval=10 
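
To see why the default setting hid the problem, it helps to convert the number of retained states into wall-clock time. The sketch below makes two assumptions that are not stated on this page: that the default pruning strategy keeps the most recent 362,880 states, and that the average block time is roughly 6 seconds. Under those assumptions the default window comes out at about 25 days, in line with the "approximately 3.5 weeks" above, while keeping only 50 recent states shrinks the window to a few minutes.

```python
SECONDS_PER_DAY = 86_400

# Assumption: number of recent states kept by the default pruning strategy.
ASSUMED_DEFAULT_KEEP_RECENT = 362_880
# Assumption: average block time in seconds (illustrative, not measured).
ASSUMED_BLOCK_TIME_S = 6.0


def retention_days(keep_recent: int, block_time_s: float) -> float:
    """Wall-clock time, in days, covered by `keep_recent` retained states."""
    return keep_recent * block_time_s / SECONDS_PER_DAY


default_days = retention_days(ASSUMED_DEFAULT_KEEP_RECENT, ASSUMED_BLOCK_TIME_S)
custom_minutes = retention_days(50, ASSUMED_BLOCK_TIME_S) * 24 * 60

print(f"default: ~{default_days:.1f} days (~{default_days / 7:.1f} weeks)")
print(f"keep-recent=50: ~{custom_minutes:.0f} minutes")
```

With a window of minutes rather than weeks, pruning would have been exercised against post-upgrade state almost immediately after the testnet upgrade.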

Resolution

Immediate resolution

The immediate resolution to get the network running was offered to us by our validator community, as roughly 10 other Cosmos chains had faced the same issue when moving to the latest version.

To get the network back up and running, validators were advised to switch their pruning setting to pruning = "nothing". Once consensus was reached among validators who had made this change, the network was restored.

Long-term solution

Setting pruning to nothing was not an ideal solution: it meant the network was not being pruned, resulting in an ever-increasing chain size, increased storage requirements, and higher costs for validators.

Lessons learnt

1. Testnet extensive checks & snapshots

Going forward we should endeavour to make the testnet environment for testing upgrades as close to identical to mainnet as possible. While this may not always be possible, due to modules like IBC not being available on testnet, we should make sure that the delta between mainnet and testnet is documented comprehensively in advance of upgrades so that any potential issue can be quickly diagnosed.

Where there is any risk vector of a consensus fault going into an upgrade, full snapshots should be taken of the network before any upgrades are attempted.

2. Check other networks’ upgrade issues and downtimes in the Cosmos ecosystem

The issue we encountered during this upgrade could have been discovered in advance by aligning more proactively with other Cosmos chains. With a Validator community that spans far more than the cheqd Network, we should try and identify any potential issues experienced on other chains before bumping Cosmos SDK or IBC versions. Likewise, we should be circulating any issues we uncover. We’ll be more actively using this space within the product site to report on these going forward.

3. Consider how Cosmovisor preparations need to change if manual upgrade is required

While we’re thrilled that the Cosmovisor automatic installation process largely worked as intended for this upgrade, we still need to be mindful of how technical minutiae like symlinks are handled differently with manual upgrades. Having a clear overview of the delta between manual and Cosmovisor upgrade patterns will help isolate any potential issues in future upgrades.

4. Have a non-critical (high voting power) node take backups and snapshots

Reaching consensus initially at upgrade height took significantly longer than expected, because the node managed by the cheqd team (the cheqd node) had not yet been upgraded. The cheqd node was taking a snapshot at the halt height, to ensure there was a backup from immediately before the upgrade to roll back to in case of a major upgrade failure. Going forward, a secondary node will be used to take snapshots instead, to avoid delays.

5. Upgrades should NOT take place on Mondays

Generally we have intended to upgrade on Tuesdays or Wednesdays. This time, due to the urgency of this upgrade and team vacation, we opted for a Monday. This doesn’t allow much time to get everyone prepared, so we’ll avoid it going forward.

6. Improve upgrade height / time forecasting calculations

When submitting the mainnet upgrade proposal, like all other Cosmos-based networks, we log the intended upgrade block height. This is calculated by taking the current height and adding the number of blocks required to reach the agreed upgrade time and date. The number of blocks to add depends on the average time per block. Unfortunately our estimate was off, and the upgrade height was reached two hours earlier than planned. As a result, certain members of the team were unavailable at the upgrade time. In the future, we'll improve the accuracy of our estimates, ensure team availability is confirmed for a wider time window, and add sufficient buffer time to allow for block time being faster or slower for whatever reason.
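
The calculation described above can be sketched as follows. This is a minimal illustration rather than the script used for the proposal: the function names are hypothetical, and the figures in the example (a 7-day lead time, an estimated block time of 6.0 s, and an actual block time of 5.9 s) are made up purely to show how sensitive the arrival time is to small block-time errors.

```python
import math
from datetime import datetime, timedelta, timezone


def estimate_upgrade_height(current_height: int, now: datetime,
                            target_time: datetime, avg_block_time_s: float) -> int:
    """Current height plus the number of blocks expected before `target_time`."""
    seconds_until_target = (target_time - now).total_seconds()
    if seconds_until_target < 0:
        raise ValueError("target_time must not be in the past")
    return current_height + math.ceil(seconds_until_target / avg_block_time_s)


def arrival_time(current_height: int, now: datetime,
                 upgrade_height: int, actual_block_time_s: float) -> datetime:
    """When `upgrade_height` is actually reached at a given real block time."""
    blocks_remaining = upgrade_height - current_height
    return now + timedelta(seconds=blocks_remaining * actual_block_time_s)


# Hypothetical figures, chosen only to illustrate the drift.
now = datetime(2023, 1, 23, 10, 0, tzinfo=timezone.utc)
target = now + timedelta(days=7)
height = estimate_upgrade_height(6_000_000, now, target, avg_block_time_s=6.0)
actual = arrival_time(6_000_000, now, height, actual_block_time_s=5.9)
drift = target - actual

print(f"upgrade height: {height}")
print(f"reached {drift.total_seconds() / 3600:.1f} hours early")
```

In this example a block-time error of only 0.1 s over a week moves the upgrade by nearly three hours, which is why a buffer and a wider availability window matter as much as a better average.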

7. Create a dependent Internal & External services check-list

With network upgrades on this scale, a number of changes are required in our internal and external services that depend on the core ledger. Following the upgrade we began identifying which services had been knocked out by dependency changes, but this wasn’t done in a systematic, coordinated manner. Going forward, we’ll create a check-list of all internal and external services that need to be checked and updated following an upgrade, to reduce downtime.

8. Streamlining status updates, messaging & communications

Although running two forums (Discord & Slack) has generally worked, during this upgrade there was a clear misalignment which could have been resolved if everyone had been conversing in one place. Slack has remained the dominant location for communications with SSI vendor validators, and Discord for the Cosmos-based validators. Going forward we will review the use of two communities and decide how to coordinate better, likely reducing to one group.

. . .

Thank you to our validators for your patience and support throughout. Fortunately this didn’t cause any significant downtime. However, it could have been avoided through more stringent checks on testnet and more communication with other Cosmos SDK chains.

Separately, following changes to API paths within this upgrade, we also experienced dependency breaks (dependency hell) on a number of the internal and external services which use the network APIs.

After investigating a route to solve the pruning issue, we found a patch authored by Chill Validator, released in late November 2022, which had already been used as a solution by other Cosmos SDK based networks facing the same issue.

We also failed to make effective use of status.cheqd.net to provide updates. This needs to be incorporated more into the Product & Engineering team’s ways of working, particularly at critical points like upgrades.

You can also read more about our plans for the year ahead in Our Product Vision for 2023 at cheqd 🔮.