Disaster Recovery Plan for Cardano networks

Abstract

While the Cardano mainnet and other networks have proven to be highly resilient, it is necessary to proactively consider the possible recovery mechanisms and procedures that may be required in the unlikely event of a major failure where the network is unable to recover itself.

This CIP considers three representative scenarios and addresses specific considerations relevant in each case:

Scenario 1 - Long-Lived Network Partition
Scenario 2 - Failure to Make Blocks for an Extended Period of Time
Scenario 3 - Bad Blocks Minted on Chain

To ensure successful recovery in the event of a chain failure, it's crucial to establish effective communication channels and exercise recovery procedures in advance to familiarize the community and stake pool operators (SPOs) with the process.

This CIP is based on an earlier IOHK technical report that is referenced below, supplemented by internal documentation and discussions that have not been publicly released. It should be considered to be a living document that is reviewed and revised on a regular basis.

Note that although the focus of disaster recovery is on Cardano mainnet, since this is where the risk of loss of funds is greatest, the recovery procedures are generic and apply to other Cardano networks, including SanchoNet, Preview, PreProd or private networks. Appropriate adjustments may need to be made to reflect differences in timing or other concerns.

Motivation: why is this CIP necessary?

This CIP is needed to familiarize stakeholders with the processes and procedures that should be followed in the unlikely event that the Cardano mainnet, or another Cardano network, encounters a situation where the built-in on-chain recovery mechanisms fail.

Specification

While the exact recovery process will depend on the unique nature of the failure, there are three main scenarios we can consider.

Scenario 1: Long-Lived Network Partition

Ouroboros Praos is designed to cope with real-world networking conditions, in which some nodes may temporarily be disconnected from the network. In this case, the network will continue to make blocks, perhaps at some lower chain density (reflecting the temporary loss of stake to the network as a whole). As nodes rejoin the network, they will then participate in normal block production once again. In this way, the network remains resilient to changes in connectivity.

If many nodes become disconnected, the network could divide into two or more completely disconnected parts. Each part of the network could then form its own chain, backed by the stake that is participating in its own partition. Under normal conditions, Praos will also deal with this situation. When the partitioned group of nodes reconnects, the longest chain will dominate, and the shorter chain will be discarded. The nodes on the shorter chain will automatically roll back to the point where the fork occurred, and then rejoin the main chain. This is perfectly normal. Such forks will typically last only a few blocks.

However, in an extreme situation, the partition may persist beyond the Praos rollback limit of k blocks (currently 2,160 blocks on mainnet). In this case, the nodes will not be able to roll back to rejoin the main chain, since this would violate the required Praos guarantees.
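The rollback rule described above can be sketched as follows. This is a minimal illustration rather than the node's actual chain selection code: the function name and its return labels are hypothetical, each chain is reduced to the number of blocks it holds after the fork point, and treating a rollback of exactly k blocks as permitted is an assumption.

```python
K = 2160  # Praos rollback limit in blocks (mainnet value quoted in the text)


def fork_outcome(local_after_fork: int, candidate_after_fork: int, k: int = K) -> str:
    """Decide what a node does when it sees a competing chain.

    local_after_fork:     blocks on the node's own chain since the fork point
    candidate_after_fork: blocks on the competing chain since the fork point
    """
    if candidate_after_fork <= local_after_fork:
        # The node's own chain is at least as long, so nothing changes.
        return "keep-local-chain"
    if local_after_fork <= k:
        # Switching only discards up to k blocks, which Praos permits
        # (inclusive bound assumed here).
        return "automatic-rollback"
    # Switching would discard more than k blocks, so the operator must intervene.
    return "manual-recovery"


print(fork_outcome(3, 10))       # a short fork, resolved automatically
print(fork_outcome(2500, 9000))  # partition outlived the rollback limit
```

The third outcome corresponds to the long-lived partition case that the remediations in this scenario address.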

Remediations

Disconnected nodes must be reconnected to the main chain by their operators. This can be done by truncating the local block database to a point before the chain fork and then resyncing against the main network, using the db-truncator tool, for example.

Full node wallets can also be recovered in the same way, though this may require technical skills that the end users do not possess. It may be easier, if slower, for them to simply resynchronize their nodes from the start of the chain (i.e. from the genesis block).

Ouroboros Genesis provides additional resilience when recovering from long-lived network partitions. In Praos, nodes resyncing from a point before the chain fork could still in some cases follow the alternative chain (if it is the first one seen), and extra mechanisms may be needed to avoid this possibility. In Praos, for example, this may require that all participants on the alternative chain truncate the local block database prior to the partition being resolved. In Ouroboros Genesis, when resyncing from a point before the chain fork, the chain selection rules will ensure selection of the correct path for the main chain, assuming the partition has been resolved.

Alternative methods to resynchronise the node to the main chain might include the use of Mithril or other signed snapshots. These would allow faster recovery. However, in this case, care needs to be taken to achieve the correct balance of trust against speed of recovery.

Additional Effects on Cardano Users

Although block producing nodes will rejoin the main network following the remediation described above, the blocks that they have minted while they were disconnected will not be included in the main chain. This may have real world effects that will not be automatically remedied when the nodes rejoin the main chain. For example, transactions may have been processed that have significant real world value, or assumptions may have been made about chains of evidence/validity, or the timing of transactions. End users should be aware of the possibility and include provisions in their contracts to cover this eventuality. It may be necessary to resubmit some or all of the transactions that were processed on the minority chain onto the main chain. To avoid unexpected effects, this should be done by the end users/applications, and not by block producers acting on their behalf.

If they are not observant, stake pools, full node wallets and other node users (e.g. explorers) could continue indefinitely on the minority chain. Such users should take care to be aware of this situation and take steps to rejoin the main chain as quickly as possible. A reliable and trusted public warning system should be considered that can alert users and advise them on how to rejoin the main chain.

Timing Considerations

On Cardano mainnet, partitions of less than 2,160 blocks will automatically rejoin the main chain. With current Cardano mainnet settings, this represents a period of up to 12 hours during which automatic rollback will occur. If the partition exceeds 2,160 blocks, then the procedure described above will be necessary to allow nodes to rejoin the main chain. Other Cardano networks may have different timing characteristics.
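The 12-hour figure follows from the expected time needed to produce k blocks. The sketch below assumes the mainnet active slot coefficient f = 1/20 and a slot length of one second; neither value is stated in this section, so both are labelled as assumptions, and the helper names are illustrative only.

```python
from fractions import Fraction

K = 2160             # rollback limit in blocks (from the text)
F = Fraction(1, 20)  # ASSUMED active slot coefficient f (mainnet)
SLOT_SECONDS = 1     # ASSUMED slot length in seconds (mainnet)


def expected_slots(blocks: int, f: Fraction = F) -> Fraction:
    """Expected number of slots needed to produce `blocks` blocks at full density."""
    return blocks / f


def slots_to_hours(slots, slot_seconds: int = SLOT_SECONDS) -> Fraction:
    """Convert a slot count to hours of wall-clock time."""
    return Fraction(slots) * slot_seconds / 3600


rollback_window_hours = slots_to_hours(expected_slots(K))
print(rollback_window_hours)  # prints 12
```

Other networks can be checked by substituting their own values for k, f and the slot length.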

Scenario 2: Failure to Make Blocks for an Extended Period of Time

Ouroboros Praos requires at least one block to be produced every 3k/f slots. With the current Cardano mainnet settings, that is a 36 hour period. Such an event is extremely unlikely, but if it were to happen then the network would be unable to make any further blocks.

Mitigation

It is recommended to monitor the chain for block production. If a low density period is observed, then block producers should be notified, and efforts made to mint new blocks prior to the expiry of the 3k/f window. If this is not possible then the remediation procedures should be followed.
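A monitoring check along these lines might look like the sketch below. It is only an illustration: the function name, the status labels and the choice to raise an alert after a quarter of the window are all hypothetical, and f = 1/20 is an assumed mainnet value. How the current slot and the slot of the latest block are obtained is left to the operator's own tooling.

```python
from fractions import Fraction

K = 2160                         # security parameter k (from the text)
F = Fraction(1, 20)              # ASSUMED active slot coefficient f
MAX_GAP_SLOTS = int(3 * K / F)   # 3k/f = 129,600 slots


def gap_status(current_slot: int, last_block_slot: int,
               warn_fraction: Fraction = Fraction(1, 4)) -> str:
    """Classify the time since the last block against the 3k/f window.

    warn_fraction is a hypothetical tuning knob: the share of the window
    that may elapse before block producers are notified.
    """
    gap = current_slot - last_block_slot
    if gap < 0:
        raise ValueError("last block cannot be in the future")
    if gap >= MAX_GAP_SLOTS:
        return "window-expired"   # remediation procedures are required
    if gap >= warn_fraction * MAX_GAP_SLOTS:
        return "alert"            # notify block producers now
    return "ok"


print(gap_status(current_slot=50_000, last_block_slot=49_980))
print(gap_status(current_slot=90_000, last_block_slot=50_000))
```

An early alert threshold leaves time to coordinate with block producers before the window closes.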

Remediation

Identify a small group of block producing nodes that will be used to recover the chain. For Cardano mainnet, this group should have sufficient delegated stake to be capable of generating at least 9 blocks in a 36 hour window. It should be isolated from the rest of the network. The chain can then be recovered by resetting the wall clocks on the group of block producing nodes, restarting them from the last good block on the Cardano network, and playing forward the chain production at high speed (10x usual speed is recommended), while inserting new empty blocks at the slots which are allocated to the block producers. The recovery nodes can then be restarted with normal settings, including connections to the network. Ouroboros Genesis then allows other nodes in the network to rapidly resynchronize with the newly restored chain. This would leave one or more gaps in the chain, interspersed with empty blocks.
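Whether a candidate recovery group holds enough stake can be estimated roughly as below. The sketch uses the Praos slot leader probability 1 - (1 - f)^σ for a relative stake σ and a Poisson approximation for the number of slots won in the window. The values f = 0.05 and a one-second slot are assumptions, the function names are hypothetical, and the result is a probability estimate, not a guarantee.

```python
import math

WINDOW_SLOTS = 129_600  # 3k/f slots, 36 hours at an ASSUMED 1-second slot
F = 0.05                # ASSUMED active slot coefficient f


def expected_blocks(relative_stake: float, slots: int = WINDOW_SLOTS,
                    f: float = F) -> float:
    """Expected number of slots in the window led by the group."""
    p_leader = 1.0 - (1.0 - f) ** relative_stake
    return slots * p_leader


def prob_at_least(n_blocks: int, relative_stake: float) -> float:
    """Poisson estimate of the chance that the group leads >= n_blocks slots."""
    lam = expected_blocks(relative_stake)
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                for i in range(n_blocks))
    return 1.0 - below


for stake in (0.001, 0.005, 0.01):
    print(stake, round(expected_blocks(stake), 1), round(prob_at_least(9, stake), 4))
```

A group would normally be chosen with a comfortable margin above the minimum, since the slot lottery is random.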

Rewards Donation by Recovery Block Producers

In order to avoid allegations of unfair behaviour, block producing nodes that are used to recover the network should donate any rewards that they receive during recovery to the treasury.

Additional Effects on Cardano Users

Unlike Scenario 1, no transactions will have been processed that then need to be resubmitted to the chain. Users will, however, experience an extended period during which the chain is unavailable. Cardano applications and contracts should be designed with this possibility in mind. Full node wallets and other node users should recover quickly once the network is restarted, but there may be a period of instability while network connections are re-established and the Ouroboros Genesis snapshot is distributed across all nodes.

Timing Considerations

The chain will tolerate a gap of up to 3k/f slots (36 hours with current Cardano mainnet settings). A period of low chain density could have security implications that affect dynamic availability and leave open the possibility for future long range attacks. This may be particularly relevant should chain recovery be performed as described above (using less stake than is required for an honest majority). To mitigate the presence of an extended period of low chain density we may need to make use of the lightweight checkpointing mechanism in Ouroboros Genesis. Alternatively, Mithril could also be used to provide certified snapshots to stake pools as a means to verify the correct state of the ledger.

The adoption of Mithril for fast bootstrapping by light clients and edge nodes should help to mitigate risks for the types of users on the network that do not participate in consensus.

As described below, Ouroboros Genesis snapshots may also be useful as part of the recovery process.

Scenario 3: Bad Blocks Minted on Chain

In the event that a bad block were to be minted on-chain, then some or all validators might be unable to process the block. They would therefore stop, and be unable to restart. Wallet and other nodes might be unable to synchronise beyond the point of the bad block.

Remediation

Depending on the cause of the issue and its severity, alternative remediations might be possible.

Scenario 3.1: if some existing node versions were able to process the block, but others were not, then the chain would continue to grow at a lower chain density. SPOs would need to be persuaded to upgrade (or downgrade) to a suitable node version that would allow the chain to continue. The chain density would then gradually recover to its normal level. Other users would need to upgrade (or downgrade) to a version of the node that could follow the full chain.

Scenario 3.2: if no node version was able to process the block and a gap of less than 3k/f slots existed, then the chain could be rolled back to immediately before the bad block was created, and nodes restarted from this point. The chain would then grow as normal, with a small gap around the bad block. In this case, care would need to be taken that the rogue transaction was not accidentally reinserted into the chain. This might involve clearing node mempools, applying filters on the transaction, or developing and deploying a new node version that rejected the bad block.

Scenario 3.3: an alternative to rolling back would be to develop and deploy a "hot-fix" node that could accept the bad block, either as an exception, or as new acceptable behaviour. Nodes would then be able to incorporate the bad block as part of the chain, minting new blocks as usual, or following the chain. In this case, the bad block would persist on-chain indefinitely and future nodes would also need to accept the bad block. Such an approach is best used when the rejected block has behaviour that was unanticipated, but which is benign in nature. This will leave no abnormal gaps in the chain.

Scenario 3.4: if more than 3k/f slots have passed since the bad block was minted, then it will be necessary to roll back the chain to immediately prior to the bad block as in Scenario 3.2, and then proceed as described for Scenario 2. As with Scenario 2, this will leave a series of gaps in the chain that are interspersed with empty blocks.

Timing Considerations

If more than 3k/f slots have passed since the bad block was minted on-chain (36 hours with current Cardano mainnet settings), then a mix of recovery techniques will be needed, as described in Scenario 3.4. When deciding on the correct recovery technique for Scenarios 3.1-3.3, consideration should be given as to whether the recovery can be successfully completed before 3k/f slots have elapsed. In case of doubt, the procedure for Scenario 3.4 should be followed.
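One possible reading of the choice between Scenarios 3.1 to 3.4 is sketched below as a small decision helper. The function, its parameters and the ordering of the checks are hypothetical and simplify what would in practice be a judgement made by the people coordinating the recovery. The window of 129,600 slots assumes mainnet settings.

```python
MAX_GAP_SLOTS = 129_600  # 3k/f slots with ASSUMED mainnet settings


def choose_scenario(slots_since_bad_block: int,
                    estimated_recovery_slots: int,
                    some_versions_accept: bool,
                    block_is_benign: bool) -> str:
    """Pick the remediation for a bad block (illustrative only).

    estimated_recovery_slots is the operator's own estimate of how long the
    chosen fix would take. If the fix cannot finish inside the 3k/f window,
    fall back to Scenario 3.4, as the text advises in case of doubt.
    """
    if slots_since_bad_block + estimated_recovery_slots >= MAX_GAP_SLOTS:
        return "3.4"  # roll back, then recover as in Scenario 2
    if some_versions_accept:
        return "3.1"  # move SPOs to a node version that can process the block
    if block_is_benign:
        return "3.3"  # hot-fix node that accepts the block
    return "3.2"      # roll back to just before the bad block


print(choose_scenario(3_600, 7_200, some_versions_accept=True, block_is_benign=False))
print(choose_scenario(120_000, 20_000, some_versions_accept=False, block_is_benign=True))
```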

Using Ouroboros Genesis Snapshots

Any of the above conditions may result in a period of lower chain density. The updated consensus mechanism introduced in Ouroboros Genesis relies on making chain density comparisons to assist a node when catching up with the network, in order to reduce the reliance on having trusted peers when syncing. As such, low-density periods pose a potential security risk for the future; they are periods where a motivated adversary could perform a long-range attack by building a higher density chain.
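The idea of a density comparison can be illustrated with the simplified sketch below. It is not the Genesis implementation: chains are reduced to lists of the slots in which their blocks were made, the window length of 3k/f slots is an assumption, and the function names and tie handling are hypothetical.

```python
GENESIS_WINDOW_SLOTS = 129_600  # ASSUMED comparison window (3k/f slots)


def blocks_in_window(block_slots, intersection_slot, window=GENESIS_WINDOW_SLOTS):
    """Count blocks in the window that starts just after the intersection point."""
    end = intersection_slot + window
    return sum(1 for slot in block_slots if intersection_slot < slot <= end)


def denser_chain(chain_a_slots, chain_b_slots, intersection_slot,
                 window=GENESIS_WINDOW_SLOTS) -> str:
    """Return which of two forks is denser after their common intersection."""
    a = blocks_in_window(chain_a_slots, intersection_slot, window)
    b = blocks_in_window(chain_b_slots, intersection_slot, window)
    if a == b:
        return "tie"
    return "a" if a > b else "b"


# An honest chain with a low-density stretch (few blocks after slot 100)
# loses the comparison to a fabricated chain that is denser in that window.
honest = [110, 400, 900]
adversarial = [105, 120, 150, 200, 260, 330]
print(denser_chain(honest, adversarial, intersection_slot=100, window=1000))
```

The example shows why a low-density period is attractive to an attacker: the fabricated fork wins purely on block count within the window.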

In order to mitigate this, Genesis introduces the concept of lightweight checkpoints. A lightweight checkpoint is effectively a block point - a combination of block number and hash - which can be distributed along with the node. Unlike Mithril Snapshots (see below), Genesis lightweight snapshots are not assured by any committee - rather, they form part of the trusted codebase distributed with the node, or by other parties.

When syncing, a Genesis node will refuse to validate past the block number of any lightweight checkpoint if the chain does not contain the correct block at that point.
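The checkpoint rule can be sketched as follows. This is a minimal model rather than the node's own logic: a chain is a list of (block number, hash) pairs, checkpoints are a mapping from block number to the expected hash, and the function name and the sample hashes are made up for illustration.

```python
def validated_prefix(chain, checkpoints):
    """Return the part of `chain` a syncing node would accept.

    chain:       list of (block_number, block_hash) pairs in chain order
    checkpoints: dict mapping block_number -> expected block_hash
    Validation stops at the first block that contradicts a checkpoint.
    """
    accepted = []
    for block_no, block_hash in chain:
        expected = checkpoints.get(block_no)
        if expected is not None and expected != block_hash:
            break  # wrong block at a checkpointed height: refuse to go further
        accepted.append((block_no, block_hash))
    return accepted


checkpoints = {2: "bb"}  # hypothetical checkpoint shipped with the node
good_chain = [(1, "aa"), (2, "bb"), (3, "cc")]
forged_chain = [(1, "aa"), (2, "xx"), (3, "yy")]

print(validated_prefix(good_chain, checkpoints))    # the whole chain is accepted
print(validated_prefix(forged_chain, checkpoints))  # stops before block 2
```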

Genesis snapshots play two potential roles in disaster recovery:

In scenarios where the network is split, a lightweight snapshot could guide a node from the abandoned partition in connecting to the main partition. In general this should not be needed, however, since the main partition should win out in any Genesis density comparisons. This usage also falls closer to scenario 2, in that it relies on an external source imposing a chain selection, which must then be trusted by all parties.
Following a disaster recovery procedure, a sufficient number of blocks covering the low density period should be added to the list of lightweight checkpoints. These would serve the purpose of preventing a subsequent long-range attack.

Note that, in this second role, concerns about the legitimacy of the checkpoint are much less salient. The checkpoint can be issued post disaster recovery, at such a time where the points it contains are in the past, and are both agreed upon and easy to verify for all honest parties.

Using Mithril Snapshots

Mithril is a stake-based threshold multi-signature scheme. One of the applications of this protocol in Cardano is to create certified snapshots of the Cardano blockchain. Mithril snapshots allow nodes or applications to obtain a verified copy of the current state of the blockchain without having to download and verify the full history.

SPOs that participate in the Mithril network provide signed snapshots to a Mithril aggregator that is responsible for collecting individual signatures from Mithril signers and aggregating them into a multi-signature. Using this capability, the Mithril aggregator can then provide certified snapshots of the Cardano blockchain that can potentially be used as a trusted source for recovery purposes.

Provided that it gains sufficient adoption on the Cardano network and that snapshots continue to be signed by an honest majority of stake pools following a chain recovery event, Mithril may therefore provide an alternative solution to Ouroboros Genesis checkpoints as a way to verify the correct state of the ledger.

Recommended Actions for Cardano mainnet

Monitor Cardano mainnet for periods of low density and take early action if an extended period is observed.
Identify a collection of block producer nodes that has sufficient stake to mint at least 9 blocks in any 36 hour window.
Set up emergency communication channels with stake pool operators and other community members.
Practice disaster recovery procedures on a regular basis.
Provide signed Mithril snapshots and a way for full node wallet users and others to recover from this snapshot.
Determine how to employ Ouroboros Genesis snapshots as part of the disaster recovery process.

Community Engagement

One of the key requirements for successful disaster recovery will be proper engagement with the community.

Identify stake pool operators (SPOs) who can assist with disaster recovery
Discuss disaster recovery requirements with Intersect's Technical Working Groups and Security Council
Identify and establish the right communications channels with the community, including Intersect
Set up regular disaster recovery practice sessions

Rationale: how does this CIP achieve its goals?

This CIP outlines key disaster recovery scenarios that the Cardano community should understand to mitigate potential network outages. As a living document, it will be regularly reviewed and updated to inform stakeholders and encourage more detailed contingency planning. The CIP aims to facilitate discussions, establish recovery procedures, and encourage regular recovery practice exercises to ensure preparedness and validation of recovery actions in the event of an outage.

Path to Active

Acceptance criteria

The proposal has been reviewed by the community and sufficiently advertised on various channels.
- Intersect Technical Groups
- Intersect Discord Channels
- Cardano Forum
All major concerns or feedback have been addressed.

Implementation Plan

N/A

Change Log

Version	Date	Description
0.1	2024-08-30	Initial submitted version
0.2	2024-09-10	Revised version to emphasize genericity of recovery techniques
0.3	2024-09-18	Revised version following CIP editors meeting