From fffc2a6ede81f9933efd7f447d38680e1a9e0278 Mon Sep 17 00:00:00 2001 From: Erik Grinaker Date: Thu, 16 Sep 2021 17:43:03 +0200 Subject: [PATCH] docs: add `README` for `snapshots` package (#10120) ## Description Closes: #10085 Adds some broad documentation for the `snapshots` package. --- ### Author Checklist *All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.* I have... - [x] included the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title - [x] added `!` to the type prefix if API or client breaking change - [x] targeted the correct branch (see [PR Targeting](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting)) - [x] provided a link to the relevant issue or specification - [x] followed the guidelines for [building modules](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules) - [x] included the necessary unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing) - [x] added a changelog entry to `CHANGELOG.md` - [x] included comments for [documenting Go code](https://blog.golang.org/godoc) - [x] updated the relevant documentation or specification - [x] reviewed "Files changed" and left comments if necessary - [x] confirmed all CI checks have passed ### Reviewers Checklist *All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.* I have... - [ ] confirmed the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title - [ ] confirmed `!` in the type prefix if API or client breaking change - [ ] confirmed all author checklist items have been addressed - [ ] reviewed state machine logic - [ ] reviewed API design and naming - [ ] reviewed documentation is accurate - [ ] reviewed tests and test coverage - [ ] manually tested (if applicable) --- snapshots/README.md | 236 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 236 insertions(+) create mode 100644 snapshots/README.md diff --git a/snapshots/README.md b/snapshots/README.md new file mode 100644 index 0000000000..6abb497b88 --- /dev/null +++ b/snapshots/README.md @@ -0,0 +1,236 @@ +# State Sync Snapshotting + +The `snapshots` package implements automatic support for Tendermint state sync +in Cosmos SDK-based applications. State sync allows a new node joining a network +to simply fetch a recent snapshot of the application state instead of fetching +and applying all historical blocks. This can reduce the time needed to join the +network by several orders of magnitude (e.g. weeks to minutes), but the node +will not contain historical data from previous heights. + +This document describes the Cosmos SDK implementation of the ABCI state sync +interface, for more information on Tendermint state sync in general see: + +* [Tendermint Core State Sync for Developers](https://medium.com/tendermint/tendermint-core-state-sync-for-developers-70a96ba3ee35) +* [ABCI State Sync Spec](https://docs.tendermint.com/master/spec/abci/apps.html#state-sync) +* [ABCI State Sync Method/Type Reference](https://docs.tendermint.com/master/spec/abci/abci.html#state-sync) + +## Overview + +For an overview of how Cosmos SDK state sync is set up and configured by +developers and end-users, see the +[Cosmos SDK State Sync Guide](https://blog.cosmos.network/cosmos-sdk-state-sync-guide-99e4cf43be2f). + +Briefly, the Cosmos SDK takes state snapshots at regular height intervals given +by `state-sync.snapshot-interval` and stores them as binary files in the +filesystem under `/data/snapshots/`, with metadata in a LevelDB database +`/data/snapshots/metadata.db`. The number of recent snapshots to keep are given by +`state-sync.snapshot-keep-recent`. + +Snapshots are taken asynchronously, i.e. new blocks will be applied concurrently +with snapshots being taken. This is possible because IAVL supports querying +immutable historical heights. However, this requires `state-sync.snapshot-interval` +to be a multiple of `pruning-keep-every`, to prevent a height from being removed +while it is being snapshotted. + +When a remote node is state syncing, Tendermint calls the ABCI method +`ListSnapshots` to list available local snapshots and `LoadSnapshotChunk` to +load a binary snapshot chunk. When the local node is being state synced, +Tendermint calls `OfferSnapshot` to offer a discovered remote snapshot to the +local application and `ApplySnapshotChunk` to apply a binary snapshot chunk to +the local application. See the resources linked above for more details on these +methods and how Tendermint performs state sync. + +The Cosmos SDK does not currently do any incremental verification of snapshots +during restoration, i.e. only after the entire snapshot has been restored will +Tendermint compare the app hash against the trusted hash from the chain. Cosmos +SDK snapshots and chunks do contain hashes as checksums to guard against IO +corruption and non-determinism, but these are not tied to the chain state and +can be trivially forged by an adversary. This was considered out of scope for +the initial implementation, but can be added later without changes to the +ABCI state sync protocol. + +## Snapshot Metadata + +The ABCI Protobuf type for a snapshot is listed below (refer to the ABCI spec +for field details): + +```protobuf +message Snapshot { + uint64 height = 1; // The height at which the snapshot was taken + uint32 format = 2; // The application-specific snapshot format + uint32 chunks = 3; // Number of chunks in the snapshot + bytes hash = 4; // Arbitrary snapshot hash, equal only if identical + bytes metadata = 5; // Arbitrary application metadata +} +``` + +Because the `metadata` field is application-specific, the Cosmos SDK uses a +similar type `cosmos.base.snapshots.v1beta1.Snapshot` with its own metadata +representation: + +```protobuf +// Snapshot contains Tendermint state sync snapshot info. +message Snapshot { + uint64 height = 1; + uint32 format = 2; + uint32 chunks = 3; + bytes hash = 4; + Metadata metadata = 5 [(gogoproto.nullable) = false]; +} + +// Metadata contains SDK-specific snapshot metadata. +message Metadata { + repeated bytes chunk_hashes = 1; // SHA-256 chunk hashes +} +``` + +The `format` is currently `1`, defined in `snapshots.types.CurrentFormat`. This +must be increased whenever the binary snapshot format changes, and it may be +useful to support past formats in newer versions. + +The `hash` is a SHA-256 hash of the entire binary snapshot, used to guard +against IO corruption and non-determinism across nodes. Note that this is not +tied to the chain state, and can be trivially forged (but Tendermint will always +compare the final app hash against the chain app hash). Similarly, the +`chunk_hashes` are SHA-256 checksums of each binary chunk. + +The `metadata` field is Protobuf-serialized before it is placed into the ABCI +snapshot. + +## Snapshot Format + +The current version `1` snapshot format is a zlib-compressed, length-prefixed +Protobuf stream of `cosmos.base.store.v1beta1.SnapshotItem` messages, split into +chunks at exact 10 MB byte boundaries. + +```protobuf +// SnapshotItem is an item contained in a rootmulti.Store snapshot. +message SnapshotItem { + // item is the specific type of snapshot item. + oneof item { + SnapshotStoreItem store = 1; + SnapshotIAVLItem iavl = 2 [(gogoproto.customname) = "IAVL"]; + } +} + +// SnapshotStoreItem contains metadata about a snapshotted store. +message SnapshotStoreItem { + string name = 1; +} + +// SnapshotIAVLItem is an exported IAVL node. +message SnapshotIAVLItem { + bytes key = 1; + bytes value = 2; + int64 version = 3; + int32 height = 4; +} +``` + +Snapshots are generated by `rootmulti.Store.Snapshot()` as follows: + +1. Set up a `protoio.NewDelimitedWriter` that writes length-prefixed serialized + `SnapshotItem` Protobuf messages. + 1. Iterate over each IAVL store in lexicographical order by store name. + 2. Emit a `SnapshotStoreItem` containing the store name. + 3. Start an IAVL export for the store using + [`iavl.ImmutableTree.Export()`](https://pkg.go.dev/github.com/tendermint/iavl#ImmutableTree.Export). + 4. Iterate over each IAVL node. + 5. Emit a `SnapshotIAVLItem` for the IAVL node. +2. Pass the serialized Protobuf output stream to a zlib compression writer. +3. Split the zlib output stream into chunks at exactly every 10th megabyte. + +Snapshots are restored via `rootmulti.Store.Restore()` as the inverse of the above, using +[`iavl.MutableTree.Import()`](https://pkg.go.dev/github.com/tendermint/iavl#MutableTree.Import) +to reconstruct each IAVL tree. + +## Snapshot Storage + +Snapshot storage is managed by `snapshots.Store`, with metadata in a `db.DB` +database and binary chunks in the filesystem. Note that this is only used to +store locally taken snapshots that are being offered to other nodes. When the +local node is being state synced, Tendermint will take care of buffering and +storing incoming snapshot chunks before they are applied to the application. + +Metadata is generally stored in a LevelDB database at +`/data/snapshots/metadata.db`. It contains serialized +`cosmos.base.snapshots.v1beta1.Snapshot` Protobuf messages with a key given by +the concatenation of a key prefix, the big-endian height, and the big-endian +format. Chunk data is stored as regular files under +`/data/snapshots///`. + +The `snapshots.Store` API is based on streaming IO, and integrates easily with +the `snapshots.types.Snapshotter` snapshot/restore interface implemented by +`rootmulti.Store`. The `Store.Save()` method stores a snapshot given as a +`<- chan io.ReadCloser` channel of binary chunk streams, and `Store.Load()` loads +the snapshot as a channel of binary chunk streams -- the same stream types used +by `Snapshotter.Snapshot()` and `Snapshotter.Restore()` to take and restore +snapshots using streaming IO. + +The store also provides many other methods such as `List()` to list stored +snapshots, `LoadChunk()` to load a single snapshot chunk, and `Prune()` to prune +old snapshots. + +## Taking Snapshots + +`snapshots.Manager` is a high-level snapshot manager that integrates a +`snapshots.types.Snapshotter` (i.e. the `rootmulti.Store` snapshot +functionality) and a `snapshots.Store`, providing an API that maps easily onto +the ABCI state sync API. The `Manager` will also make sure only one operation +is in progress at a time, e.g. to prevent multiple snapshots being taken +concurrently. + +During `BaseApp.Commit`, once a state transition has been committed, the height +is checked against the `state-sync.snapshot-interval` setting. If the committed +height should be snapshotted, a goroutine `BaseApp.snapshot()` is spawned that +calls `snapshots.Manager.Create()` to create the snapshot. + +`Manager.Create()` will do some basic pre-flight checks, and then start +generating a snapshot by calling `rootmulti.Store.Snapshot()`. The chunk stream +is passed into `snapshots.Store.Save()`, which stores the chunks in the +filesystem and records the snapshot metadata in the snapshot database. + +Once the snapshot has been generated, `BaseApp.snapshot()` then removes any +old snapshots based on the `state-sync.snapshot-keep-recent` setting. + +## Serving Snapshots + +When a remote node is discovering snapshots for state sync, Tendermint will +call the `ListSnapshots` ABCI method to list the snapshots present on the +local node. This is dispatched to `snapshots.Manager.List()`, which in turn +dispatches to `snapshots.Store.List()`. + +When a remote node is fetching snapshot chunks during state sync, Tendermint +will call the `LoadSnapshotChunk` ABCI method to fetch a chunk from the local +node. This dispatches to `snapshots.Manager.LoadChunk()`, which in turn +dispatches to `snapshots.Store.LoadChunk()`. + +## Restoring Snapshots + +When the operator has configured the local Tendermint node to run state sync +(see the resources listed in the introduction for details on Tendermint state +sync), it will discover snapshots across the P2P network and offer their +metadata in turn to the local application via the `OfferSnapshot` ABCI call. + +`BaseApp.OfferSnapshot()` attempts to start a restore operation by calling +`snapshots.Manager.Restore()`. This may fail, e.g. if the snapshot format is +unknown (it may have been generated by a different version of the Cosmos SDK), +in which case Tendermint will offer other discovered snapshots. + +If the snapshot is accepted, `Manager.Restore()` will record that a restore +operation is in progress, and spawn a separate goroutine that runs a synchronous +`rootmulti.Store.Restore()` snapshot restoration which will be fed snapshot +chunks until it is complete. + +Tendermint will then start fetching and buffering chunks, providing them in +order via ABCI `ApplySnapshotChunk` calls. These dispatch to +`Manager.RestoreChunk()`, which passes the chunks to the ongoing restore +process, checking if errors have been encountered yet (e.g. due to checksum +mismatches or invalid IAVL data). Once the final chunk is passed, +`Manager.RestoreChunk()` will wait for the restore process to complete before +returning. + +Once the restore is completed, Tendermint will go on to call the `Info` ABCI +call to fetch the app hash, and compare this against the trusted chain app +hash at the snapshot height to verify the restored state. If it matches, +Tendermint goes on to process blocks.