diff --git a/statediff/builder_test.go b/statediff/builder_test.go index d62bdc766..b2a9f65e9 100644 --- a/statediff/builder_test.go +++ b/statediff/builder_test.go @@ -32,7 +32,6 @@ import ( sdtypes "github.com/ethereum/go-ethereum/statediff/types" ) -// TODO: add test that filters on address var ( contractLeafKey []byte emptyDiffs = make([]sdtypes.StateNode, 0) diff --git a/statediff/doc.go b/statediff/doc.go deleted file mode 100644 index 184bc242c..000000000 --- a/statediff/doc.go +++ /dev/null @@ -1,57 +0,0 @@ -// Copyright 2019 The go-ethereum Authors -// This file is part of the go-ethereum library. -// -// The go-ethereum library is free software: you can redistribute it and/or modify -// it under the terms of the GNU Lesser General Public License as published by -// the Free Software Foundation, either version 3 of the License, or -// (at your option) any later version. -// -// The go-ethereum library is distributed in the hope that it will be useful, -// but WITHOUT ANY WARRANTY; without even the implied warranty of -// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -// GNU Lesser General Public License for more details. -// -// You should have received a copy of the GNU Lesser General Public License -// along with the go-ethereum library. If not, see . - -/* -Package statediff provides an auxiliary service that processes state diff objects from incoming chain events, -relaying the objects to any rpc subscriptions. - -This work is adapted from work by Charles Crain at https://github.com/jpmorganchase/quorum/blob/9b7fd9af8082795eeeb6863d9746f12b82dd5078/statediff/statediff.go - -The service is spun up using the below CLI flags ---statediff: boolean flag, turns on the service ---statediff.streamblock: boolean flag, configures the service to associate and stream out the rest of the block data with the state diffs. ---statediff.intermediatenodes: boolean flag, tells service to include intermediate (branch and extension) nodes; default (false) processes leaf nodes only. ---statediff.watchedaddresses: string slice flag, used to limit the state diffing process to the given addresses. Usage: --statediff.watchedaddresses=addr1 --statediff.watchedaddresses=addr2 --statediff.watchedaddresses=addr3 - -If you wish to use the websocket endpoint to subscribe to the statediff service, be sure to open up the Websocket RPC server with the `--ws` flag. The IPC-RPC server is turned on by default. - -The statediffing services works only with `--syncmode="full", but -importantly- does not require garbage collection to be turned off (does not require an archival node). - -e.g. - -$ ./geth --statediff --statediff.streamblock --ws --syncmode "full" - -This starts up the geth node in full sync mode, starts up the statediffing service, and opens up the websocket endpoint to subscribe to the service. -Because the "streamblock" flag has been turned on, the service will strean out block data (headers, transactions, and receipts) along with the diffed state and storage leafs. - -Rpc subscriptions to the service can be created using the rpc.Client.Subscribe() method, -with the "statediff" namespace, a statediff.Payload channel, and the name of the statediff api's rpc method- "stream". - -e.g. - -cli, _ := rpc.Dial("ipcPathOrWsURL") -stateDiffPayloadChan := make(chan statediff.Payload, 20000) -rpcSub, err := cli.Subscribe(context.Background(), "statediff", stateDiffPayloadChan, "stream"}) -for { - select { - case stateDiffPayload := <- stateDiffPayloadChan: - processPayload(stateDiffPayload) - case err := <- rpcSub.Err(): - log.Error(err) - } -} -*/ -package statediff diff --git a/statediff/doc.md b/statediff/doc.md new file mode 100644 index 000000000..622229c0a --- /dev/null +++ b/statediff/doc.md @@ -0,0 +1,207 @@ +This package provides an auxiliary service that asynchronously processes state diff objects from chain events, +either relaying the state objects to rpc subscribers or writing them directly to Postgres. + +It also exposes RPC endpoints for fetching or writing to Postgres the state diff `StateObject` at a specific block height +or for a specific block hash, this operates on historic block and state data and so is dependent on having a complete state archive. + +# Statediff Object +A state diff `StateObject` is the collection of all the state and storage trie nodes that have been updated in a given block. +For convenience, we also associate these nodes with the block number and hash, and optionally the set of code hashes and code for any +contracts deployed in this block. + +A complete state diff `StateObject` will include all state and storage intermediate nodes, which is necessary for generating proofs and for +traversing the tries. + +```go +// StateObject is a collection of state (and linked storage nodes) as well as the associated block number, block hash, +// and a set of code hashes and their code +type StateObject struct { + BlockNumber *big.Int `json:"blockNumber" gencodec:"required"` + BlockHash common.Hash `json:"blockHash" gencodec:"required"` + Nodes []StateNode `json:"nodes" gencodec:"required"` + CodeAndCodeHashes []CodeAndCodeHash `json:"codeMapping"` +} + +// StateNode holds the data for a single state diff node +type StateNode struct { + NodeType NodeType `json:"nodeType" gencodec:"required"` + Path []byte `json:"path" gencodec:"required"` + NodeValue []byte `json:"value" gencodec:"required"` + StorageNodes []StorageNode `json:"storage"` + LeafKey []byte `json:"leafKey"` +} + +// StorageNode holds the data for a single storage diff node +type StorageNode struct { + NodeType NodeType `json:"nodeType" gencodec:"required"` + Path []byte `json:"path" gencodec:"required"` + NodeValue []byte `json:"value" gencodec:"required"` + LeafKey []byte `json:"leafKey"` +} + +// CodeAndCodeHash struct for holding codehash => code mappings +// we can't use an actual map because they are not rlp serializable +type CodeAndCodeHash struct { + Hash common.Hash `json:"codeHash"` + Code []byte `json:"code"` +} +``` +These objects are packed into a `Payload` structure which additionally associates the StateObject +with the block (header, uncles, and transactions), receipts, and total difficulty. +This `Payload` encapsulates all the block and state data at a given block, and allows us to index the entire Ethereum data structure +as hash-linked IPLD objects. + +```go +// Payload packages the data to send to state diff subscriptions +type Payload struct { + BlockRlp []byte `json:"blockRlp"` + TotalDifficulty *big.Int `json:"totalDifficulty"` + ReceiptsRlp []byte `json:"receiptsRlp"` + StateObjectRlp []byte `json:"stateObjectRlp" gencodec:"required"` + + encoded []byte + err error +} +``` + +# Usage +This state diffing service runs as an auxiliary service concurrent to the regular syncing process of the geth node. + + +## CLI configuration +This service introduces a CLI flag namespace `statediff` + +`--statediff` flag is used to turn on the service +`--statediff.writing` is used to tell the service to write state diff objects it produces from synced ChainEvents directly to a configured Postgres database +`--statediff.db` is the connection string for the Postgres database to write to +`--statediff.dbnodeid` is the node id to use in the Postgres database +`--statediff.dbclientname` is the client name to use in the Postgres database + +The service can only operate in full sync mode (`--syncmode=full`), but only the historical RPC endpoints require an archive node (`--gcmode=archive`) + +e.g. +` +./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid={nodeId} --statediff.dbclientname={dbClientName} +` + +## RPC endpoints +The state diffing service exposes both a WS subscription endpoint, and a number of HTTP unary endpoints. + +Each of these endpoints requires a set of parameters provided by the caller + +```go +// Params is used to carry in parameters from subscribing/requesting clients configuration +type Params struct { + IntermediateStateNodes bool + IntermediateStorageNodes bool + IncludeBlock bool + IncludeReceipts bool + IncludeTD bool + IncludeCode bool + WatchedAddresses []common.Address + WatchedStorageSlots []common.Hash +} +``` + +Using these params we can tell the service whether to include state and/or storage intermediate nodes; whether +to include the associated block (header, uncles, and transactions); whether to include the associated receipts; +whether to include the total difficult for this block; whether to include the set of code hashes and code for +contracts deployed in this block; whether to limit the diffing process to a list of specific addresses; and/or +whether to limit the diffing process to a list of specific storage slot keys. + +### Subscription endpoint +A websocket supporting RPC endpoint is exposed for subscribing to state diff `StateObjects` that come off the head of the chain while the geth node syncs. + +```go +// Stream is a subscription endpoint that fires off state diff payloads as they are created +Stream(ctx context.Context, params Params) (*rpc.Subscription, error) +``` + +To expose this endpoint the node needs to have the websocket server turned on (`--ws`), +and the `statediff` namespace exposed (`--ws.api=statediff`). + +Go code subscriptions to this endpoint can be created using the `rpc.Client.Subscribe()` method, +with the "statediff" namespace, a `statediff.Payload` channel, and the name of the statediff api's rpc method: "stream". + +e.g. + +```go + +cli, err := rpc.Dial("ipcPathOrWsURL") +if err != nil { + // handle error +} +stateDiffPayloadChan := make(chan statediff.Payload, 20000) +methodName := "stream" +params := statediff.Params{ + IncludeBlock: true, + IncludeTD: true, + IncludeReceipts: true, + IntermediateStorageNodes: true, + IntermediateStateNodes: true, +} +rpcSub, err := cli.Subscribe(context.Background(), statediff.APIName, stateDiffPayloadChan, methodName, params) +if err != nil { + // handle error +} +for { + select { + case stateDiffPayload := <- stateDiffPayloadChan: + // process the payload + case err := <- rpcSub.Err(): + // handle rpc subscription error + } +} +``` + +### Unary endpoints +The service also exposes unary RPC endpoints for retrieving the state diff `StateObject` for a specific block height/hash. +```go +// StateDiffAt returns a state diff payload at the specific blockheight +StateDiffAt(ctx context.Context, blockNumber uint64, params Params) (*Payload, error) + +// StateDiffFor returns a state diff payload for the specific blockhash +StateDiffFor(ctx context.Context, blockHash common.Hash, params Params) (*Payload, error) +``` + +To expose this endpoint the node needs to have the HTTP server turned on (`--http`), +and the `statediff` namespace exposed (`--http.api=statediff`). + +## Direct indexing into Postgres +If `--statediff.writing` is set, the service will convert the state diff `StateObject` data into IPLD objects, persist them directly to Postgres, +and generate secondary indexes around the IPLD data. + +The schema and migrations for this Postgres database are provided in `statediff/db/`. + +### Postgres setup +We use [pressly/goose](https://github.com/pressly/goose) as our Postgres migration manager. +You can also load the Postgres schema directly into a database using + +`psql database_name < schema.sql` + +This will only work on a version 12.4 Postgres database. + +### Schema overview +Our Postgres schemas are built around a single IPFS backing Postgres IPLD blockstore table (`public.blocks`) that conforms with [go-ds-sql](https://github.com/ipfs/go-ds-sql/blob/master/postgres/postgres.go). +All IPLD objects are stored in this table, where `key` is blockstore-prefixed multihash key for the IPLD object and `data` contains +the bytes for the IPLD block (in the case of all Ethereum IPLDs, this is the RLP byte encoding of the Ethereum object). + +The IPLD objects in this table can be traversed using an IPLD DAG interface, but since this table only maps multihash to raw IPLD object +it is not particularly useful for searching through the data or and does not allow us to look up Ethereum objects by their constituent fields +(e.g. by block number, tx source/recipient, state/storage trie node path). To improve the accessibility of these Ethereum IPLD objects +we generate secondary indexes on top of the raw IPLDs in other Postgres tables. This collection of tables encapsulates an Ethereum [advanced data layout](https://github.com/ipld/specs#schemas-and-advanced-data-layouts) (ADL). + +These secondary index tables fall under the `eth` schema and follow an `{objectType}_cids` naming convention. +Each of these tables provides a view into the individual fields of the underlying Ethereum IPLD object and references the raw IPLD object stored in `public.blocks` by multihash foreign key. +Additionally, these tables link up to their parent object tables. E.g. the `storage_cids` table contains a `state_id` foreign key which references the `id` +for the `state_cids` entry that contains the state leaf node for the contract the storage node belongs to, and in turn that `state_cids` entry contains a `header_id` +foreign key which references the `id` of the `header_cids` entry that contains the header for the block these state and storage nodes were updated (diffed). + +## Optimization +On mainnet this process is extremely IO intensive and requires significant resources to allow it to keep up with the head of the chain. +The state diff processing time for a specific block is dependent on the number and complexity of the state changes that occur in a block and +the number of updated state nodes that are available in the in-memory cache vs must be retrieved from disc. + +If memory permits, one means of improving the efficiency of this process is to increase the trie cache allocation. +This can be done by increasing the overall `--cache` allocation and/or by increasing the % of the cache allocated to trie +usage with `--cache.trie`. \ No newline at end of file