service documentation

2021-04-05 12:47:39 -05:00 · 2021-04-05 12:47:39 -05:00 · 1328dad81d
commit 1328dad81d
parent d9d05695e6
3 changed files with 207 additions and 58 deletions
--- a/statediff/builder_test.go
+++ b/statediff/builder_test.go
@ -32,7 +32,6 @@ import (
 	sdtypes "github.com/ethereum/go-ethereum/statediff/types"
 )

-// TODO: add test that filters on address
 var (
 	contractLeafKey                                        []byte
 	emptyDiffs                                             = make([]sdtypes.StateNode, 0)
--- a/statediff/doc.go
+++ b/statediff/doc.go
@ -1,57 +0,0 @@
-// Copyright 2019 The go-ethereum Authors
-// This file is part of the go-ethereum library.
-//
-// The go-ethereum library is free software: you can redistribute it and/or modify
-// it under the terms of the GNU Lesser General Public License as published by
-// the Free Software Foundation, either version 3 of the License, or
-// (at your option) any later version.
-//
-// The go-ethereum library is distributed in the hope that it will be useful,
-// but WITHOUT ANY WARRANTY; without even the implied warranty of
-// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-// GNU Lesser General Public License for more details.
-//
-// You should have received a copy of the GNU Lesser General Public License
-// along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>.
-
-/*
-Package statediff provides an auxiliary service that processes state diff objects from incoming chain events,
-relaying the objects to any rpc subscriptions.
-
-This work is adapted from work by Charles Crain at https://github.com/jpmorganchase/quorum/blob/9b7fd9af8082795eeeb6863d9746f12b82dd5078/statediff/statediff.go
-
-The service is spun up using the below CLI flags
--statediff: boolean flag, turns on the service
--statediff.streamblock: boolean flag, configures the service to associate and stream out the rest of the block data with the state diffs.
--statediff.intermediatenodes: boolean flag, tells service to include intermediate (branch and extension) nodes; default (false) processes leaf nodes only.
--statediff.watchedaddresses: string slice flag, used to limit the state diffing process to the given addresses. Usage: --statediff.watchedaddresses=addr1 --statediff.watchedaddresses=addr2 --statediff.watchedaddresses=addr3
-
-If you wish to use the websocket endpoint to subscribe to the statediff service, be sure to open up the Websocket RPC server with the `--ws` flag. The IPC-RPC server is turned on by default.
-
-The statediffing services works only with `--syncmode="full", but -importantly- does not require garbage collection to be turned off (does not require an archival node).
-
-e.g.
-
-$ ./geth --statediff --statediff.streamblock --ws --syncmode "full"
-
-This starts up the geth node in full sync mode, starts up the statediffing service, and opens up the websocket endpoint to subscribe to the service.
-Because the "streamblock" flag has been turned on, the service will strean out block data (headers, transactions, and receipts) along with the diffed state and storage leafs.
-
-Rpc subscriptions to the service can be created using the rpc.Client.Subscribe() method,
-with the "statediff" namespace, a statediff.Payload channel, and the name of the statediff api's rpc method- "stream".
-
-e.g.
-
-cli, _ := rpc.Dial("ipcPathOrWsURL")
-stateDiffPayloadChan := make(chan statediff.Payload, 20000)
-rpcSub, err := cli.Subscribe(context.Background(), "statediff", stateDiffPayloadChan, "stream"})
-for {
-	select {
-	case stateDiffPayload := <- stateDiffPayloadChan:
-		processPayload(stateDiffPayload)
-	case err := <- rpcSub.Err():
-		log.Error(err)
-	}
-}
-*/
-package statediff
--- a/statediff/doc.md
+++ b/statediff/doc.md
@ -0,0 +1,207 @@
+This package provides an auxiliary service that asynchronously processes state diff objects from chain events,
+either relaying the state objects to rpc subscribers or writing them directly to Postgres.
+
+It also exposes RPC endpoints for fetching or writing to Postgres the state diff `StateObject` at a specific block height
+or for a specific block hash, this operates on historic block and state data and so is dependent on having a complete state archive.
+
+# Statediff Object
+A state diff `StateObject` is the collection of all the state and storage trie nodes that have been updated in a given block.
+For convenience, we also associate these nodes with the block number and hash, and optionally the set of code hashes and code for any
+contracts deployed in this block.
+
+A complete state diff `StateObject` will include all state and storage intermediate nodes, which is necessary for generating proofs and for
+traversing the tries.
+
+```go
+// StateObject is a collection of state (and linked storage nodes) as well as the associated block number, block hash,
+// and a set of code hashes and their code
+type StateObject struct {
+	BlockNumber       *big.Int                `json:"blockNumber"     gencodec:"required"`
+	BlockHash         common.Hash             `json:"blockHash"       gencodec:"required"`
+	Nodes             []StateNode             `json:"nodes"           gencodec:"required"`
+	CodeAndCodeHashes []CodeAndCodeHash       `json:"codeMapping"`
+}
+
+// StateNode holds the data for a single state diff node
+type StateNode struct {
+    NodeType     NodeType      `json:"nodeType"        gencodec:"required"`
+    Path         []byte        `json:"path"            gencodec:"required"`
+    NodeValue    []byte        `json:"value"           gencodec:"required"`
+    StorageNodes []StorageNode `json:"storage"`
+    LeafKey      []byte        `json:"leafKey"`
+}
+
+// StorageNode holds the data for a single storage diff node
+type StorageNode struct {
+    NodeType  NodeType `json:"nodeType"        gencodec:"required"`
+    Path      []byte   `json:"path"            gencodec:"required"`
+    NodeValue []byte   `json:"value"           gencodec:"required"`
+    LeafKey   []byte   `json:"leafKey"`
+}
+
+// CodeAndCodeHash struct for holding codehash => code mappings
+// we can't use an actual map because they are not rlp serializable
+type CodeAndCodeHash struct {
+    Hash common.Hash `json:"codeHash"`
+    Code []byte      `json:"code"`
+}
+```
+These objects are packed into a `Payload` structure which additionally associates the StateObject
+with the block (header, uncles, and transactions), receipts, and total difficulty.
+This `Payload` encapsulates all the block and state data at a given block, and allows us to index the entire Ethereum data structure
+as hash-linked IPLD objects.
+
+```go
+// Payload packages the data to send to state diff subscriptions
+type Payload struct {
+	BlockRlp        []byte   `json:"blockRlp"`
+	TotalDifficulty *big.Int `json:"totalDifficulty"`
+	ReceiptsRlp     []byte   `json:"receiptsRlp"`
+	StateObjectRlp  []byte   `json:"stateObjectRlp"    gencodec:"required"`
+
+	encoded []byte
+	err     error
+}
+```
+
+# Usage
+This state diffing service runs as an auxiliary service concurrent to the regular syncing process of the geth node.
+
+
+## CLI configuration
+This service introduces a CLI flag namespace `statediff`
+
+`--statediff` flag is used to turn on the service
+`--statediff.writing` is used to tell the service to write state diff objects it produces from synced ChainEvents directly to a configured Postgres database
+`--statediff.db` is the connection string for the Postgres database to write to
+`--statediff.dbnodeid` is the node id to use in the Postgres database
+`--statediff.dbclientname` is the client name to use in the Postgres database
+
+The service can only operate in full sync mode (`--syncmode=full`), but only the historical RPC endpoints require an archive node (`--gcmode=archive`)
+
+e.g.
+`
+./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid={nodeId} --statediff.dbclientname={dbClientName}
+`
+
+## RPC endpoints
+The state diffing service exposes both a WS subscription endpoint, and a number of HTTP unary endpoints.
+
+Each of these endpoints requires a set of parameters provided by the caller
+
+```go
+// Params is used to carry in parameters from subscribing/requesting clients configuration
+type Params struct {
+	IntermediateStateNodes   bool
+	IntermediateStorageNodes bool
+	IncludeBlock             bool
+	IncludeReceipts          bool
+	IncludeTD                bool
+	IncludeCode              bool
+	WatchedAddresses         []common.Address
+	WatchedStorageSlots      []common.Hash
+}
+```
+
+Using these params we can tell the service whether to include state and/or storage intermediate nodes; whether
+to include the associated block (header, uncles, and transactions); whether to include the associated receipts;
+whether to include the total difficult for this block; whether to include the set of code hashes and code for
+contracts deployed in this block; whether to limit the diffing process to a list of specific addresses; and/or
+whether to limit the diffing process to a list of specific storage slot keys.
+
+### Subscription endpoint
+A websocket supporting RPC endpoint is exposed for subscribing to state diff `StateObjects` that come off the head of the chain while the geth node syncs.
+
+```go
+// Stream is a subscription endpoint that fires off state diff payloads as they are created
+Stream(ctx context.Context, params Params) (*rpc.Subscription, error)
+```
+
+To expose this endpoint the node needs to have the websocket server turned on (`--ws`),
+and the `statediff` namespace exposed (`--ws.api=statediff`).
+
+Go code subscriptions to this endpoint can be created using the `rpc.Client.Subscribe()` method,
+with the "statediff" namespace, a `statediff.Payload` channel, and the name of the statediff api's rpc method: "stream".
+
+e.g.
+
+```go
+
+cli, err := rpc.Dial("ipcPathOrWsURL")
+if err != nil {
+	// handle error
+}
+stateDiffPayloadChan := make(chan statediff.Payload, 20000)
+methodName := "stream"
+params := statediff.Params{
+    IncludeBlock:             true,
+    IncludeTD:                true,
+    IncludeReceipts:          true,
+    IntermediateStorageNodes: true,
+    IntermediateStateNodes:   true,
+}
+rpcSub, err := cli.Subscribe(context.Background(), statediff.APIName, stateDiffPayloadChan, methodName, params)
+if err != nil {
+	// handle error
+}
+for {
+	select {
+	case stateDiffPayload := <- stateDiffPayloadChan:
+            // process the payload
+        case err := <- rpcSub.Err():
+    	    // handle rpc subscription error
+        }
+}
+```
+
+### Unary endpoints
+The service also exposes unary RPC endpoints for retrieving the state diff `StateObject` for a specific block height/hash.
+```go
+// StateDiffAt returns a state diff payload at the specific blockheight
+StateDiffAt(ctx context.Context, blockNumber uint64, params Params) (*Payload, error)
+
+// StateDiffFor returns a state diff payload for the specific blockhash
+StateDiffFor(ctx context.Context, blockHash common.Hash, params Params) (*Payload, error)
+```
+
+To expose this endpoint the node needs to have the HTTP server turned on (`--http`),
+and the `statediff` namespace exposed (`--http.api=statediff`).
+
+## Direct indexing into Postgres
+If `--statediff.writing` is set, the service will convert the state diff `StateObject` data into IPLD objects, persist them directly to Postgres,
+and generate secondary indexes around the IPLD data.
+
+The schema and migrations for this Postgres database are provided in `statediff/db/`.
+
+### Postgres setup
+We use [pressly/goose](https://github.com/pressly/goose) as our Postgres migration manager.
+You can also load the Postgres schema directly into a database using
+
+`psql database_name < schema.sql`
+
+This will only work on a version 12.4 Postgres database.
+
+### Schema overview 
+Our Postgres schemas are built around a single IPFS backing Postgres IPLD blockstore table (`public.blocks`) that conforms with [go-ds-sql](https://github.com/ipfs/go-ds-sql/blob/master/postgres/postgres.go).
+All IPLD objects are stored in this table, where `key` is blockstore-prefixed multihash key for the IPLD object and `data` contains
+the bytes for the IPLD block (in the case of all Ethereum IPLDs, this is the RLP byte encoding of the Ethereum object).
+
+The IPLD objects in this table can be traversed using an IPLD DAG interface, but since this table only maps multihash to raw IPLD object
+it is not particularly useful for searching through the data or and does not allow us to look up Ethereum objects by their constituent fields
+(e.g. by block number, tx source/recipient, state/storage trie node path). To improve the accessibility of these Ethereum IPLD objects
+we generate secondary indexes on top of the raw IPLDs in other Postgres tables. This collection of tables encapsulates an Ethereum [advanced data layout](https://github.com/ipld/specs#schemas-and-advanced-data-layouts) (ADL).
+
+These secondary index tables fall under the `eth` schema and follow an `{objectType}_cids` naming convention.
+Each of these tables provides a view into the individual fields of the underlying Ethereum IPLD object and references the raw IPLD object stored in `public.blocks` by multihash foreign key.
+Additionally, these tables link up to their parent object tables. E.g. the `storage_cids` table contains a `state_id` foreign key which references the `id`
+for the `state_cids` entry that contains the state leaf node for the contract the storage node belongs to, and in turn that `state_cids` entry contains a `header_id`
+foreign key which references the `id` of the `header_cids` entry that contains the header for the block these state and storage nodes were updated (diffed).
+
+## Optimization
+On mainnet this process is extremely IO intensive and requires significant resources to allow it to keep up with the head of the chain.
+The state diff processing time for a specific block is dependent on the number and complexity of the state changes that occur in a block and
+the number of updated state nodes that are available in the in-memory cache vs must be retrieved from disc.
+
+If memory permits, one means of improving the efficiency of this process is to increase the trie cache allocation.
+This can be done by increasing the overall `--cache` allocation and/or by increasing the % of the cache allocated to trie
+usage with `--cache.trie`.