service documentation

This commit is contained in:
Ian Norden 2021-04-05 12:47:39 -05:00
parent d9d05695e6
commit 1328dad81d
3 changed files with 207 additions and 58 deletions

View File

@ -32,7 +32,6 @@ import (
sdtypes "github.com/ethereum/go-ethereum/statediff/types"
)
// TODO: add test that filters on address
var (
contractLeafKey []byte
emptyDiffs = make([]sdtypes.StateNode, 0)

View File

@ -1,57 +0,0 @@
// Copyright 2019 The go-ethereum Authors
// This file is part of the go-ethereum library.
//
// The go-ethereum library is free software: you can redistribute it and/or modify
// it under the terms of the GNU Lesser General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// The go-ethereum library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public License
// along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>.
/*
Package statediff provides an auxiliary service that processes state diff objects from incoming chain events,
relaying the objects to any rpc subscriptions.
This work is adapted from work by Charles Crain at https://github.com/jpmorganchase/quorum/blob/9b7fd9af8082795eeeb6863d9746f12b82dd5078/statediff/statediff.go
The service is spun up using the below CLI flags
--statediff: boolean flag, turns on the service
--statediff.streamblock: boolean flag, configures the service to associate and stream out the rest of the block data with the state diffs.
--statediff.intermediatenodes: boolean flag, tells service to include intermediate (branch and extension) nodes; default (false) processes leaf nodes only.
--statediff.watchedaddresses: string slice flag, used to limit the state diffing process to the given addresses. Usage: --statediff.watchedaddresses=addr1 --statediff.watchedaddresses=addr2 --statediff.watchedaddresses=addr3
If you wish to use the websocket endpoint to subscribe to the statediff service, be sure to open up the Websocket RPC server with the `--ws` flag. The IPC-RPC server is turned on by default.
The statediffing services works only with `--syncmode="full", but -importantly- does not require garbage collection to be turned off (does not require an archival node).
e.g.
$ ./geth --statediff --statediff.streamblock --ws --syncmode "full"
This starts up the geth node in full sync mode, starts up the statediffing service, and opens up the websocket endpoint to subscribe to the service.
Because the "streamblock" flag has been turned on, the service will strean out block data (headers, transactions, and receipts) along with the diffed state and storage leafs.
Rpc subscriptions to the service can be created using the rpc.Client.Subscribe() method,
with the "statediff" namespace, a statediff.Payload channel, and the name of the statediff api's rpc method- "stream".
e.g.
cli, _ := rpc.Dial("ipcPathOrWsURL")
stateDiffPayloadChan := make(chan statediff.Payload, 20000)
rpcSub, err := cli.Subscribe(context.Background(), "statediff", stateDiffPayloadChan, "stream"})
for {
select {
case stateDiffPayload := <- stateDiffPayloadChan:
processPayload(stateDiffPayload)
case err := <- rpcSub.Err():
log.Error(err)
}
}
*/
package statediff

207
statediff/doc.md Normal file
View File

@ -0,0 +1,207 @@
This package provides an auxiliary service that asynchronously processes state diff objects from chain events,
either relaying the state objects to rpc subscribers or writing them directly to Postgres.
It also exposes RPC endpoints for fetching or writing to Postgres the state diff `StateObject` at a specific block height
or for a specific block hash, this operates on historic block and state data and so is dependent on having a complete state archive.
# Statediff Object
A state diff `StateObject` is the collection of all the state and storage trie nodes that have been updated in a given block.
For convenience, we also associate these nodes with the block number and hash, and optionally the set of code hashes and code for any
contracts deployed in this block.
A complete state diff `StateObject` will include all state and storage intermediate nodes, which is necessary for generating proofs and for
traversing the tries.
```go
// StateObject is a collection of state (and linked storage nodes) as well as the associated block number, block hash,
// and a set of code hashes and their code
type StateObject struct {
BlockNumber *big.Int `json:"blockNumber" gencodec:"required"`
BlockHash common.Hash `json:"blockHash" gencodec:"required"`
Nodes []StateNode `json:"nodes" gencodec:"required"`
CodeAndCodeHashes []CodeAndCodeHash `json:"codeMapping"`
}
// StateNode holds the data for a single state diff node
type StateNode struct {
NodeType NodeType `json:"nodeType" gencodec:"required"`
Path []byte `json:"path" gencodec:"required"`
NodeValue []byte `json:"value" gencodec:"required"`
StorageNodes []StorageNode `json:"storage"`
LeafKey []byte `json:"leafKey"`
}
// StorageNode holds the data for a single storage diff node
type StorageNode struct {
NodeType NodeType `json:"nodeType" gencodec:"required"`
Path []byte `json:"path" gencodec:"required"`
NodeValue []byte `json:"value" gencodec:"required"`
LeafKey []byte `json:"leafKey"`
}
// CodeAndCodeHash struct for holding codehash => code mappings
// we can't use an actual map because they are not rlp serializable
type CodeAndCodeHash struct {
Hash common.Hash `json:"codeHash"`
Code []byte `json:"code"`
}
```
These objects are packed into a `Payload` structure which additionally associates the StateObject
with the block (header, uncles, and transactions), receipts, and total difficulty.
This `Payload` encapsulates all the block and state data at a given block, and allows us to index the entire Ethereum data structure
as hash-linked IPLD objects.
```go
// Payload packages the data to send to state diff subscriptions
type Payload struct {
BlockRlp []byte `json:"blockRlp"`
TotalDifficulty *big.Int `json:"totalDifficulty"`
ReceiptsRlp []byte `json:"receiptsRlp"`
StateObjectRlp []byte `json:"stateObjectRlp" gencodec:"required"`
encoded []byte
err error
}
```
# Usage
This state diffing service runs as an auxiliary service concurrent to the regular syncing process of the geth node.
## CLI configuration
This service introduces a CLI flag namespace `statediff`
`--statediff` flag is used to turn on the service
`--statediff.writing` is used to tell the service to write state diff objects it produces from synced ChainEvents directly to a configured Postgres database
`--statediff.db` is the connection string for the Postgres database to write to
`--statediff.dbnodeid` is the node id to use in the Postgres database
`--statediff.dbclientname` is the client name to use in the Postgres database
The service can only operate in full sync mode (`--syncmode=full`), but only the historical RPC endpoints require an archive node (`--gcmode=archive`)
e.g.
`
./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid={nodeId} --statediff.dbclientname={dbClientName}
`
## RPC endpoints
The state diffing service exposes both a WS subscription endpoint, and a number of HTTP unary endpoints.
Each of these endpoints requires a set of parameters provided by the caller
```go
// Params is used to carry in parameters from subscribing/requesting clients configuration
type Params struct {
IntermediateStateNodes bool
IntermediateStorageNodes bool
IncludeBlock bool
IncludeReceipts bool
IncludeTD bool
IncludeCode bool
WatchedAddresses []common.Address
WatchedStorageSlots []common.Hash
}
```
Using these params we can tell the service whether to include state and/or storage intermediate nodes; whether
to include the associated block (header, uncles, and transactions); whether to include the associated receipts;
whether to include the total difficult for this block; whether to include the set of code hashes and code for
contracts deployed in this block; whether to limit the diffing process to a list of specific addresses; and/or
whether to limit the diffing process to a list of specific storage slot keys.
### Subscription endpoint
A websocket supporting RPC endpoint is exposed for subscribing to state diff `StateObjects` that come off the head of the chain while the geth node syncs.
```go
// Stream is a subscription endpoint that fires off state diff payloads as they are created
Stream(ctx context.Context, params Params) (*rpc.Subscription, error)
```
To expose this endpoint the node needs to have the websocket server turned on (`--ws`),
and the `statediff` namespace exposed (`--ws.api=statediff`).
Go code subscriptions to this endpoint can be created using the `rpc.Client.Subscribe()` method,
with the "statediff" namespace, a `statediff.Payload` channel, and the name of the statediff api's rpc method: "stream".
e.g.
```go
cli, err := rpc.Dial("ipcPathOrWsURL")
if err != nil {
// handle error
}
stateDiffPayloadChan := make(chan statediff.Payload, 20000)
methodName := "stream"
params := statediff.Params{
IncludeBlock: true,
IncludeTD: true,
IncludeReceipts: true,
IntermediateStorageNodes: true,
IntermediateStateNodes: true,
}
rpcSub, err := cli.Subscribe(context.Background(), statediff.APIName, stateDiffPayloadChan, methodName, params)
if err != nil {
// handle error
}
for {
select {
case stateDiffPayload := <- stateDiffPayloadChan:
// process the payload
case err := <- rpcSub.Err():
// handle rpc subscription error
}
}
```
### Unary endpoints
The service also exposes unary RPC endpoints for retrieving the state diff `StateObject` for a specific block height/hash.
```go
// StateDiffAt returns a state diff payload at the specific blockheight
StateDiffAt(ctx context.Context, blockNumber uint64, params Params) (*Payload, error)
// StateDiffFor returns a state diff payload for the specific blockhash
StateDiffFor(ctx context.Context, blockHash common.Hash, params Params) (*Payload, error)
```
To expose this endpoint the node needs to have the HTTP server turned on (`--http`),
and the `statediff` namespace exposed (`--http.api=statediff`).
## Direct indexing into Postgres
If `--statediff.writing` is set, the service will convert the state diff `StateObject` data into IPLD objects, persist them directly to Postgres,
and generate secondary indexes around the IPLD data.
The schema and migrations for this Postgres database are provided in `statediff/db/`.
### Postgres setup
We use [pressly/goose](https://github.com/pressly/goose) as our Postgres migration manager.
You can also load the Postgres schema directly into a database using
`psql database_name < schema.sql`
This will only work on a version 12.4 Postgres database.
### Schema overview
Our Postgres schemas are built around a single IPFS backing Postgres IPLD blockstore table (`public.blocks`) that conforms with [go-ds-sql](https://github.com/ipfs/go-ds-sql/blob/master/postgres/postgres.go).
All IPLD objects are stored in this table, where `key` is blockstore-prefixed multihash key for the IPLD object and `data` contains
the bytes for the IPLD block (in the case of all Ethereum IPLDs, this is the RLP byte encoding of the Ethereum object).
The IPLD objects in this table can be traversed using an IPLD DAG interface, but since this table only maps multihash to raw IPLD object
it is not particularly useful for searching through the data or and does not allow us to look up Ethereum objects by their constituent fields
(e.g. by block number, tx source/recipient, state/storage trie node path). To improve the accessibility of these Ethereum IPLD objects
we generate secondary indexes on top of the raw IPLDs in other Postgres tables. This collection of tables encapsulates an Ethereum [advanced data layout](https://github.com/ipld/specs#schemas-and-advanced-data-layouts) (ADL).
These secondary index tables fall under the `eth` schema and follow an `{objectType}_cids` naming convention.
Each of these tables provides a view into the individual fields of the underlying Ethereum IPLD object and references the raw IPLD object stored in `public.blocks` by multihash foreign key.
Additionally, these tables link up to their parent object tables. E.g. the `storage_cids` table contains a `state_id` foreign key which references the `id`
for the `state_cids` entry that contains the state leaf node for the contract the storage node belongs to, and in turn that `state_cids` entry contains a `header_id`
foreign key which references the `id` of the `header_cids` entry that contains the header for the block these state and storage nodes were updated (diffed).
## Optimization
On mainnet this process is extremely IO intensive and requires significant resources to allow it to keep up with the head of the chain.
The state diff processing time for a specific block is dependent on the number and complexity of the state changes that occur in a block and
the number of updated state nodes that are available in the in-memory cache vs must be retrieved from disc.
If memory permits, one means of improving the efficiency of this process is to increase the trie cache allocation.
This can be done by increasing the overall `--cache` allocation and/or by increasing the % of the cache allocated to trie
usage with `--cache.trie`.