1162162c0a
* Write state diff to CSV (#2)
* port statediff from 9b7fd9af80/statediff/statediff.go
; minor fixes
* integrating state diff extracting, building, and persisting into geth processes
* work towards persisting created statediffs in ipfs; based off github.com/vulcanize/eth-block-extractor
* Add a state diff service
* Remove diff extractor from blockchain
* Update imports
* Move statediff on/off check to geth cmd config
* Update starting state diff service
* Add debugging logs for creating diff
* Add statediff extractor and builder tests and small refactoring
* Start to write statediff to a CSV
* Restructure statediff directory
* Pull CSV publishing methods into their own file
* Reformatting due to go fmt
* Add gomega to vendor dir
* Remove testing focuses
* Update statediff tests to use golang test pkg
instead of ginkgo
- builder_test
- extractor_test
- publisher_test
* Use hexutil.Encode instead of deprecated common.ToHex
* Remove OldValue from DiffBigInt and DiffUint64 fields
* Update builder test
* Remove old storage value from updated accounts
* Remove old values from created/deleted accounts
* Update publisher to account for only storing current account values
* Update service loop and fetching previous block
* Update testing
- remove statediff ginkgo test suite file
- move mocks to their own dir
* Updates per go fmt
* Updates to tests
* Pass statediff mode and path in through cli
* Return filename from publisher
* Remove some duplication in builder
* Remove code field from state diff output
this is the contract byte code, and it can still be obtained by querying
the db by the codeHash
* Consolidate acct diff structs for updated & updated/deleted accts
* Include block number in csv filename
* Clean up error logging
* Cleanup formatting, spelling, etc
* Address PR comments
* Add contract address and storage value to csv
* Refactor accumulating account row in csv publisher
* Add DiffStorage struct
* Add storage key to csv
* Address PR comments
* Fix publisher to include rows for accounts that don't have store updates
* Update builder test after merging in release/1.8
* Update test contract to include storage on contract intialization
- so that we're able to test that storage diffing works for created and
deleted accounts (not just updated accounts).
* Factor out a common trie iterator method in builder
* Apply goimports to statediff
* Apply gosimple changes to statediff
* Gracefully exit geth command(#4)
* Statediff for full node (#6)
* Open a trie from the in-memory database
* Use a node's LeafKey as an identifier instead of the address
It was proving difficult to find look the address up from a given path
with a full node (sometimes the value wouldn't exist in the disk db).
So, instead, for now we are using the node's LeafKey with is a Keccak256
hash of the address, so if we know the address we can figure out which
LeafKey it matches up to.
* Make sure that statediff has been processed before pruning
* Use blockchain stateCache.OpenTrie for storage diffs
* Clean up log lines and remove unnecessary fields from builder
* Apply go fmt changes
* Add a sleep to the blockchain test
* Address PR comments
* Address PR comments
* refactoring/reorganizing packages
* refactoring statediff builder and types and adjusted to relay proofs and paths (still need to make this optional)
* refactoring state diff service and adding api which allows for streaming state diff payloads over an rpc websocket subscription
* make proofs and paths optional + compress service loop into single for loop (may be missing something here)
* option to process intermediate nodes
* make state diff rlp serializable
* cli parameter to limit statediffing to select account addresses + test
* review fixes and fixes for issues ran into in integration
* review fixes; proper method signature for api; adjust service so that statediff processing is halted/paused until there is at least one subscriber listening for the results
* adjust buffering to improve stability; doc.go; fix notifier
err handling
* relay receipts with the rest of the data + review fixes/changes
* rpc method to get statediff at specific block; requires archival node or the block be within the pruning range
* review fixes
* fixes after rebase
* statediff verison meta
* fix linter issues
* include total difficulty to the payload
* fix state diff builder: emit actual leaf nodes instead of value nodes; diff on the leaf not on the value; emit correct path for intermediate nodes
* adjust statediff builder tests to changes and extend to test intermediate nodes; golint
* add genesis block to test; handle block 0 in StateDiffAt
* rlp files for mainnet blocks 0-3, for tests
* builder test on mainnet blocks
* common.BytesToHash(path) => crypto.Keaccak256(hash) in builder; BytesToHash produces same hash for e.g. []byte{} and []byte{\x00} - prefix \x00 steps are inconsequential to the hash result
* complete tests for early mainnet blocks
* diff type for representing deleted accounts
* fix builder so that we handle account deletions properly and properly diff storage when an account is moved to a new path; update params
* remove cli params; moving them to subscriber defined
* remove unneeded bc methods
* update service and api; statediffing params are now defined by user through api rather than by service provider by cli
* update top level tests
* add ability to watch specific storage slots (leaf keys) only
* comments; explain logic
* update mainnet blocks test
* update api_test.go
* storage leafkey filter test
* cleanup chain maker
* adjust chain maker for tests to add an empty account in block1 and switch to EIP-158 afterwards (now we just need to generate enough accounts until one causes the empty account to be touched and removed post-EIP-158 so we can simulate and test that process...); also added 2 new blocks where more contract storage is set and old slots are set to zero so they are removed so we can test that
* found an account whose creation causes the empty account to be moved to a new path; this should count as 'touching; the empty account and cause it to be removed according to eip-158... but it doesn't
* use new contract in unit tests that has self-destruct ability, so we can test eip-158 since simply moving an account to new path doesn't count as 'touchin' it
* handle storage deletions
* tests for eip-158 account removal and storage value deletions; there is one edge case left to test where we remove 1 account when only two exist such that the remaining account is moved up and replaces the root branch node
* finish testing known edge cases
* add endpoint to fetch all state and storage nodes at a given blockheight; useful for generating a recent atate cache/snapshot that we can diff forward from rather than needing to collect all diffs from genesis
* test for state trie builder
* minor changes/fixes
* update version meta
* if statediffing is on, lock tries in triedb until the statediffing service signals they are done using them
* update version meta
* fix mock blockchain; golint; bump patch
* increase maxRequestContentLength; bump patch
* log the sizes of the state objects we are sending
* CI build (#20)
* CI: run build on PR and on push to master
* CI: debug building geth
* CI: fix coping file
* CI: fix coping file v2
* CI: temporary upload file to release asset
* CI: get release upload_url by tag, upload asset to current relase
* CI: fix tag name
* fix ci build on statediff_at_anyblock-1.9.11 branch
* fix publishing assets in release
* bump version meta
* use context deadline for timeout in eth_call
* collect and emit codehash=>code mappings for state objects
* subscription endpoint for retrieving all the codehash=>code mappings that exist at provided height
* bump version meta
* Implement WriteStateDiffAt
* Writes state diffs directly to postgres
* Adds CLI flags to configure PG
* Refactors builder output with callbacks
* Copies refactored postgres handling code from ipld-eth-indexer
* rename PostgresCIDWriter.{index->upsert}*
* less ambiguous
* go.mod update
* rm unused
* cleanup
* output code & codehash iteratively
* had to rf some types for this
* prometheus metrics output
* duplicate recent eth-indexer changes
* migrations and metrics...
* [wip] prom.Init() here? another CLI flag?
* cleanup
* tidy & DRY
* statediff WriteLoop service + CLI flag
* [wip] update test mocks
* todo - do something meaningful to test write loop
* logging
* use geth log
* port tests to go testing
* drop ginkgo/gomega
* fix and cleanup tests
* fail before defer statement
* delete vendor/ dir
* unused
* bump version meta
* fixes after rebase onto 1.9.23
* bump version meta
* fix API registration
* bump version meta
* use golang 1.15.5 version (#34)
* bump version meta; add 0.0.11 branch to actions
* bump version meta; update github actions workflows
* statediff: refactor metrics
* Remove redundant statediff/indexer/prom tooling and use existing
prometheus integration.
* cleanup
* "indexer" namespace for metrics
* add reporting loop for db metrics
* doc
* metrics for statediff stats
* metrics namespace/subsystem = statediff/{indexer,service}
* statediff: use a worker pool (for direct writes)
* fix test
* fix chain event subscription
* log tweaks
* func name
* unused import
* intermediate chain event channel for metrics
* cleanup
* bump version meta
* update github actions; linting
* add poststate and status to receipt ipld indexes
* bump statediff version
* stateDiffFor endpoints for fetching or writing statediff object by blockhash; bump statediff version
* fixes after rebase on to v1.10.1
* update github actions and version meta; go fmt
* add leaf key to removed 'nodes'
* include Postgres migrations and schema
* service documentation
* touching up
215 lines
11 KiB
Markdown
215 lines
11 KiB
Markdown
# Statediff
|
|
|
|
This package provides an auxiliary service that asynchronously processes state diff objects from chain events,
|
|
either relaying the state objects to RPC subscribers or writing them directly to Postgres as IPLD objects.
|
|
|
|
It also exposes RPC endpoints for fetching or writing to Postgres the state diff at a specific block height
|
|
or for a specific block hash, this operates on historical block and state data and so depends on a complete state archive.
|
|
|
|
Data is emitted in this differential format in order to make it feasible to IPLD-ize and index the *entire* Ethereum state
|
|
(including intermediate state and storage trie nodes). If this state diff process is ran continuously from genesis,
|
|
the entire state at any block can be materialized from the cumulative differentials up to that point.
|
|
|
|
## Statediff object
|
|
A state diff `StateObject` is the collection of all the state and storage trie nodes that have been updated in a given block.
|
|
For convenience, we also associate these nodes with the block number and hash, and optionally the set of code hashes and code for any
|
|
contracts deployed in this block.
|
|
|
|
A complete state diff `StateObject` will include all state and storage intermediate nodes, which is necessary for generating proofs and for
|
|
traversing the tries.
|
|
|
|
```go
|
|
// StateObject is a collection of state (and linked storage nodes) as well as the associated block number, block hash,
|
|
// and a set of code hashes and their code
|
|
type StateObject struct {
|
|
BlockNumber *big.Int `json:"blockNumber" gencodec:"required"`
|
|
BlockHash common.Hash `json:"blockHash" gencodec:"required"`
|
|
Nodes []StateNode `json:"nodes" gencodec:"required"`
|
|
CodeAndCodeHashes []CodeAndCodeHash `json:"codeMapping"`
|
|
}
|
|
|
|
// StateNode holds the data for a single state diff node
|
|
type StateNode struct {
|
|
NodeType NodeType `json:"nodeType" gencodec:"required"`
|
|
Path []byte `json:"path" gencodec:"required"`
|
|
NodeValue []byte `json:"value" gencodec:"required"`
|
|
StorageNodes []StorageNode `json:"storage"`
|
|
LeafKey []byte `json:"leafKey"`
|
|
}
|
|
|
|
// StorageNode holds the data for a single storage diff node
|
|
type StorageNode struct {
|
|
NodeType NodeType `json:"nodeType" gencodec:"required"`
|
|
Path []byte `json:"path" gencodec:"required"`
|
|
NodeValue []byte `json:"value" gencodec:"required"`
|
|
LeafKey []byte `json:"leafKey"`
|
|
}
|
|
|
|
// CodeAndCodeHash struct for holding codehash => code mappings
|
|
// we can't use an actual map because they are not rlp serializable
|
|
type CodeAndCodeHash struct {
|
|
Hash common.Hash `json:"codeHash"`
|
|
Code []byte `json:"code"`
|
|
}
|
|
```
|
|
These objects are packed into a `Payload` structure which can additionally associate the `StateObject`
|
|
with the block (header, uncles, and transactions), receipts, and total difficulty.
|
|
This `Payload` encapsulates all of the differential data at a given block, and allows us to index the entire Ethereum data structure
|
|
as hash-linked IPLD objects.
|
|
|
|
```go
|
|
// Payload packages the data to send to state diff subscriptions
|
|
type Payload struct {
|
|
BlockRlp []byte `json:"blockRlp"`
|
|
TotalDifficulty *big.Int `json:"totalDifficulty"`
|
|
ReceiptsRlp []byte `json:"receiptsRlp"`
|
|
StateObjectRlp []byte `json:"stateObjectRlp" gencodec:"required"`
|
|
|
|
encoded []byte
|
|
err error
|
|
}
|
|
```
|
|
|
|
## Usage
|
|
This state diffing service runs as an auxiliary service concurrent to the regular syncing process of the geth node.
|
|
|
|
|
|
### CLI configuration
|
|
This service introduces a CLI flag namespace `statediff`
|
|
|
|
`--statediff` flag is used to turn on the service
|
|
`--statediff.writing` is used to tell the service to write state diff objects it produces from synced ChainEvents directly to a configured Postgres database
|
|
`--statediff.db` is the connection string for the Postgres database to write to
|
|
`--statediff.dbnodeid` is the node id to use in the Postgres database
|
|
`--statediff.dbclientname` is the client name to use in the Postgres database
|
|
|
|
The service can only operate in full sync mode (`--syncmode=full`), but only the historical RPC endpoints require an archive node (`--gcmode=archive`)
|
|
|
|
e.g.
|
|
`
|
|
./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid={nodeId} --statediff.dbclientname={dbClientName}
|
|
`
|
|
|
|
### RPC endpoints
|
|
The state diffing service exposes both a WS subscription endpoint, and a number of HTTP unary endpoints.
|
|
|
|
Each of these endpoints requires a set of parameters provided by the caller
|
|
|
|
```go
|
|
// Params is used to carry in parameters from subscribing/requesting clients configuration
|
|
type Params struct {
|
|
IntermediateStateNodes bool
|
|
IntermediateStorageNodes bool
|
|
IncludeBlock bool
|
|
IncludeReceipts bool
|
|
IncludeTD bool
|
|
IncludeCode bool
|
|
WatchedAddresses []common.Address
|
|
WatchedStorageSlots []common.Hash
|
|
}
|
|
```
|
|
|
|
Using these params we can tell the service whether to include state and/or storage intermediate nodes; whether
|
|
to include the associated block (header, uncles, and transactions); whether to include the associated receipts;
|
|
whether to include the total difficulty for this block; whether to include the set of code hashes and code for
|
|
contracts deployed in this block; whether to limit the diffing process to a list of specific addresses; and/or
|
|
whether to limit the diffing process to a list of specific storage slot keys.
|
|
|
|
#### Subscription endpoint
|
|
A websocket supporting RPC endpoint is exposed for subscribing to state diff `StateObjects` that come off the head of the chain while the geth node syncs.
|
|
|
|
```go
|
|
// Stream is a subscription endpoint that fires off state diff payloads as they are created
|
|
Stream(ctx context.Context, params Params) (*rpc.Subscription, error)
|
|
```
|
|
|
|
To expose this endpoint the node needs to have the websocket server turned on (`--ws`),
|
|
and the `statediff` namespace exposed (`--ws.api=statediff`).
|
|
|
|
Go code subscriptions to this endpoint can be created using the `rpc.Client.Subscribe()` method,
|
|
with the "statediff" namespace, a `statediff.Payload` channel, and the name of the statediff api's rpc method: "stream".
|
|
|
|
e.g.
|
|
|
|
```go
|
|
|
|
cli, err := rpc.Dial("ipcPathOrWsURL")
|
|
if err != nil {
|
|
// handle error
|
|
}
|
|
stateDiffPayloadChan := make(chan statediff.Payload, 20000)
|
|
methodName := "stream"
|
|
params := statediff.Params{
|
|
IncludeBlock: true,
|
|
IncludeTD: true,
|
|
IncludeReceipts: true,
|
|
IntermediateStorageNodes: true,
|
|
IntermediateStateNodes: true,
|
|
}
|
|
rpcSub, err := cli.Subscribe(context.Background(), statediff.APIName, stateDiffPayloadChan, methodName, params)
|
|
if err != nil {
|
|
// handle error
|
|
}
|
|
for {
|
|
select {
|
|
case stateDiffPayload := <- stateDiffPayloadChan:
|
|
// process the payload
|
|
case err := <- rpcSub.Err():
|
|
// handle rpc subscription error
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Unary endpoints
|
|
The service also exposes unary RPC endpoints for retrieving the state diff `StateObject` for a specific block height/hash.
|
|
```go
|
|
// StateDiffAt returns a state diff payload at the specific blockheight
|
|
StateDiffAt(ctx context.Context, blockNumber uint64, params Params) (*Payload, error)
|
|
|
|
// StateDiffFor returns a state diff payload for the specific blockhash
|
|
StateDiffFor(ctx context.Context, blockHash common.Hash, params Params) (*Payload, error)
|
|
```
|
|
|
|
To expose this endpoint the node needs to have the HTTP server turned on (`--http`),
|
|
and the `statediff` namespace exposed (`--http.api=statediff`).
|
|
|
|
### Direct indexing into Postgres
|
|
If `--statediff.writing` is set, the service will convert the state diff `StateObject` data into IPLD objects, persist them directly to Postgres,
|
|
and generate secondary indexes around the IPLD data.
|
|
|
|
The schema and migrations for this Postgres database are provided in `statediff/db/`.
|
|
|
|
#### Postgres setup
|
|
We use [pressly/goose](https://github.com/pressly/goose) as our Postgres migration manager.
|
|
You can also load the Postgres schema directly into a database using
|
|
|
|
`psql database_name < schema.sql`
|
|
|
|
This will only work on a version 12.4 Postgres database.
|
|
|
|
#### Schema overview
|
|
Our Postgres schemas are built around a single IPFS backing Postgres IPLD blockstore table (`public.blocks`) that conforms with [go-ds-sql](https://github.com/ipfs/go-ds-sql/blob/master/postgres/postgres.go).
|
|
All IPLD objects are stored in this table, where `key` is the blockstore-prefixed multihash key for the IPLD object and `data` contains
|
|
the bytes for the IPLD block (in the case of all Ethereum IPLDs, this is the RLP byte encoding of the Ethereum object).
|
|
|
|
The IPLD objects in this table can be traversed using an IPLD DAG interface, but since this table only maps multihash to raw IPLD object
|
|
it is not particularly useful for searching through the data by looking up Ethereum objects by their constituent fields
|
|
(e.g. by block number, tx source/recipient, state/storage trie node path). To improve the accessibility of these objects
|
|
we create an Ethereum [advanced data layout](https://github.com/ipld/specs#schemas-and-advanced-data-layouts) (ADL) by generating secondary
|
|
indexes on top of the raw IPLDs in other Postgres tables.
|
|
|
|
These secondary index tables fall under the `eth` schema and follow an `{objectType}_cids` naming convention.
|
|
These tables provide a view into individual fields of the underlying Ethereum IPLD objects, allowing lookups on these fields, and reference the raw IPLD objects stored in `public.blocks`
|
|
by foreign keys to their multihash keys.
|
|
Additionally, these tables maintain the hash-linked nature of Ethereum objects to one another. E.g. a storage trie node entry in the `storage_cids`
|
|
table contains a `state_id` foreign key which references the `id` for the `state_cids` entry that contains the state leaf node for the contract that storage node belongs to,
|
|
and in turn that `state_cids` entry contains a `header_id` foreign key which references the `id` of the `header_cids` entry that contains the header for the block these state and storage nodes were updated (diffed).
|
|
|
|
### Optimization
|
|
On mainnet this process is extremely IO intensive and requires significant resources to allow it to keep up with the head of the chain.
|
|
The state diff processing time for a specific block is dependent on the number and complexity of the state changes that occur in a block and
|
|
the number of updated state nodes that are available in the in-memory cache vs must be retrieved from disc.
|
|
|
|
If memory permits, one means of improving the efficiency of this process is to increase the in-memory trie cache allocation.
|
|
This can be done by increasing the overall `--cache` allocation and/or by increasing the % of the cache allocated to trie
|
|
usage with `--cache.trie`. |