# ipld-eth-state-snapshot

> Tool for extracting the entire Ethereum state at a particular block height from LevelDB into Postgres-backed IPFS

[![Go Report Card](https://goreportcard.com/badge/github.com/vulcanize/ipld-eth-state-snapshot)](https://goreportcard.com/report/github.com/vulcanize/ipld-eth-state-snapshot)

## Setup

* Build the binary:

    ```bash
    make build
    ```

## Configuration

Config format:

```toml
[snapshot]
    mode         = "file"           # indicates output mode <postgres | file>
    workers      = 4                # degree of concurrency; the state trie is subdivided into sections that are traversed and processed concurrently
    blockHeight  = -1               # blockheight to perform the snapshot at (-1 indicates to use the latest blockheight found in leveldb)
    recoveryFile = "recovery_file"  # specifies a file to output recovery information on error or premature closure
    accounts = []                   # list of accounts (addresses) to take the snapshot for # SNAPSHOT_ACCOUNTS

[leveldb]
    # path to geth leveldb
    path    = "/Users/user/Library/Ethereum/geth/chaindata"         # LVL_DB_PATH
    # path to geth ancient database
    ancient = "/Users/user/Library/Ethereum/geth/chaindata/ancient" # ANCIENT_DB_PATH

[database]
    # when operating in 'postgres' output mode
    # db credentials
    name     = "vulcanize_public"   # DATABASE_NAME
    hostname = "localhost"          # DATABASE_HOSTNAME
    port     = 5432                 # DATABASE_PORT
    user     = "postgres"           # DATABASE_USER
    password = ""                   # DATABASE_PASSWORD

[file]
    # when operating in 'file' output mode
    # directory the CSV files are written to
    outputDir = "output_dir/"   # FILE_OUTPUT_DIR

[log]
    level = "info"      # log level (trace, debug, info, warn, error, fatal, panic) (default: info)
    file  = "log_file"  # file path for logging, leave unset to log to stdout

[prom]
    # prometheus metrics
    metrics  = true         # enable prometheus metrics         (default: false)
    http     = true         # enable prometheus http service    (default: false)
    httpAddr = "0.0.0.0"    # prometheus http host              (default: 127.0.0.1)
    httpPort = 9101         # prometheus http port              (default: 8086)
    dbStats  = true         # enable prometheus db stats        (default: false)

[ethereum]
    # node info
    clientName   = "Geth"   # ETH_CLIENT_NAME
    nodeID       = "arch1"  # ETH_NODE_ID
    networkID    = "1"      # ETH_NETWORK_ID
    chainID      = "1"      # ETH_CHAIN_ID
    genesisBlock = "0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3" # ETH_GENESIS_BLOCK
```
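
The trailing comments in the sample config (`LVL_DB_PATH`, `DATABASE_NAME`, and so on) name the environment variable that corresponds to each key. A minimal sketch of setting them from a shell follows; the function name `export_snapshot_env` is hypothetical, the values mirror the sample config, and how the variables interact with values already present in the TOML file is an assumption to verify against your build.

```bash
# Hypothetical helper: exports the environment variables named in the
# comments of the sample config. Values are copied from that sample.
# Assumption: the tool reads these variables in place of the config keys.
export_snapshot_env() {
    export LVL_DB_PATH="/Users/user/Library/Ethereum/geth/chaindata"
    export ANCIENT_DB_PATH="$LVL_DB_PATH/ancient"
    export DATABASE_NAME="vulcanize_public"
    export DATABASE_HOSTNAME="localhost"
    export DATABASE_PORT="5432"
    export DATABASE_USER="postgres"
    export DATABASE_PASSWORD=""
    export FILE_OUTPUT_DIR="output_dir/"
}
```

Call `export_snapshot_env` in the shell session before running the binary so the variables are present in its environment.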

## Usage

* To take a state snapshot from LevelDB:

    ```bash
    ./ipld-eth-state-snapshot stateSnapshot --config={path to toml config file}
    ```

    * Account-selective snapshot: to restrict the snapshot to a list of accounts (addresses), provide the addresses in the config parameter `snapshot.accounts` or the env variable `SNAPSHOT_ACCOUNTS`. Only nodes related to the provided addresses are indexed.

        Example:

        ```toml
        [snapshot]
            accounts = [
                "0x825a6eec09e44Cb0fa19b84353ad0f7858d7F61a"
            ]
        ```

* To take an in-place snapshot in the database:

    ```bash
    ./ipld-eth-state-snapshot inPlaceStateSnapshot --config={path to toml config file}
    ```

## Monitoring

* Enable metrics using the config parameters `prom.metrics` and `prom.http`.
* `ipld-eth-state-snapshot` exposes the following Prometheus metrics at the `/metrics` endpoint:
    * `state_node_count`: Number of state nodes processed.
    * `storage_node_count`: Number of storage nodes processed.
    * `code_node_count`: Number of code nodes processed.
    * DB stats if operating in `postgres` mode.
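
To watch progress from a terminal, the node counters can be filtered out of the metrics output. This is a minimal sketch: the function name `snapshot_counts` is hypothetical, and it matches on a substring because the exported metric names may carry a namespace prefix that is not shown here.

```bash
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints only the samples for the three node counters listed above.
snapshot_counts() {
    # Drop "# HELP" / "# TYPE" comment lines, then keep matching samples.
    grep -v '^#' | grep -E '(state|storage|code)_node_count'
}
```

With the `httpPort` from the sample config, a call could look like `curl -s http://localhost:9101/metrics | snapshot_counts`; adjust host and port to your `prom.httpAddr` and `prom.httpPort`.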

## Tests

* Run unit tests:

    ```bash
    # setup db
    docker-compose up -d

    # run tests after db migrations are run
    make dbtest

    # tear down db
    docker-compose down -v --remove-orphans
    ```

## Import output data in file mode into a database

* When `ipld-eth-state-snapshot stateSnapshot` is run in file mode (`snapshot.mode = "file"`), the output is written as CSV files.

* The steps below assume the output files are located in the host's `./output_dir` directory.

* Data post-processing:

    * Create a directory to store post-processed output:

        ```bash
        mkdir -p output_dir/processed_output
        ```

    * Combine the output from multiple workers and copy it to the post-processed output directory:

        ```bash
        # public.blocks
        cat {output_dir,output_dir/*}/public.blocks.csv > output_dir/processed_output/combined-public.blocks.csv

        # eth.state_cids
        cat output_dir/*/eth.state_cids.csv > output_dir/processed_output/combined-eth.state_cids.csv

        # eth.storage_cids
        cat output_dir/*/eth.storage_cids.csv > output_dir/processed_output/combined-eth.storage_cids.csv

        # public.nodes
        cp output_dir/public.nodes.csv output_dir/processed_output/public.nodes.csv

        # eth.header_cids
        cp output_dir/eth.header_cids.csv output_dir/processed_output/eth.header_cids.csv
        ```

    * De-duplicate data:

        ```bash
        # public.blocks
        sort -u output_dir/processed_output/combined-public.blocks.csv -o output_dir/processed_output/deduped-combined-public.blocks.csv

        # eth.header_cids
        sort -u output_dir/processed_output/eth.header_cids.csv -o output_dir/processed_output/deduped-eth.header_cids.csv

        # eth.state_cids
        sort -u output_dir/processed_output/combined-eth.state_cids.csv -o output_dir/processed_output/deduped-combined-eth.state_cids.csv

        # eth.storage_cids
        sort -u output_dir/processed_output/combined-eth.storage_cids.csv -o output_dir/processed_output/deduped-combined-eth.storage_cids.csv
        ```

* Copy the `processed_output` directory to the DB server (say, under `/output_dir`); `COPY ... FROM '<file>'` reads files from the server's filesystem.

* Start `psql` to run the import commands:

    ```bash
    psql -U <DATABASE_USER> -h <DATABASE_HOSTNAME> -p <DATABASE_PORT> <DATABASE_NAME>
    ```

* Run the following at the `psql` prompt to import the data:

    ```sql
    -- public.nodes
    COPY public.nodes FROM '/output_dir/processed_output/public.nodes.csv' CSV;

    -- public.blocks
    COPY public.blocks FROM '/output_dir/processed_output/deduped-combined-public.blocks.csv' CSV;

    -- eth.header_cids
    COPY eth.header_cids FROM '/output_dir/processed_output/deduped-eth.header_cids.csv' CSV;

    -- eth.state_cids
    COPY eth.state_cids FROM '/output_dir/processed_output/deduped-combined-eth.state_cids.csv' CSV FORCE NOT NULL state_leaf_key;

    -- eth.storage_cids
    COPY eth.storage_cids FROM '/output_dir/processed_output/deduped-combined-eth.storage_cids.csv' CSV FORCE NOT NULL storage_leaf_key;
    ```

* NOTE: In CSV mode, `COPY` reads unquoted empty values as `NULL`. Passing `FORCE NOT NULL <COLUMN_NAME>` makes it read them as empty strings instead. This is required to keep the imported snapshot data compatible with the data generated by statediffing. Reference: https://www.postgresql.org/docs/14/sql-copy.html
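
The combine and de-duplicate steps from the post-processing part of this section can also be collected into one shell function. This is a sketch under the same directory layout the commands above assume; the function name `postprocess_snapshot` is hypothetical. It skips the intermediate `combined-*` files and writes the `deduped-*` files directly, and it runs `sort` under `LC_ALL=C` so that only byte-identical rows are collapsed.

```bash
# Hypothetical helper: combines per-worker CSVs and de-duplicates them.
# Usage: postprocess_snapshot [output_dir]
postprocess_snapshot() {
    local out="${1:-output_dir}"
    local dest="$out/processed_output"
    mkdir -p "$dest"

    # public.blocks: top-level file plus one file per worker subdirectory.
    cat "$out"/public.blocks.csv "$out"/*/public.blocks.csv \
        | LC_ALL=C sort -u > "$dest/deduped-combined-public.blocks.csv"

    # eth.state_cids and eth.storage_cids: one file per worker subdirectory.
    local table
    for table in eth.state_cids eth.storage_cids; do
        cat "$out"/*/"$table.csv" \
            | LC_ALL=C sort -u > "$dest/deduped-combined-$table.csv"
    done

    # eth.header_cids: single top-level file, de-duplicated.
    LC_ALL=C sort -u -o "$dest/deduped-eth.header_cids.csv" "$out/eth.header_cids.csv"

    # public.nodes: single top-level file, copied as-is.
    cp "$out/public.nodes.csv" "$dest/public.nodes.csv"
}
```

Running `postprocess_snapshot output_dir` produces the same file names that the `COPY` commands above read from `processed_output`.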

### Troubleshooting

* Run the following command to find any rows (in data dumps from `file` mode) that have an unexpected number of columns:

    ```bash
    ./scripts/find-bad-rows.sh -i <input-file> -c <expected-columns> -o [output-file] -d true
    ```

* Run the following command to keep only the rows (from data dumps in `file` mode) that have the expected number of columns:

    ```bash
    ./scripts/filter-bad-rows.sh -i <input-file> -c <expected-columns> -o <output-file>
    ```

* See [scripts](./scripts) for more details.
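
For a quick check without the helper scripts, the idea of "rows with an unexpected number of columns" can be illustrated with `awk`. This is not how the scripts in `./scripts` are implemented; it is a naive sketch with a hypothetical function name, and it splits on every comma, so a quoted field that contains a comma will be miscounted.

```bash
# Hypothetical helper: prints "<line number>: <row>" for each row whose
# comma-separated field count differs from the expected one.
# Usage: list_bad_rows <csv-file> <expected-columns>
list_bad_rows() {
    local file="$1" expected="$2"
    awk -F',' -v n="$expected" 'NF != n { print NR ": " $0 }' "$file"
}
```

An empty result means every row has the expected number of fields under this naive split.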