Tool for inserting the entire state and storage tries into PG-IPFS
Go to file
Michael 989bd1c0f1
Cerc refactor ()
* cerc refactor waiting on unpublished dependencies

* more cerc_refactor with published ipld-eth-db unstable

* TearDownDB refactor

* missed second publisher_test TearDownDB
2022-09-20 13:47:34 -04:00
.github/workflows Add command to take an in-place snapshot () 2022-06-15 12:51:26 +05:30
cmd Cerc refactor () 2022-09-20 13:47:34 -04:00
db/migrations remove diff unique constraint 2020-08-06 14:33:02 -05:00
environments Add command to take an in-place snapshot () 2022-06-15 12:51:26 +05:30
fixture Cerc refactor () 2022-09-20 13:47:34 -04:00
pkg Cerc refactor () 2022-09-20 13:47:34 -04:00
scripts Add helper scripts for data dump correction () 2022-08-17 15:14:14 +05:30
test Add command to take an in-place snapshot () 2022-06-15 12:51:26 +05:30
.gitignore Account selective snapshot () 2022-08-03 17:05:04 +05:30
docker-compose.yml Cerc refactor () 2022-09-20 13:47:34 -04:00
go.mod Cerc refactor () 2022-09-20 13:47:34 -04:00
go.sum Cerc refactor () 2022-09-20 13:47:34 -04:00
LICENSE cobra init 2020-06-30 11:53:18 -05:00
main.go Cerc refactor () 2022-09-20 13:47:34 -04:00
Makefile Upgrade geth () 2022-07-13 11:39:36 +05:30
README.md Add helper scripts for data dump correction () 2022-08-17 15:14:14 +05:30

ipld-eth-state-snapshot

Tool for extracting the entire Ethereum state at a particular block height from leveldb into Postgres-backed IPFS

Go Report Card

Setup

  • Build the binary:

    make build
    

Configuration

Config format:

[snapshot]
    mode         = "file"           # indicates output mode <postgres | file>
    workers      = 4                # degree of concurrency, the state trie is subdivided into sections that are traversed and processed concurrently
    blockHeight  = -1               # blockheight to perform the snapshot at (-1 indicates to use the latest blockheight found in leveldb)
    recoveryFile = "recovery_file"  # specifies a file to output recovery information on error or premature closure
    accounts = []                   # list of accounts (addresses) to take the snapshot for # SNAPSHOT_ACCOUNTS

[leveldb]
    # path to geth leveldb
    path    = "/Users/user/Library/Ethereum/geth/chaindata"         # LVL_DB_PATH
    # path to geth ancient database
    ancient = "/Users/user/Library/Ethereum/geth/chaindata/ancient" # ANCIENT_DB_PATH

[database]
    # when operating in 'postgres' output mode
    # db credentials
    name     = "vulcanize_public"   # DATABASE_NAME
    hostname = "localhost"          # DATABASE_HOSTNAME
    port     = 5432                 # DATABASE_PORT
    user     = "postgres"           # DATABASE_USER
    password = ""                   # DATABASE_PASSWORD

[file]
    # when operating in 'file' output mode
    # directory the CSV files are written to
    outputDir = "output_dir/"   # FILE_OUTPUT_DIR

[log]
    level = "info"      # log level (trace, debug, info, warn, error, fatal, panic) (default: info)
    file  = "log_file"  # file path for logging, leave unset to log to stdout

[prom]
    # prometheus metrics
    metrics  = true         # enable prometheus metrics         (default: false)
    http     = true         # enable prometheus http service    (default: false)
    httpAddr = "0.0.0.0"    # prometheus http host              (default: 127.0.0.1)
    httpPort = 9101         # prometheus http port              (default: 8086)
    dbStats  = true         # enable prometheus db stats        (default: false)

[ethereum]
    # node info
    clientName   = "Geth"   # ETH_CLIENT_NAME
    nodeID       = "arch1"  # ETH_NODE_ID
    networkID    = "1"      # ETH_NETWORK_ID
    chainID      = "1"      # ETH_CHAIN_ID
    genesisBlock = "0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3" # ETH_GENESIS_BLOCK

Usage

  • For state snapshot from LevelDB:

    ./ipld-eth-state-snapshot stateSnapshot --config={path to toml config file}
    
    • Account selective snapshot: To restrict the snapshot to a list of accounts (addresses), provide the addresses in config parameter snapshot.accounts or env variable SNAPSHOT_ACCOUNTS. Only nodes related to provided addresses will be indexed.

      Example:

      [snapshot]
          accounts = [
              "0x825a6eec09e44Cb0fa19b84353ad0f7858d7F61a"
          ]
      
  • For in-place snapshot in the database:

    ./ipld-eth-state-snapshot inPlaceStateSnapshot --config={path to toml config file}
    

Monitoring

  • Enable metrics using config parameters prom.metrics and prom.http.
  • ipld-eth-state-snapshot exposes following prometheus metrics at /metrics endpoint:
    • state_node_count: Number of state nodes processed.
    • storage_node_count: Number of storage nodes processed.
    • code_node_count: Number of code nodes processed.
    • DB stats if operating in postgres mode.

Tests

  • Run unit tests:

    # setup db
    docker-compose up -d
    
    # run tests after db migrations are run
    make dbtest
    
    # tear down db
    docker-compose down -v --remove-orphans
    

Import output data in file mode into a database

  • When ipld-eth-state-snapshot stateSnapshot is run in file mode (database.type), the output is in form of CSV files.

  • Assuming the output files are located in host's ./output_dir directory.

  • Data post-processing:

    • Create a directory to store post-processed output:

      mkdir -p output_dir/processed_output
      
    • Combine output from multiple workers and copy to post-processed output directory:

      # public.blocks
      cat {output_dir,output_dir/*}/public.blocks.csv > output_dir/processed_output/combined-public.blocks.csv
      
      # eth.state_cids
      cat output_dir/*/eth.state_cids.csv > output_dir/processed_output/combined-eth.state_cids.csv
      
      # eth.storage_cids
      cat output_dir/*/eth.storage_cids.csv > output_dir/processed_output/combined-eth.storage_cids.csv
      
      # public.nodes
      cp output_dir/public.nodes.csv output_dir/processed_output/public.nodes.csv
      
      # eth.header_cids
      cp output_dir/eth.header_cids.csv output_dir/processed_output/eth.header_cids.csv
      
    • De-duplicate data:

      # public.blocks
      sort -u output_dir/processed_output/combined-public.blocks.csv -o output_dir/processed_output/deduped-combined-public.blocks.csv
      
      # eth.header_cids
      sort -u output_dir/processed_output/eth.header_cids.csv -o output_dir/processed_output/deduped-eth.header_cids.csv
      
      # eth.state_cids
      sort -u output_dir/processed_output/combined-eth.state_cids.csv -o output_dir/processed_output/deduped-combined-eth.state_cids.csv
      
      # eth.storage_cids
      sort -u output_dir/processed_output/combined-eth.storage_cids.csv -o output_dir/processed_output/deduped-combined-eth.storage_cids.csv
      
  • Copy over the post-processed output files to the DB server (say in /output_dir).

  • Start psql to run the import commands:

    psql -U <DATABASE_USER> -h <DATABASE_HOSTNAME> -p <DATABASE_PORT> <DATABASE_NAME>
    
  • Run the following to import data:

    # public.nodes
    COPY public.nodes FROM '/output_dir/processed_output/public.nodes.csv' CSV;
    
    # public.blocks
    COPY public.blocks FROM '/output_dir/processed_output/deduped-combined-public.blocks.csv' CSV;
    
    # eth.header_cids
    COPY eth.header_cids FROM '/output_dir/processed_output/deduped-eth.header_cids.csv' CSV;
    
    # eth.state_cids
    COPY eth.state_cids FROM '/output_dir/processed_output/deduped-combined-eth.state_cids.csv' CSV FORCE NOT NULL state_leaf_key;
    
    # eth.storage_cids
    COPY eth.storage_cids FROM '/output_dir/processed_output/deduped-combined-eth.storage_cids.csv' CSV FORCE NOT NULL storage_leaf_key;
    
  • NOTE: COPY command on CSVs inserts empty strings as NULL in the DB. Passing FORCE_NOT_NULL <COLUMN_NAME> forces it to insert empty strings instead. This is required to maintain compatibility of the imported snapshot data with the data generated by statediffing. Reference: https://www.postgresql.org/docs/14/sql-copy.html

Troubleshooting

  • Run the following command to find any rows (in data dumps in file mode) having unexpected number of columns:

    ./scripts/find-bad-rows.sh -i <input-file> -c <expected-columns> -o [output-file] -d true
    
  • Run the following command to select rows (from data dumps in file mode) other than the ones having unexpected number of columns:

    ./scripts/filter-bad-rows.sh -i <input-file> -c <expected-columns> -o <output-file>
    
  • See scripts for more details.