be544a3424
* Add a script to find bad data in CSV file dumps * Add a script to delete bad rows from CSV file dumps * Add instructions to run the scripts * Reorganize instructions
2.8 KiB
2.8 KiB
Data Validation
- For a given table in the
ipld-eth-db
schema, we know the number of columns to be expected in each row in the data dump:Table Expected columns public.nodes 5 public.blocks 3 eth.header_cids 16 eth.state_cids 8 eth.storage_cids 9
Find Bad Data
-
Run the following command to find any rows having unexpected number of columns:
./scripts/find-bad-rows.sh -i <input-file> -c <expected-columns> -o [output-file] -d [include-data]
-
input-file
-i
: Input data file path -
expected-columns
-c
: Expected number of columns in each row of the input file -
output-file
-o
: Output destination file path (default:STDOUT
) -
include-data
-d
: Whether to include the data row in the output (true | false
) (default:false
) -
The output is of format: row number, number of columns, the data row
Eg:
./scripts/find-bad-rows.sh -i eth.state_cids.csv -c 8 -o res.txt -d true
Output:
1 9 1500000,xxxxxxxx,0x83952d392f9b0059eea94b10d1a095eefb1943ea91595a16c6698757127d4e1c,,baglacgzasvqcntdahkxhufdnkm7a22s2eetj6mx6nzkarwxtkvy4x3bubdgq,\x0f,0,f,/blocks/,DMQJKYBGZRQDVLT2CRWVGPQNNJNCCJU7GL7G4VAI3LZVK4OL5Q2ARTI
Eg:
./scripts/find-bad-rows.sh -i public.nodes.csv -c 5 -o res.txt -d true ./scripts/find-bad-rows.sh -i public.blocks.csv -c 3 -o res.txt -d true ./scripts/find-bad-rows.sh -i eth.header_cids.csv -c 16 -o res.txt -d true ./scripts/find-bad-rows.sh -i eth.state_cids.csv -c 8 -o res.txt -d true ./scripts/find-bad-rows.sh -i eth.storage_cids.csv -c 9 -o res.txt -d true
-
Data Cleanup
- In case of column count mismatch, data from
file
mode dumps can't be imported readily intoipld-eth-db
.
Filter Bad Data
-
Run the following command to filter out rows having unexpected number of columns:
./scripts/filter-bad-rows.sh -i <input-file> -c <expected-columns> -o <output-file>
-
input-file
-i
: Input data file path -
expected-columns
-c
: Expected number of columns in each row of the input file -
output-file
-o
: Output destination file pathEg:
./scripts/filter-bad-rows.sh -i public.nodes.csv -c 5 -o cleaned-public.nodes.csv ./scripts/filter-bad-rows.sh -i public.blocks.csv -c 3 -o cleaned-public.blocks.csv ./scripts/filter-bad-rows.sh -i eth.header_cids.csv -c 16 -o cleaned-eth.header_cids.csv ./scripts/filter-bad-rows.sh -i eth.state_cids.csv -c 8 -o cleaned-eth.state_cids.csv ./scripts/filter-bad-rows.sh -i eth.storage_cids.csv -c 9 -o cleaned-eth.storage_cids.csv
-