Reorgs lead to invalid block hierarchy in the statediff database. #398

Closed
opened 2023-07-11 19:41:04 +00:00 by telackey · 3 comments
Member

The reorganization code is invoked in geth whenever more than one block is added to the chain at a time. One recent PR handled a bug where blocks were missed when the chain was purely extended (eg, the chain was at 1000, extended to 1003, and we skipped statediffing blocks 1001 and 1002).

The other case is a reorganization in the more formal sense, when we have a different sequence of blocks at the same height, and switch from and "old" chain to a "new" chain. For example, we have blocks [1000:0x123, 1001:0xabc] and are told by the consensus client to switch to [1000:0x123, 1001:0x456, 1002:0x789].

Note: I made the second chain longer, because in practice, the reorg is triggered by the consensus client informing us via engine_forkchoiceUpdatedV2 what should be the next head, which is almost always the next block.

This will cause a similar bug to the one mentioned above, but in this case we will have the wrong block stored in the database at 1001, rather than no block at all. The fix for #397 will not catch this, because that is only able to observe the gap in the sequence, and there is no gap in this instance.

I was able to induce reorgs with some regularity in the fixturenet by pausing the statediffing container for two slots, then unpausing it for one, in a loop.

In the DB, we then see that that this allows us to insert blocks without a parent:

cerc_testing=# select count(*) from eth.header_cids h where not exists (select block_number from eth.header_cids where block_hash = h.parent_hash);
 count
-------
    32

Here is a specific example:

cerc_testing=# select block_number, block_hash, parent_hash from eth.header_cids where block_number in (16700, 16701, 16702, 16703);
 block_number |                             block_hash                             |                            parent_hash
--------------+--------------------------------------------------------------------+--------------------------------------------------------------------
        16700 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 | 0x73aeb9b37322561d21d251cd5d0070bc4ef661ab941b9624359afb8c263ff7f6
        16701 | 0x18dc5fe9c5b8a12e2e22ff33347b73fcd1314f2571c87a8820d20253efac15aa | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1
        16702 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7
        16703 | 0x710d77076db42ef8e98010e0c131d43add947162441e354c523a987cc8a7e0b6 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d
(4 rows)

You can see that 16701 (0x18dc) points to 16700 (0x3d0b), and 16703 (0x710d) points to 16702 (0xdb3f), but 16702 points to a parent (0x406d) that does not exist. Yet there are no duplicated block numbers at 16701, it is just a missing entry.

This can be fixed by explicitly calling statediff_writeStateDiffAt(16701) after which the hierarchy is fixed. We see both the old, orphaned 16701 (0x18dc) and the new, canonical 16701 (0x406d).

cerc_testing=# select block_number, block_hash, parent_hash from eth.header_cids where block_number in (16700, 16701, 16702, 16703);
 block_number |                             block_hash                             |                            parent_hash
--------------+--------------------------------------------------------------------+--------------------------------------------------------------------
        16700 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 | 0x73aeb9b37322561d21d251cd5d0070bc4ef661ab941b9624359afb8c263ff7f6
        16701 | 0x18dc5fe9c5b8a12e2e22ff33347b73fcd1314f2571c87a8820d20253efac15aa | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1
        16701 | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1
        16702 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7
        16703 | 0x710d77076db42ef8e98010e0c131d43add947162441e354c523a987cc8a7e0b6 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d
(5 rows)

It is an important detail that we could have been informed about this, as the reorganization issues not only a ChainEvent for the new head block, but also ChainSideEvents for every block removed from the chain. We do not currently listen for ChainSideEvents however.

The reorganization code is invoked in geth whenever more than one block is added to the chain at a time. One recent [PR](https://github.com/cerc-io/go-ethereum/pull/397) handled a bug where blocks were missed when the chain was purely extended (eg, the chain was at 1000, extended to 1003, and we skipped statediffing blocks 1001 and 1002). The other case is a reorganization in the more formal sense, when we have a different sequence of blocks at the same height, and switch from and "old" chain to a "new" chain. For example, we have blocks `[1000:0x123, 1001:0xabc]` and are told by the consensus client to switch to `[1000:0x123, 1001:0x456, 1002:0x789]`. > Note: I made the second chain longer, because in practice, the reorg is triggered by the consensus client informing us via `engine_forkchoiceUpdatedV2` what should be the _next_ head, which is almost always the _next_ block. This will cause a similar bug to the one mentioned above, but in this case we will have the _wrong_ block stored in the database at 1001, rather than no block at all. The fix for #397 will not catch this, because that is only able to observe the gap in the sequence, and there is no gap in this instance. I was able to induce reorgs with some regularity in the fixturenet by pausing the statediffing container for two slots, then unpausing it for one, in a loop. In the DB, we then see that that this allows us to insert blocks without a parent: ``` cerc_testing=# select count(*) from eth.header_cids h where not exists (select block_number from eth.header_cids where block_hash = h.parent_hash); count ------- 32 ``` Here is a specific example: ``` cerc_testing=# select block_number, block_hash, parent_hash from eth.header_cids where block_number in (16700, 16701, 16702, 16703); block_number | block_hash | parent_hash --------------+--------------------------------------------------------------------+-------------------------------------------------------------------- 16700 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 | 0x73aeb9b37322561d21d251cd5d0070bc4ef661ab941b9624359afb8c263ff7f6 16701 | 0x18dc5fe9c5b8a12e2e22ff33347b73fcd1314f2571c87a8820d20253efac15aa | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 16702 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7 16703 | 0x710d77076db42ef8e98010e0c131d43add947162441e354c523a987cc8a7e0b6 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d (4 rows) ``` You can see that 16701 (0x18dc) points to 16700 (0x3d0b), and 16703 (0x710d) points to 16702 (0xdb3f), but 16702 points to a parent (0x406d) that does not exist. Yet there are no duplicated block numbers at 16701, it is just a missing entry. This can be fixed by explicitly calling `statediff_writeStateDiffAt(16701)` after which the hierarchy is fixed. We see both the old, orphaned 16701 (0x18dc) and the new, canonical 16701 (0x406d). ``` cerc_testing=# select block_number, block_hash, parent_hash from eth.header_cids where block_number in (16700, 16701, 16702, 16703); block_number | block_hash | parent_hash --------------+--------------------------------------------------------------------+-------------------------------------------------------------------- 16700 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 | 0x73aeb9b37322561d21d251cd5d0070bc4ef661ab941b9624359afb8c263ff7f6 16701 | 0x18dc5fe9c5b8a12e2e22ff33347b73fcd1314f2571c87a8820d20253efac15aa | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 16701 | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7 | 0x3d0ba0e96a89b24635aa3b3c2682dd8577a308bd5ec359e59fe0be5a3524f7f1 16702 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d | 0x406d1028301eb12ebfca0adf3febd26abbc3d6523e5cf3fa5b9082d9d6c2c1c7 16703 | 0x710d77076db42ef8e98010e0c131d43add947162441e354c523a987cc8a7e0b6 | 0xdb3fa74b7d15468fb8619e1b52090c77085d2993be2ebcad0201fcaaa61abe2d (5 rows) ``` It is an important detail that we _could_ have been informed about this, as the reorganization issues not only a `ChainEvent` for the new head block, but also `ChainSideEvent`s for every block removed from the chain. We do not currently listen for `ChainSideEvent`s however.
Author
Member

If we double-checked that we had the parent of each block we insert, it should serve as a fix for both the #397 issue and this one, while being more reliable and resilient than relying purely on event notification.

We'd need to handle the empty DB case to prevent chasing the rabbit all the way back to genesis, but otherwise it should make sure we do not allow gaps to open.

If we double-checked that we had the parent of each block we insert, it should serve as a fix for both the #397 issue and this one, while being more reliable and resilient than relying purely on event notification. We'd need to handle the empty DB case to prevent chasing the rabbit all the way back to genesis, but otherwise it should make sure we do not allow gaps to open.
Owner

If these additional checks can be done in a performant way, by all means, please add them. Relatedly, we should make sure that https://github.com/cerc-io/eth-ipfs-state-validator would have caught and corrected these issues.

If these additional checks can be done in a performant way, by all means, please add them. Relatedly, we should make sure that https://github.com/cerc-io/eth-ipfs-state-validator would have caught and corrected these issues.
Author
Member

I can verify that https://github.com/cerc-io/ipld-eth-db-validator/tree/v5 detects this issue, though it cannot correct it.

The reported error looks like:

ERRO[0003] failed to verify state root at block 1889
FATA[0003] sql: no rows in result set
I can verify that https://github.com/cerc-io/ipld-eth-db-validator/tree/v5 detects this issue, though it cannot correct it. The reported error looks like: ``` ERRO[0003] failed to verify state root at block 1889 FATA[0003] sql: no rows in result set ```
Sign in to join this conversation.
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: cerc-io/go-ethereum#398
No description provided.