Investigate all uses of canonical_header_hash #251

Closed
opened 2023-06-16 02:43:41 +00:00 by telackey · 7 comments
Contributor

It is very easy to shoot yourself in the foot when using `canonical_header_hash()`: a carelessly written query can end up executing it in a nested loop thousands or millions of times.

We need to make sure that all current uses are efficient, and if they are not, either fix them or make a more serious adjustment so that we no longer need `canonical_header_hash()` at all, for example by marking blocks as canonical/non-canonical rather than computing canonicity with a function.

The latter would be ideal long term, because we don't really want to rely on complicated DB functions; they are very difficult to observe or debug. How urgent that change is, though, depends at least in part on determining how much of an issue this is right now.

Member

I wonder how feasible it would be to modify the existing Postgres queries so that they only apply their `canonical_header_hash(block_number)` condition if, e.g., `block_number >= MAX(block_number) - 64`, and what the performance overhead would then be like for queries that span both regions, and for queries that only target the region below `MAX - 64`.

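A minimal sketch of what that reshaping could look like, assuming a simple lookup against `eth.header_cids` with `block_number` and `block_hash` columns (the query shape and the helper function are illustrative, not the actual server queries):

```go
package eth

import (
	"context"
	"database/sql"
)

// headerByNumber only evaluates canonical_header_hash() inside the reorg-prone
// window near head; for heights below MAX(block_number) - 64 the row is
// assumed to already be canonical.
const headerByNumber = `
SELECT block_hash
FROM eth.header_cids
WHERE block_number = $1
  AND (
    block_number < (SELECT MAX(block_number) FROM eth.header_cids) - 64
    OR block_hash = canonical_header_hash(block_number)
  )`

// headerHashAtHeight returns the hash the DB currently treats as canonical at num.
func headerHashAtHeight(ctx context.Context, db *sql.DB, num uint64) (string, error) {
	var hash string
	err := db.QueryRowContext(ctx, headerByNumber, num).Scan(&hash)
	return hash, err
}
```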
Member

This way we could continue to use it to weigh canonicity in the region of the chain that is still subject to reorgs (2 epochs of 32 slots each, although the slots don't necessarily have to be populated, so it isn't a perfect map to 64 execution layer blocks), and below that cutoff we would have a mechanism for marking/cleaning out non-canon data.

Member

If we are okay with no longer persisting non-canon blocks for analytical purposes, we might be able to simply add a unique constraint (UC) to `eth.header_cids.block_number` and let statediffing geth handle reorgs at the source by UPSERTing the old blocks away.

In this context, we would no longer use `canonical_header_hash` when querying the DB, as whatever the single current state of the DB is (including within the 64-block "frothy" region) would be considered canonical (and would be consistent with what geth considers canonical).

The statediffing service simply subscribes to a stream of latest blocks from geth's `core.BlockChain`. This subscription includes blocks from reorgs, so the service can overwrite non-canon blocks (or tag them / shuffle them somewhere else, if that is more desirable than simply UPSERTing). With a UC on `eth.header_cids.block_number` this would occur passively; I don't think we'd need to change anything beyond some minor modifications to the current `INSERT` statements to add the upsert logic. But if we wanted to, we could also do something more sophisticated in order to continue to persist the non-canon blocks in some form/place for analytical purposes.
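A minimal sketch of that upsert reshaping, assuming the UC on `eth.header_cids.block_number` is in place; the column list is illustrative rather than the real statediff statement:

```go
package eth

import (
	"context"
	"database/sql"
)

// upsertHeader is an illustrative reshaping of the header INSERT: with a
// unique constraint on block_number, a reorged block at the same height
// simply overwrites the row it replaces. Column list is not exhaustive.
const upsertHeader = `
INSERT INTO eth.header_cids (block_number, block_hash, cid)
VALUES ($1, $2, $3)
ON CONFLICT (block_number) DO UPDATE
SET block_hash = EXCLUDED.block_hash,
    cid        = EXCLUDED.cid`

func writeHeader(ctx context.Context, db *sql.DB, num uint64, hash, cid string) error {
	_, err := db.ExecContext(ctx, upsertHeader, num, hash, cid)
	return err
}
```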

In this latter case, where we don't rely on a DB constraint, we need additional logic in the statediffing service to track when a block coming off the feed is a reorg block (the feed doesn't signal this itself). Since reorgs can't go deeper than 2 epochs from head, I think one way to do this would be to maintain an in-memory map of the latest ~64 `(block_hash, block_number)` tuples statediffed into the database; if we get a header off the subscription channel that has the same number as, but a different hash than, one of these processed blocks, we know it is a reorg/replacement.
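A minimal sketch of that bookkeeping, assuming headers arrive one at a time off the subscription (the type, names, and window handling are hypothetical, not the actual service code):

```go
package statediff

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

// reorgTracker remembers the hash written at each of the last ~64 heights so
// that a header arriving with a known number but a different hash can be
// flagged as a reorg/replacement.
type reorgTracker struct {
	recent map[uint64]common.Hash // height -> hash already statediffed
}

func newReorgTracker() *reorgTracker {
	return &reorgTracker{recent: make(map[uint64]common.Hash)}
}

// isReplacement records header and reports whether it replaces a block we
// already processed at the same height.
func (t *reorgTracker) isReplacement(header *types.Header) bool {
	num := header.Number.Uint64()
	prev, seen := t.recent[num]
	t.recent[num] = header.Hash()
	if num > 64 {
		delete(t.recent, num-64) // retain only the reorg-prone window near head
	}
	return seen && prev != header.Hash()
}
```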

This all relies on the assumption that geth is handling reorgs as expected.

Member

Somewhat related: per https://github.com/vulcanize/ops/issues/46#issuecomment-1474229179, we haven't seen any reorgs post-merge that are longer than 1 block.

Member

Since we can't have FKs in place in TimescaleDB, and so cannot cascade the UPSERT to remove all linked rows, relying on the unique constraint doesn't work: it would leave non-canonical records in the other tables.


My gut thought is not to upsert, but instead to retain the "dup" rows and add a field to denote the most favored row.
Regular queries can sort on that field with a LIMIT 1 (which gives you the good block if there is one, but still gives you a young block if it's too early for one to be canonical). Queries that are interested in the chaff blocks can skip that and get them all.
I can't think of a performance impact from not having the UC, and modifying previously inserted data gives me a queasy feeling from a consistency perspective (e.g., if we crash in the middle of fixing up the old data, we need to be careful to have a transaction with the correct lifetime).
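A minimal sketch of the "favored row" read path, assuming a hypothetical boolean marker column (here called `canonical`) on `eth.header_cids`; names are illustrative:

```go
package eth

import (
	"context"
	"database/sql"
)

// favoredHeader sorts on an assumed boolean marker column so callers get the
// preferred row for a height, or a young/unmarked row if nothing at that
// height has been marked yet. The 'canonical' column is hypothetical.
const favoredHeader = `
SELECT block_hash
FROM eth.header_cids
WHERE block_number = $1
ORDER BY canonical DESC
LIMIT 1`

func favoredHashAtHeight(ctx context.Context, db *sql.DB, num uint64) (string, error) {
	var hash string
	err := db.QueryRowContext(ctx, favoredHeader, num).Scan(&hash)
	return hash, err
}
```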

Author
Contributor

Fixed.

Reference: cerc-io/ipld-eth-server#251