Investigate all uses of canonical_header_hash #251

Closed
opened 2023-06-16 02:43:41 +00:00 by telackey · 7 comments
Contributor

It is very easy to shoot yourself in the foot when using `canonical_header_hash()`: a carelessly written query can end up executing it in a nested loop thousands or millions of times.

We need to make sure that all current uses are efficient, and if they are not, either fix them or make a more serious adjustment so that we no longer need `canonical_header_hash()` at all, for example by marking blocks as canonical/non-canonical rather than computing canonicity with a function.

The latter would be ideal long term, because we don't really want to rely on complicated DB functions; they are very difficult to observe or debug. How urgent that change is, though, depends at least in part on determining how much of an issue this is right now.

Member

I wonder how feasible it would be to modify the existing Postgres queries so that they only apply their `canonical_header_hash(block_number)` condition if, e.g., `block_number >= MAX(block_number) - 64`, and what the performance overhead would then be like for queries that span both regions, and for queries that only target the region below `MAX - 64`.

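A minimal sketch of what that reshaping could look like, assuming a simple lookup against `eth.header_cids` with `block_number` and `block_hash` columns (the query shape and the helper function are illustrative, not the actual server queries):

```go
package eth

import (
	"context"
	"database/sql"
)

// headerByNumber only evaluates canonical_header_hash() inside the reorg-prone
// window near head; for heights below MAX(block_number) - 64 the row is
// assumed to already be canonical.
const headerByNumber = `
SELECT block_hash
FROM eth.header_cids
WHERE block_number = $1
  AND (
    block_number < (SELECT MAX(block_number) FROM eth.header_cids) - 64
    OR block_hash = canonical_header_hash(block_number)
  )`

// headerHashAtHeight returns the hash the DB currently treats as canonical at num.
func headerHashAtHeight(ctx context.Context, db *sql.DB, num uint64) (string, error) {
	var hash string
	err := db.QueryRowContext(ctx, headerByNumber, num).Scan(&hash)
	return hash, err
}
```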
Member

This way we could continue to use it to weigh canonicity in the region of the chain that is still subject to reorgs (2 epochs of 32 slots each, although the slots don't necessarily have to be populated, so it isn't a perfect map to 64 execution layer blocks), and below that cutoff we would have a mechanism for marking/cleaning out non-canon data.

Member

If we are okay with no longer persisting non-canon blocks for analytical purposes, we might be able to simply add a unique constraint (UC) to `eth.header_cids.block_number` and let statediffing geth handle reorgs at the source by UPSERTing the old blocks away.

In this context, we would no longer use `canonical_header_hash` when querying the DB, as whatever the single current state of the DB is (including within the 64-block "frothy" region) would be considered canonical (and would be consistent with what geth considers canonical).

The statediffing service simply subscribes to a stream of latest blocks from geth's `core.BlockChain`. This subscription includes blocks from reorgs, so the service can overwrite non-canon blocks (or tag them / shuffle them somewhere else, if that is more desirable than simply UPSERTing). With a UC on `eth.header_cids.block_number` this would occur passively; I don't think we'd need to change anything beyond some minor modifications to the current `INSERT` statements to add the upsert logic. But if we wanted to, we could also do something more sophisticated in order to continue to persist the non-canon blocks in some form/place for analytical purposes.
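A minimal sketch of that upsert reshaping, assuming the UC on `eth.header_cids.block_number` is in place; the column list is illustrative rather than the real statediff statement:

```go
package eth

import (
	"context"
	"database/sql"
)

// upsertHeader is an illustrative reshaping of the header INSERT: with a
// unique constraint on block_number, a reorged block at the same height
// simply overwrites the row it replaces. Column list is not exhaustive.
const upsertHeader = `
INSERT INTO eth.header_cids (block_number, block_hash, cid)
VALUES ($1, $2, $3)
ON CONFLICT (block_number) DO UPDATE
SET block_hash = EXCLUDED.block_hash,
    cid        = EXCLUDED.cid`

func writeHeader(ctx context.Context, db *sql.DB, num uint64, hash, cid string) error {
	_, err := db.ExecContext(ctx, upsertHeader, num, hash, cid)
	return err
}
```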

In this latter case, where we don't rely on a DB constraint, we need additional logic in the statediffing service to track when a block coming off the feed is a reorg block (the feed doesn't signal this itself). Since reorgs can't go deeper than 2 epochs from head, I think one way to do this would be to maintain an in-memory map of the latest ~64 `(block_hash, block_number)` tuples statediffed into the database; if we get a header off the subscription channel that has the same number as, but a different hash than, one of these processed blocks, we know it is a reorg/replacement.
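A minimal sketch of that bookkeeping, assuming headers arrive one at a time off the subscription (the type, names, and window handling are hypothetical, not the actual service code):

```go
package statediff

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

// reorgTracker remembers the hash written at each of the last ~64 heights so
// that a header arriving with a known number but a different hash can be
// flagged as a reorg/replacement.
type reorgTracker struct {
	recent map[uint64]common.Hash // height -> hash already statediffed
}

func newReorgTracker() *reorgTracker {
	return &reorgTracker{recent: make(map[uint64]common.Hash)}
}

// isReplacement records header and reports whether it replaces a block we
// already processed at the same height.
func (t *reorgTracker) isReplacement(header *types.Header) bool {
	num := header.Number.Uint64()
	prev, seen := t.recent[num]
	t.recent[num] = header.Hash()
	if num > 64 {
		delete(t.recent, num-64) // retain only the reorg-prone window near head
	}
	return seen && prev != header.Hash()
}
```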

This all relies on the assumption that geth is handling reorgs as expected.

Member

Somewhat related: per https://github.com/vulcanize/ops/issues/46#issuecomment-1474229179, we haven't seen any reorgs post-merge that are longer than 1 block.

Member

Since we can't have FKs in place in TimescaleDB, and so cannot cascade the UPSERT to remove all linked rows, relying on the unique constraint doesn't work: it would leave non-canonical records in the other tables.


My gut thought is not to upsert, but instead to retain the "dup" rows and add a field to denote the most favored row.
Regular queries can sort on that field with a LIMIT 1 (which gives you the good block if there is one, but still gives you a young block if it's too early for one to be canonical). Queries that are interested in the chaff blocks can skip that and get them all.
I can't think of a performance impact from not having the UC, and modifying previously inserted data gives me a queasy feeling from a consistency perspective (e.g., if we crash in the middle of fixing up the old data, we need to be careful to have a transaction with the correct lifetime).
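A minimal sketch of the "favored row" read path, assuming a hypothetical boolean marker column (here called `canonical`) on `eth.header_cids`; names are illustrative:

```go
package eth

import (
	"context"
	"database/sql"
)

// favoredHeader sorts on an assumed boolean marker column so callers get the
// preferred row for a height, or a young/unmarked row if nothing at that
// height has been marked yet. The 'canonical' column is hypothetical.
const favoredHeader = `
SELECT block_hash
FROM eth.header_cids
WHERE block_number = $1
ORDER BY canonical DESC
LIMIT 1`

func favoredHashAtHeight(ctx context.Context, db *sql.DB, num uint64) (string, error) {
	var hash string
	err := db.QueryRowContext(ctx, favoredHeader, num).Scan(&hash)
	return hash, err
}
```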

Author
Contributor

Fixed.

Reference: cerc-io/ipld-eth-server#251