Carrion PK conflict #367

Closed
opened 2023-04-13 18:55:18 +00:00 by i-norden · 6 comments
Member

We are running into a PK conflict error on the Vulture/Carrion v4 setup when running in `copyfrom=true` mode:

```
cht-snapshot-12353535-db_1  | 2023-04-13 17:47:20.765 UTC [807631] ERROR:  duplicate key value violates unique constraint "1397_1397_transaction_cids_pkey"
cht-snapshot-12353535-db_1  | 2023-04-13 17:47:20.765 UTC [807631] DETAIL:  Key (tx_hash, header_id, block_number)=(0xf0e66c5f4cfe3714bbe8cdc527064cc479e07463dd619d277c0ef195f90a3033, 0x65e4fc3aab253851e9094d143f478f4cece6f28221482594203a156b1319560c, 17039998) already exists.
cht-snapshot-12353535-db_1  | 2023-04-13 17:47:20.765 UTC [807631] CONTEXT:  COPY transaction_cids, line 1
cht-snapshot-12353535-db_1  | 2023-04-13 17:47:20.765 UTC [807631] STATEMENT:  copy "eth"."transaction_cids" ( "block_number", "header_id", "tx_hash", "cid", "dst", "src", "index", "mh_key", "tx_data", "tx_type", "value" ) from stdin binary;
```

This is due to a combination of:

  1. For some reason, we are re-processing a duplicate block. The cause of this is unclear: even when a reorg occurs at the execution layer, it should only include distinct blocks going back to the latest common ancestor block. Although perhaps we are reorging and then reorging _again_, back to the original chain segment (or to a new segment whose first blocks overlap with the first blocks of the segment that was orphaned in the first reorg)...
  2. COPY FROM can't handle ON CONFLICT clauses, unless you use a workaround like https://stackoverflow.com/questions/48019381/how-postgresql-copy-to-stdin-with-csv-do-on-conflic-do-update, where you first COPY FROM into a temp table and then INSERT into the target table using an ON CONFLICT clause (a sketch of this staging approach is below).
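A minimal sketch of that staging workaround in Go with pgx, assuming the pgx v4 driver and the `eth.transaction_cids` column layout from the error above; the function name and row source are illustrative, not the actual indexer code:

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v4"
)

// copyTransactionCIDs stages rows via COPY FROM into a temp table, then moves
// them into eth.transaction_cids with an ON CONFLICT clause so duplicate
// (tx_hash, header_id, block_number) keys are silently skipped.
func copyTransactionCIDs(ctx context.Context, tx pgx.Tx, rows [][]interface{}) error {
	// Temp table with the same shape as the target; dropped at COMMIT.
	if _, err := tx.Exec(ctx, `CREATE TEMP TABLE transaction_cids_tmp
		(LIKE eth.transaction_cids INCLUDING DEFAULTS) ON COMMIT DROP`); err != nil {
		return err
	}
	cols := []string{"block_number", "header_id", "tx_hash", "cid", "dst", "src",
		"index", "mh_key", "tx_data", "tx_type", "value"}
	if _, err := tx.CopyFrom(ctx, pgx.Identifier{"transaction_cids_tmp"}, cols,
		pgx.CopyFromRows(rows)); err != nil {
		return err
	}
	// The conflict target matches the PK named in the error message.
	_, err := tx.Exec(ctx, `INSERT INTO eth.transaction_cids
		SELECT * FROM transaction_cids_tmp
		ON CONFLICT (tx_hash, header_id, block_number) DO NOTHING`)
	return err
}
```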
Author
Member

Since it only occurs once in a blue moon under mainnet conditions that can't be reliably reproduced, it will be difficult to explore and test the issue.

So I’m more inclined to either

  1. Adjust the `copyfrom` mode to handle ON CONFLICT clauses, using a temp table as mentioned above, or
  2. Add a heuristic to make sure we don't reprocess a duplicate block, i.e. keep a small cache of the most recently processed block hashes (e.g. ~2 beacon chain epochs' worth), and if we see a duplicate hash passed to the statediffing service, ignore it (see the sketch below).

In the meantime we may want to revert to using `copyfrom=false`.
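For option 2, here is a rough sketch of what such a cache could look like, assuming it sits in front of the statediffing service and is keyed on the block hash; the type and method names are hypothetical, not existing service code. With 32 slots per beacon chain epoch, two epochs is 64 slots, so a capacity of 64 roughly matches the suggestion above.

```go
package statediff

import (
	"sync"

	"github.com/ethereum/go-ethereum/common"
)

// recentHashes is a fixed-size, thread-safe record of the most recently
// statediffed block hashes, used to drop duplicate requests.
type recentHashes struct {
	mu       sync.Mutex
	capacity int
	order    []common.Hash // insertion order, oldest first
	seen     map[common.Hash]struct{}
}

func newRecentHashes(capacity int) *recentHashes {
	return &recentHashes{
		capacity: capacity,
		seen:     make(map[common.Hash]struct{}, capacity),
	}
}

// Seen reports whether hash was already recorded; if not, it records it,
// evicting the oldest entry once the cache is full.
func (r *recentHashes) Seen(hash common.Hash) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.seen[hash]; ok {
		return true
	}
	if len(r.order) >= r.capacity {
		oldest := r.order[0]
		r.order = r.order[1:]
		delete(r.seen, oldest)
	}
	r.order = append(r.order, hash)
	r.seen[hash] = struct{}{}
	return false
}
```

Both the head-tracking loop and any externally triggered write path would check `Seen(block.Hash())` before kicking off a statediff.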

Owner

@dboreham just something to be aware of.

Author
Member

I think we should elevate this to a somewhat high priority; until this is fixed, we cannot rely on the `COPY FROM` mode.

Member

Looking at the recent logs, here are a few interesting details:

  1. All the errors occur in the span of a few minutes, after the server has already been up for hours.
  2. The end of each errored operation is properly logged, but the _next_ round of operations never completes.
  3. The same block is sometimes statediffed more than once. This is not due to multiple ChainEvents (I added logging for this); rather, it seems to come from a combination of an internally triggered statediff from a `blockchain.ChainEvent` and one or more externally triggered statediffs from `statediff_writeStatediffAt` API calls.

Some digging and intentional reproduction in the fixturenet leads me to a theory of what is happening:

  1. A `blockchain.ChainEvent` is generated and the normal head-tracking statediff begins.
  2. Moments later, an API request to statediff the same block begins, likely from `ipld-eth-server`, which calls `writeStateDiffAt` automatically under several conditions.
  3. Both statediff operations run, but only one TX succeeds in writing to the DB.
  4. The errored operation fails to release its DB connection back to the pgx pool. Its `END` is properly logged, however, so it doesn't exit completely sideways.
  5. The next round of statediffs log their `BEGIN` but never an `END`, because they are blocked forever, waiting for the leaked DB connections to be returned to the (now empty) DB connection pool.

I have confirmed that the DB conn is being leaked by intentionally triggering a similar error and observing that the pool gets exhausted, but I need to track down the exact source.
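One way to observe the exhaustion is to periodically log the pgx pool statistics. A minimal sketch, assuming access to the service's `*pgxpool.Pool` (pgx v4); the logging interval and wiring are illustrative, not the actual diagnostics:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v4/pgxpool"
)

// logPoolStats periodically reports pgx pool usage. If AcquiredConns stays
// pinned at MaxConns while statediff operations have stalled, connections are
// being leaked rather than returned to the pool.
func logPoolStats(ctx context.Context, pool *pgxpool.Pool) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s := pool.Stat()
			log.Printf("pgx pool: acquired=%d idle=%d total=%d max=%d",
				s.AcquiredConns(), s.IdleConns(), s.TotalConns(), s.MaxConns())
		}
	}
}
```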

There are a few levels of fixes that need to occur here:

  1. Error handling needs to be fixed so that no resources are leaked on error. This is the biggest issue (see the sketch below this list).
  2. We should not allow statediffing the same block in multiple workers at the same time.
  3. We should not re-statediff a block we have already successfully statediffed (at least not as a rule).
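For point 1, the fix boils down to making sure the connection and transaction are released on every exit path. A minimal sketch of that pattern with pgx v4, assuming the per-block write is wrapped in a transaction; the wrapper is hypothetical, not the actual statediff writer:

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v4"
	"github.com/jackc/pgx/v4/pgxpool"
)

// withBlockTx acquires a connection, runs write inside a transaction, and
// guarantees the connection goes back to the pool on every exit path.
func withBlockTx(ctx context.Context, pool *pgxpool.Pool, write func(pgx.Tx) error) error {
	conn, err := pool.Acquire(ctx)
	if err != nil {
		return err
	}
	// Always runs, even if write fails mid-COPY with a PK conflict, so the
	// connection cannot leak out of the pool.
	defer conn.Release()

	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	// Rollback after a successful Commit is a harmless no-op (pgx reports
	// ErrTxClosed), so it is safe to defer unconditionally.
	defer tx.Rollback(ctx)

	if err := write(tx); err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```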
Member

All 3 points are fixed in v5. I am backporting the fixes to v4 for deployment on vulture.

Author
Member

Thanks @telackey!
