Investigate deadlock error #136

Closed
opened 2021-10-19 14:22:24 +00:00 by i-norden · 5 comments
Member
```
ERROR[10-18|00:45:00.658] statediff.Service.WriteLoop: processing error block height=4,122,922 error="error building diff for updated accounts: error publishing storage node IPLD: pq: deadlock detected" worker=17
ERROR[10-13|23:07:05.327] statediff.Service.WriteLoop: processing error block height=3,307,833 error="error publishing log trie node IPLD: pq: deadlock detected" worker=6
```

It only appears to be occurring for logTrie and storageTrie nodes, which is illuminating because these are the only nested tries. So it is likely due to an issue with the rctTrie or stateTrie leaf node (respectively) that is being linked to by FK.

arijitAD commented 2021-12-07 05:07:43 +00:00 (Migrated from github.com)

A deadlock can only occur when two or more transactions try to modify the same data.
We run multiple workers in `geth`, each with its own transaction, which can lead to a deadlock; once `deadlock_timeout` passes and Postgres detects the deadlock, it aborts one of the transactions and lets the other commit.
This article explains the conditions under which a deadlock can occur in detail: https://rcoh.svbtle.com/postgres-unique-constraints-can-cause-deadlock.
To resolve this issue, I think we should either retry the transaction that got aborted, or continue and backfill the missing data later.
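A minimal sketch of what such a retry could look like, assuming the `lib/pq` driver (which surfaces the `pq: deadlock detected` error above) and `database/sql`; the helper names `isDeadlock` and `withDeadlockRetry` are hypothetical, not part of the indexer:

```go
package main

import (
	"database/sql"
	"errors"

	"github.com/lib/pq"
)

// isDeadlock reports whether err is Postgres SQLSTATE 40P01 (deadlock_detected).
func isDeadlock(err error) bool {
	var pqErr *pq.Error
	return errors.As(err, &pqErr) && pqErr.Code == "40P01"
}

// withDeadlockRetry runs fn inside its own transaction and replays the whole
// transaction (up to maxRetries extra attempts) when Postgres aborts it as a
// deadlock victim; any other error is returned immediately.
func withDeadlockRetry(db *sql.DB, maxRetries int, fn func(*sql.Tx) error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		tx, beginErr := db.Begin()
		if beginErr != nil {
			return beginErr
		}
		if err = fn(tx); err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		} else {
			tx.Rollback()
		}
		if !isDeadlock(err) {
			return err
		}
		// Deadlock victim: loop and re-run the transaction from the start.
	}
	return err
}
```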

Owner

Please retry the aborted transaction, thanks!

Author
Member

Thanks! Any further insight into why it is occurring from a process perspective? I understand why deadlocks occur in Postgres; what's unclear is when/why we are modifying the same data simultaneously. If we were running multiple processes over overlapping blocks it would be clear why this happens, but it's not apparent why we would see it within a single process with multiple goroutines. Those goroutines pull unique blocks off their shared work channel, except in the case of reorgs. But even with reorgs we would only process the same data* if the last common ancestor of the forks were replayed, and afaik it is not (I can't think of why it would be). That said, two reorgs switching away from and then back to the same blocks could cause this even if the LCA isn't replayed.

It's possible the changes made to the schema on the `postgres_refactor` branch (the new natural primary/foreign key scheme, together with switching from `DO UPDATE` to `DO NOTHING` on most of the tables) will have gotten rid of the underlying cause of these deadlocks.

*Same data including relational context; i.e. we have tons of the same trie nodes, but they exist at different paths in the trie and/or at different block heights.
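For context, a schematic sketch of the worker fan-out described above (illustrative names, not the statediff service's actual API): block heights go out over one shared channel, so within a single process each height is handled by exactly one worker.

```go
package main

import (
	"fmt"
	"sync"
)

// writeLoop fans block heights out to nWorkers goroutines over a shared
// channel; each height is pulled by exactly one worker, so duplicate writes
// within one process are not expected absent reorgs.
func writeLoop(heights <-chan uint64, nWorkers int, writeBlock func(uint64) error) {
	var wg sync.WaitGroup
	for id := 0; id < nWorkers; id++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for h := range heights {
				if err := writeBlock(h); err != nil {
					fmt.Printf("WriteLoop: processing error block height=%d error=%q worker=%d\n", h, err, worker)
				}
			}
		}(id)
	}
	wg.Wait()
}
```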

arijitAD commented 2021-12-14 13:46:13 +00:00 (Migrated from github.com)

The deadlock is arising from the `public.blocks` table: https://github.com/vulcanize/go-ethereum/blob/1565911d66b1030e5924bed3c12a00804251167d/statediff/indexer/shared/functions.go#L108
Two different blocks can produce the same storage node and log trie node in their diffs. Since the key is a hash of the data, the same data is stored under the same key, and that is where the deadlock arises.
Retrying will fix this. I'm not sure we can avoid the locking in this case, since we are writing the same data.
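To make the collision concrete, a hedged sketch of a content-addressed write into `public.blocks` (hypothetical key derivation and column names, not the function linked above): identical trie-node bytes yield identical keys, so workers on different block heights can target the same rows, and inserting overlapping key sets in different orders across two open transactions is exactly the unique-constraint deadlock the linked article describes.

```go
package main

import (
	"database/sql"
	"encoding/hex"

	"golang.org/x/crypto/sha3"
)

// blockKey derives a deterministic key from the raw node bytes. The real
// indexer uses multihash-based keys, but the property that matters is the
// same: identical data => identical key.
func blockKey(data []byte) string {
	h := sha3.NewLegacyKeccak256()
	h.Write(data)
	return hex.EncodeToString(h.Sum(nil))
}

// publishRaw upserts a content-addressed row into public.blocks. Even with
// DO NOTHING, a transaction inserting a conflicting key must wait on the
// other transaction's uncommitted insert of that key, so two transactions
// inserting shared keys in different orders can deadlock.
func publishRaw(tx *sql.Tx, data []byte) (string, error) {
	key := blockKey(data)
	_, err := tx.Exec(
		`INSERT INTO public.blocks (key, data) VALUES ($1, $2)
		 ON CONFLICT (key) DO NOTHING`,
		key, data,
	)
	return key, err
}
```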

Author
Member

Thanks @arijitAD! That explains it; all the duplicate data in `public.blocks` slipped my mind, and in that case our schema updates won't fix the issue.

Reference: cerc-io/go-ethereum#136