From 371c216ac366817e21202019ad042e8744cb331c Mon Sep 17 00:00:00 2001
From: Michael Sproul
Date: Mon, 12 Jul 2021 07:31:26 +0000
Subject: [PATCH] Use read_recursive locks in database (#2417)

## Issue Addressed

Closes #2245

## Proposed Changes

Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.

## Additional Info

* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!`, which is too new for that old toolchain). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think for starvation to occur the database would have to be spammed with so many state reads that it could never clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely also trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur, due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
---
 beacon_node/store/src/hot_cold_store.rs | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/beacon_node/store/src/hot_cold_store.rs b/beacon_node/store/src/hot_cold_store.rs
index 3840ade4f..382342737 100644
--- a/beacon_node/store/src/hot_cold_store.rs
+++ b/beacon_node/store/src/hot_cold_store.rs
@@ -721,7 +721,7 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
         let high_restore_point_idx = low_restore_point_idx + 1;
 
         // Acquire the read lock, so that the split can't change while this is happening.
-        let split = self.split.read();
+        let split = self.split.read_recursive();
 
         let low_restore_point = self.load_restore_point_by_index(low_restore_point_idx)?;
         // If the slot of the high point lies outside the freezer, use the split state
@@ -883,7 +883,7 @@ impl<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>> HotColdDB<E, Hot, Cold>
 
     /// Fetch a copy of the current split slot from memory.
     pub fn get_split_slot(&self) -> Slot {
-        self.split.read().slot
+        self.split.read_recursive().slot
     }
 
     /// Fetch the slot of the most recently stored restore point.
@@ -1056,7 +1056,7 @@ pub fn migrate_database<E: EthSpec, Hot: ItemStore<E>, Cold: ItemStore<E>>(
     // 0. Check that the migration is sensible.
     // The new frozen head must increase the current split slot, and lie on an epoch
     // boundary (in order for the hot state summary scheme to work).
-    let current_split_slot = store.split.read().slot;
+    let current_split_slot = store.split.read_recursive().slot;
 
     if frozen_head.slot() < current_split_slot {
         return Err(HotColdDBError::FreezeSlotError {
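
---

Appended for illustration, not part of the patch: a minimal standalone sketch of the deadlock described under "Additional Info", assuming `parking_lot` as a dependency. The thread structure, timing, and names are hypothetical, only loosely modelled on the idea behind the linked demo repo. With `read()` at the marked line, a queued writer can wedge a thread that re-acquires a read lock it already holds (parking_lot's lock blocks new readers while a writer waits, to avoid writer starvation); `read_recursive()` ignores the queued writer and lets the program finish.

```rust
use parking_lot::RwLock;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the database's split point.
    let split = Arc::new(RwLock::new(0u64));

    // Reader thread: takes a read lock, then re-acquires it while still
    // holding the first guard (mirrors a state read that also consults the
    // split point internally).
    let reader = {
        let split = Arc::clone(&split);
        thread::spawn(move || {
            let outer = split.read();
            // Give the writer time to queue up on the lock.
            thread::sleep(Duration::from_millis(100));
            // With `split.read()` here, the queued writer blocks this
            // re-acquisition while it waits for `outer` to drop: deadlock.
            // `read_recursive()` does not wait behind the queued writer.
            let inner = split.read_recursive();
            println!("re-acquired read lock, split = {}", *inner);
            drop(inner);
            drop(outer);
        })
    };

    // Writer thread: simulates the migration advancing the split point.
    let writer = {
        let split = Arc::clone(&split);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(10));
            let mut guard = split.write();
            *guard += 32;
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
    println!("final split slot: {}", *split.read());
}
```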