2020-04-06 00:53:33 +00:00
|
|
|
use crate::BeaconSnapshot;
|
2022-01-14 07:20:55 +00:00
|
|
|
use itertools::process_results;
|
2020-04-06 00:53:33 +00:00
|
|
|
use std::cmp;
|
Use async code when interacting with EL (#3244)
## Overview
This rather extensive PR achieves two primary goals:
1. Uses the finalized/justified checkpoints of fork choice (FC), rather than those of the head state.
2. Refactors fork choice, block production and block processing to `async` functions.
Additionally, it achieves:
- Concurrent forkchoice updates to the EL and cache pruning after a new head is selected.
- Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production.
- Concurrent per-block-processing and execution payload verification during block processing.
- The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?):
- I had to do this to deal with sending blocks into spawned tasks.
- Previously we were cloning the beacon block at least 2 times during each block processing; these clones are either removed or turned into cheaper `Arc` clones.
- We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap.
- Avoids cloning *all the blocks* in *every chain segment* during sync.
- It also has the potential to clean up our code where we need to pass an *owned* block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough :sweat_smile:)
- The `BeaconChain::HeadSafetyStatus` struct was removed. It was an old relic from prior merge specs.
For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273
## Changes to `canonical_head` and `fork_choice`
Previously, the `BeaconChain` had two separate fields:
```
canonical_head: RwLock<Snapshot>,
fork_choice: RwLock<BeaconForkChoice>
```
Now, we have grouped these values under a single struct:
```
canonical_head: CanonicalHead {
cached_head: RwLock<Arc<Snapshot>>,
fork_choice: RwLock<BeaconForkChoice>
}
```
Apart from ergonomics, the only *actual* change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously.
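As a rough illustration of the pattern this enables (simplified stand-in types, not the actual Lighthouse API), a reader clones the `Arc` and releases the lock immediately:
```
use std::sync::{Arc, RwLock};

// Simplified stand-ins for the real types.
struct Snapshot {
    head_slot: u64,
}

struct CanonicalHead {
    cached_head: RwLock<Arc<Snapshot>>,
}

impl CanonicalHead {
    // Clone the `Arc` and release the read lock immediately, so callers never
    // hold `cached_head` while doing other work (e.g. taking `fork_choice`).
    fn cached_head(&self) -> Arc<Snapshot> {
        Arc::clone(&self.cached_head.read().unwrap())
    }
}

fn head_slot(head: &CanonicalHead) -> u64 {
    // The lock has already been released by the time we read the snapshot.
    head.cached_head().head_slot
}
```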
## Breaking Changes
### The `state` (root) field in the `finalized_checkpoint` SSE event
Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might include in the `finalized_checkpoint` SSE event:
1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`.
2. The state root at `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots.
Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event](https://github.com/ConsenSys/teku/blob/de2b2801c89ef5abf983d6bf37867c37fc47121f/data/beaconrestapi/src/main/java/tech/pegasys/teku/beaconrestapi/handlers/v1/events/EventSubscriptionManager.java#L171-L182) it uses [`getStateRootFromBlockRoot`](https://github.com/ConsenSys/teku/blob/de2b2801c89ef5abf983d6bf37867c37fc47121f/data/provider/src/main/java/tech/pegasys/teku/api/ChainDataProvider.java#L336-L341) which uses (1).
I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku.
## Notes for Reviewers
I've renamed `BeaconChain::fork_choice` to `BeaconChain::recompute_head`. Doing this helped ensure I broke all previous uses of fork choice and I also find it more descriptive. It describes an action and can't be confused with trying to get a reference to the `ForkChoice` struct.
I've changed the ordering of SSE events when a block is received. It used to be `[block, finalized, head]` and now it's `[block, head, finalized]`. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking".
I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to `get_head`. Ensuring `get_head` has been run *at least once* means that the cached value doesn't need to be wrapped in an `Option`. This was fairly simple: it just involved passing a `slot` to the constructor so it knows *when* it's being run. When loading a fork choice from the store and a slot clock isn't handy I've just used the `slot` that was saved in the `fork_choice_store`. That seems like it would be a faithful representation of the slot when we saved it.
I added the `genesis_time: u64` to the `BeaconChain`. It's small, constant and nice to have around.
Since we're using FC for the fin/just checkpoints, we no longer get the `0x00..00` roots at genesis. You can see I had to remove a work-around in `ef-tests` here: b56be3bc2. I can't find any reason why this would be an issue, if anything I think it'll be better since the genesis-alias has caught us out a few times (0x00..00 isn't actually a real root). Edit: I did find a case where the `network` expected the 0x00..00 alias and patched it here: 3f26ac3e2.
You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests:
- Changing tests to be `tokio::async` tests.
- Adding `.await` to fork choice, block processing and block production functions.
- Refactor of the `canonical_head` "API" provided by the `BeaconChain`. E.g., `chain.canonical_head.cached_head()` instead of `chain.canonical_head.read()`.
- Wrapping `SignedBeaconBlock` in an `Arc`.
- In the `beacon_chain/tests/block_verification`, we can't use the `lazy_static` `CHAIN_SEGMENT` variable anymore since it's generated with an async function. We just generate it in each test, not so efficient but hopefully insignificant.
I had to disable `rayon` concurrent tests in the `fork_choice` tests. This is because the use of `rayon` and `block_on` was causing a panic.
Co-authored-by: Mac L <mjladson@pm.me>
2022-07-03 05:36:50 +00:00
|
|
|
use std::sync::Arc;
|
2021-12-06 03:41:31 +00:00
|
|
|
use std::time::Duration;
|
2021-02-15 07:17:52 +00:00
|
|
|
use types::{
|
Separate execution payloads in the DB (#3157)
## Proposed Changes
Reduce post-merge disk usage by not storing finalized execution payloads in Lighthouse's database.
:warning: **This is achieved in a backwards-incompatible way for networks that have already merged** :warning:. Kiln users and shadow fork enjoyers will be unable to downgrade after running the code from this PR. The upgrade migration may take several minutes to run, and can't be aborted after it begins.
The main changes are:
- New column in the database called `ExecPayload`, keyed by beacon block root.
- The `BeaconBlock` column now stores blinded blocks only.
- Lots of places that previously used full blocks now use blinded blocks, e.g. analytics APIs, block replay in the DB, etc.
- On finalization:
- `prune_abandoned_forks` deletes non-canonical payloads whilst deleting non-canonical blocks.
- `migrate_db` deletes finalized canonical payloads whilst deleting finalized states.
- Conversions between blinded and full blocks are implemented in a compositional way, duplicating some work from Sean's PR #3134.
- The execution layer has a new `get_payload_by_block_hash` method that reconstructs a payload using the EE's `eth_getBlockByHash` call (a rough sketch follows this list).
- I've tested manually that it works on Kiln, using Geth and Nethermind.
- This isn't necessarily the most efficient method, and new engine APIs are being discussed to improve this: https://github.com/ethereum/execution-apis/pull/146.
- We're depending on the `ethers` master branch, due to lots of recent changes. We're also using a workaround for https://github.com/gakonst/ethers-rs/issues/1134.
- Payload reconstruction is used in the HTTP API via `BeaconChain::get_block`, which is now `async`. Due to the `async` fn, the `blocking_json` wrapper has been removed.
- Payload reconstruction is used in network RPC to serve blocks-by-{root,range} responses. Here the `async` adjustment is messier, although I think I've managed to come up with a reasonable compromise: the handlers take the `SendOnDrop` by value so that they can drop it on _task completion_ (after the `fn` returns). Still, this is introducing disk reads onto core executor threads, which may have a negative performance impact (thoughts appreciated).
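As a rough sketch of the reconstruction path (a hypothetical helper using plain JSON-RPC via `reqwest`/`serde_json`, not the actual `execution_layer` code): fetch the full execution block with `eth_getBlockByHash`, then map its fields back into an execution payload and attach it to the blinded block.
```
use serde_json::{json, Value};

// Sketch only: fetch the full execution block from the EE over JSON-RPC. The
// real `get_payload_by_block_hash` maps the returned fields (parentHash,
// stateRoot, transactions, ...) back into an `ExecutionPayload` and attaches
// it to the blinded beacon block.
async fn fetch_execution_block(ee_url: &str, block_hash: &str) -> Result<Value, reqwest::Error> {
    let request = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getBlockByHash",
        // `true` requests full transaction objects, not just hashes.
        "params": [block_hash, true],
    });
    let response: Value = reqwest::Client::new()
        .post(ee_url)
        .json(&request)
        .send()
        .await?
        .json()
        .await?;
    Ok(response["result"].clone())
}
```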
## Additional Info
- [x] For performance it would be great to remove the cloning of full blocks when converting them to blinded blocks to write to disk. I'm going to experiment with a `put_block` API that takes the block by value, breaks it into a blinded block and a payload, stores the blinded block, and then re-assembles the full block for the caller.
- [x] We should measure the latency of blocks-by-root and blocks-by-range responses.
- [x] We should add integration tests that stress the payload reconstruction (basic tests done, issue for more extensive tests: https://github.com/sigp/lighthouse/issues/3159)
- [x] We should (manually) test the schema v9 migration from several prior versions, particularly as blocks have changed on disk and some migrations rely on being able to load blocks.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
2022-05-12 00:42:17 +00:00
|
|
|
beacon_state::CloneConfig, BeaconState, BlindedPayload, ChainSpec, Epoch, EthSpec, Hash256,
|
|
|
|
SignedBeaconBlock, Slot,
|
2021-02-15 07:17:52 +00:00
|
|
|
};
|
2020-04-06 00:53:33 +00:00
|
|
|
|
|
|
|
/// The default size of the cache.
|
|
|
|
pub const DEFAULT_SNAPSHOT_CACHE_SIZE: usize = 4;
|
|
|
|
|
2021-12-06 03:41:31 +00:00
|
|
|
/// The minimum block delay to clone the state in the cache instead of removing it.
|
|
|
|
/// This helps keep block processing fast during re-orgs from late blocks.
|
2022-12-13 09:57:26 +00:00
|
|
|
fn minimum_block_delay_for_clone(seconds_per_slot: u64) -> Duration {
|
|
|
|
// If the block arrived at the attestation deadline or later, it might get re-orged.
|
|
|
|
Duration::from_secs(seconds_per_slot) / 3
|
|
|
|
}
|
2021-12-06 03:41:31 +00:00
|
|
|
|
2021-02-15 07:17:52 +00:00
|
|
|
/// This snapshot is to be used for verifying a child of `self.beacon_block`.
|
Handle early blocks (#2155)
## Issue Addressed
NA
## Problem this PR addresses
There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:
1. Gossip block 0xabc arrives ~200ms early
- It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
- However, it is not imported to our database since the block is early.
2. Attestations for 0xabc arrive, but the block was not imported.
- The peer that sent the attestation is down-voted.
- Each unknown-block attestation causes a score loss of 1, and the peer is banned at -100.
- When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc).
## Potential solutions
I can think of three solutions to this:
1. Wait for attestation-queuing (#635) to arrive and solve this.
- Easy
- Not immediate fix.
- Whilst this would work, I don't think it's a perfect solution for this particular issue, rather (3) is better.
1. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`.
- Easy
- ~~I have implemented this, for now.~~
1. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot.
- More difficult
- Feels like the best solution, I will try to implement this.
**This PR takes approach (3).**
## Changes included
- Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them (a minimal sketch follows this list).
- Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event.
- In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic).
- Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attns, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester.
- To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect.
- The tests are a little racy, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed *any* failures due to races on my machine or on CI yet.
- To assist with testing I allowed for directly setting the time on the `ManualSlotClock`.
- I gave the `beacon_processor::Worker` a `Toolbox` for two reasons; (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.
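A minimal sketch of the delay-queue idea (placeholder types; assumes tokio-util 0.6, where `DelayQueue` implements `Stream`; note the actual change replaces the `tokio::select!` with an explicit `Stream` implementation):
```
use futures::StreamExt;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio_util::time::DelayQueue;

// Placeholder for a gossip-verified block waiting for its slot to start.
struct EarlyBlock {
    block_root: [u8; 32],
}

// Park early blocks until their slot begins, then re-emit them for import.
async fn delay_queue_task(mut early_blocks: mpsc::Receiver<(EarlyBlock, Duration)>) {
    let mut queue: DelayQueue<EarlyBlock> = DelayQueue::new();
    loop {
        tokio::select! {
            // A block arrived early: hold it for `delay` (time until slot start).
            Some((block, delay)) = early_blocks.recv() => {
                queue.insert(block, delay);
            }
            // A held block is now due: hand it back to the normal import path
            // (e.g. as a `DelayedImportBlock` work event).
            Some(Ok(expired)) = queue.next(), if !queue.is_empty() => {
                let block = expired.into_inner();
                let _ = block.block_root;
            }
            else => break,
        }
    }
}
```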
2021-02-24 03:08:52 +00:00
|
|
|
#[derive(Debug)]
|
2021-02-15 07:17:52 +00:00
|
|
|
pub struct PreProcessingSnapshot<T: EthSpec> {
|
|
|
|
/// This state is equivalent to the `self.beacon_block.state_root()` state that has been
|
|
|
|
/// advanced forward one slot using `per_slot_processing`. This state is "primed and ready" for
|
|
|
|
/// the application of another block.
|
|
|
|
pub pre_state: BeaconState<T>,
|
2021-03-17 05:09:57 +00:00
|
|
|
/// This value is only set to `Some` if the `pre_state` was *not* advanced forward.
|
|
|
|
pub beacon_state_root: Option<Hash256>,
|
Separate execution payloads in the DB (#3157)
2022-05-12 00:42:17 +00:00
|
|
|
pub beacon_block: SignedBeaconBlock<T, BlindedPayload<T>>,
|
2021-02-15 07:17:52 +00:00
|
|
|
pub beacon_block_root: Hash256,
|
|
|
|
}
|
|
|
|
|
|
|
|
impl<T: EthSpec> From<BeaconSnapshot<T>> for PreProcessingSnapshot<T> {
|
|
|
|
fn from(snapshot: BeaconSnapshot<T>) -> Self {
|
2021-03-17 05:09:57 +00:00
|
|
|
let beacon_state_root = Some(snapshot.beacon_state_root());
|
2021-02-15 07:17:52 +00:00
|
|
|
Self {
|
|
|
|
pre_state: snapshot.beacon_state,
|
2021-03-17 05:09:57 +00:00
|
|
|
beacon_state_root,
|
Use async code when interacting with EL (#3244)
2022-07-03 05:36:50 +00:00
|
|
|
beacon_block: snapshot.beacon_block.clone_as_blinded(),
|
2021-02-15 07:17:52 +00:00
|
|
|
beacon_block_root: snapshot.beacon_block_root,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
impl<T: EthSpec> CacheItem<T> {
|
|
|
|
pub fn new_without_pre_state(snapshot: BeaconSnapshot<T>) -> Self {
|
|
|
|
Self {
|
|
|
|
beacon_block: snapshot.beacon_block,
|
|
|
|
beacon_block_root: snapshot.beacon_block_root,
|
|
|
|
beacon_state: snapshot.beacon_state,
|
|
|
|
pre_state: None,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fn clone_to_snapshot_with(&self, clone_config: CloneConfig) -> BeaconSnapshot<T> {
|
|
|
|
BeaconSnapshot {
|
|
|
|
beacon_state: self.beacon_state.clone_with(clone_config),
|
|
|
|
beacon_block: self.beacon_block.clone(),
|
|
|
|
beacon_block_root: self.beacon_block_root,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pub fn into_pre_state(self) -> PreProcessingSnapshot<T> {
|
2021-03-17 05:09:57 +00:00
|
|
|
// Do not include the beacon state root if the state has been advanced.
|
|
|
|
let beacon_state_root =
|
|
|
|
Some(self.beacon_block.state_root()).filter(|_| self.pre_state.is_none());
|
|
|
|
|
2021-02-15 07:17:52 +00:00
|
|
|
PreProcessingSnapshot {
|
Use async code when interacting with EL (#3244)
2022-07-03 05:36:50 +00:00
|
|
|
beacon_block: self.beacon_block.clone_as_blinded(),
|
2021-02-15 07:17:52 +00:00
|
|
|
beacon_block_root: self.beacon_block_root,
|
|
|
|
pre_state: self.pre_state.unwrap_or(self.beacon_state),
|
2021-03-17 05:09:57 +00:00
|
|
|
beacon_state_root,
|
2021-02-15 07:17:52 +00:00
|
|
|
}
|
|
|
|
}
|
2021-12-06 03:41:31 +00:00
|
|
|
|
|
|
|
pub fn clone_as_pre_state(&self) -> PreProcessingSnapshot<T> {
|
|
|
|
// Do not include the beacon state root if the state has been advanced.
|
|
|
|
let beacon_state_root =
|
|
|
|
Some(self.beacon_block.state_root()).filter(|_| self.pre_state.is_none());
|
|
|
|
|
|
|
|
PreProcessingSnapshot {
|
Use async code when interacting with EL (#3244)
2022-07-03 05:36:50 +00:00
|
|
|
beacon_block: self.beacon_block.clone_as_blinded(),
|
2021-12-06 03:41:31 +00:00
|
|
|
beacon_block_root: self.beacon_block_root,
|
|
|
|
pre_state: self
|
|
|
|
.pre_state
|
|
|
|
.as_ref()
|
|
|
|
.map_or_else(|| self.beacon_state.clone(), |pre_state| pre_state.clone()),
|
|
|
|
beacon_state_root,
|
|
|
|
}
|
|
|
|
}
|
2021-02-15 07:17:52 +00:00
|
|
|
}
|
|
|
|
|
2021-03-04 04:43:31 +00:00
|
|
|
/// The information required for block production.
|
|
|
|
pub struct BlockProductionPreState<T: EthSpec> {
|
|
|
|
/// This state may or may not have been advanced forward a single slot.
|
|
|
|
///
|
|
|
|
/// See the documentation in the `crate::state_advance_timer` module for more information.
|
|
|
|
pub pre_state: BeaconState<T>,
|
|
|
|
/// This value will only be `Some` if `self.pre_state` was **not** advanced forward a single
|
|
|
|
/// slot.
|
|
|
|
///
|
|
|
|
/// This value can be used to avoid tree-hashing the state during the first call to
|
|
|
|
/// `per_slot_processing`.
|
|
|
|
pub state_root: Option<Hash256>,
|
2021-02-15 07:17:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
pub enum StateAdvance<T: EthSpec> {
|
|
|
|
/// The cache does not contain the supplied block root.
|
|
|
|
BlockNotFound,
|
|
|
|
/// The cache contains the supplied block root but the state has already been advanced.
|
|
|
|
AlreadyAdvanced,
|
|
|
|
/// The cache contains the supplied block root and the state has not yet been advanced.
|
|
|
|
State {
|
|
|
|
state: Box<BeaconState<T>>,
|
|
|
|
state_root: Hash256,
|
|
|
|
block_slot: Slot,
|
|
|
|
},
|
|
|
|
}
|
|
|
|
|
|
|
|
/// The item stored in the `SnapshotCache`.
|
|
|
|
pub struct CacheItem<T: EthSpec> {
|
Use async code when interacting with EL (#3244)
2022-07-03 05:36:50 +00:00
|
|
|
beacon_block: Arc<SignedBeaconBlock<T>>,
|
2021-02-15 07:17:52 +00:00
|
|
|
beacon_block_root: Hash256,
|
|
|
|
/// This state is equivalent to `self.beacon_block.state_root()`.
|
|
|
|
beacon_state: BeaconState<T>,
|
|
|
|
/// This state is equivalent to `self.beacon_state` that has had `per_slot_processing` applied
|
|
|
|
/// to it. This state assists in optimizing block processing.
|
|
|
|
pre_state: Option<BeaconState<T>>,
|
|
|
|
}
|
|
|
|
|
2021-03-04 04:43:31 +00:00
|
|
|
impl<T: EthSpec> Into<BeaconSnapshot<T>> for CacheItem<T> {
|
|
|
|
fn into(self) -> BeaconSnapshot<T> {
|
|
|
|
BeaconSnapshot {
|
|
|
|
beacon_state: self.beacon_state,
|
|
|
|
beacon_block: self.beacon_block,
|
|
|
|
beacon_block_root: self.beacon_block_root,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-04-06 00:53:33 +00:00
|
|
|
/// Provides a cache of `BeaconSnapshot` that is intended primarily for block processing.
|
|
|
|
///
|
|
|
|
/// ## Cache Queuing
|
|
|
|
///
|
|
|
|
/// The cache has a non-standard queue mechanism (specifically, it is not LRU).
|
|
|
|
///
|
|
|
|
/// The cache has a max number of elements (`max_len`). Until `max_len` is achieved, all snapshots
|
|
|
|
/// are simply added to the queue. Once `max_len` is achieved, adding a new snapshot will cause an
|
|
|
|
/// existing snapshot to be ejected. The ejected snapshot will:
|
|
|
|
///
|
|
|
|
/// - Never be the `head_block_root`.
|
|
|
|
/// - Be the snapshot with the lowest `state.slot` (ties broken arbitrarily).
|
|
|
|
pub struct SnapshotCache<T: EthSpec> {
|
|
|
|
max_len: usize,
|
|
|
|
head_block_root: Hash256,
|
2021-02-15 07:17:52 +00:00
|
|
|
snapshots: Vec<CacheItem<T>>,
|
2020-04-06 00:53:33 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
impl<T: EthSpec> SnapshotCache<T> {
|
|
|
|
/// Instantiate a new cache which contains the `head` snapshot.
|
|
|
|
///
|
|
|
|
/// Setting `max_len = 0` is equivalent to setting `max_len = 1`.
|
|
|
|
pub fn new(max_len: usize, head: BeaconSnapshot<T>) -> Self {
|
|
|
|
Self {
|
|
|
|
max_len: cmp::max(max_len, 1),
|
|
|
|
head_block_root: head.beacon_block_root,
|
2021-02-15 07:17:52 +00:00
|
|
|
snapshots: vec![CacheItem::new_without_pre_state(head)],
|
2020-04-06 00:53:33 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-01-14 07:20:55 +00:00
|
|
|
/// The block roots of all snapshots contained in `self`.
|
|
|
|
pub fn beacon_block_roots(&self) -> Vec<Hash256> {
|
|
|
|
self.snapshots.iter().map(|s| s.beacon_block_root).collect()
|
|
|
|
}
|
|
|
|
|
|
|
|
/// The number of snapshots contained in `self`.
|
|
|
|
pub fn len(&self) -> usize {
|
|
|
|
self.snapshots.len()
|
|
|
|
}
|
|
|
|
|
2020-04-06 00:53:33 +00:00
|
|
|
/// Insert a snapshot, potentially removing an existing snapshot if `self` is at capacity (see
|
|
|
|
/// struct-level documentation for more info).
|
2022-01-14 07:20:55 +00:00
|
|
|
pub fn insert(
|
|
|
|
&mut self,
|
|
|
|
snapshot: BeaconSnapshot<T>,
|
|
|
|
pre_state: Option<BeaconState<T>>,
|
|
|
|
spec: &ChainSpec,
|
|
|
|
) {
|
|
|
|
let parent_root = snapshot.beacon_block.message().parent_root();
|
2021-02-15 07:17:52 +00:00
|
|
|
let item = CacheItem {
|
Use async code when interacting with EL (#3244)
2022-07-03 05:36:50 +00:00

            beacon_block: snapshot.beacon_block.clone(),
            beacon_block_root: snapshot.beacon_block_root,
            beacon_state: snapshot.beacon_state,
            pre_state,
        };

        // Remove the grandparent of the block that was just inserted.
        //
        // Assuming it's unlikely to see re-orgs deeper than one block, this method helps keep the
        // cache small by removing any states that already have more than one descendant.
        //
        // Remove the grandparent first to free up room in the cache.
        let grandparent_result =
            process_results(item.beacon_state.rev_iter_block_roots(spec), |iter| {
                iter.map(|(_slot, root)| root)
                    .find(|root| *root != item.beacon_block_root && *root != parent_root)
            });

        if let Ok(Some(grandparent_root)) = grandparent_result {
            let head_block_root = self.head_block_root;
            self.snapshots.retain(|snapshot| {
                let root = snapshot.beacon_block_root;
                root == head_block_root || root != grandparent_root
            });
        }

        if self.snapshots.len() < self.max_len {
            self.snapshots.push(item);
        } else {
            let insert_at = self
                .snapshots
                .iter()
                .enumerate()
                .filter_map(|(i, snapshot)| {
                    if snapshot.beacon_block_root != self.head_block_root {
                        Some((i, snapshot.beacon_state.slot()))
                    } else {
                        None
                    }
                })
                .min_by_key(|(_i, slot)| *slot)
                .map(|(i, _slot)| i);

            if let Some(i) = insert_at {
                self.snapshots[i] = item;
            }
        }
    }

    /// If available, returns a `CacheItem` that should be used for importing/processing a block.
    /// The method will remove the block from `self`, carrying across any caches that may or may not
    /// be built.
    ///
    /// In the event the block being processed was observed late, clone the cache instead of
    /// moving it. This allows us to process the next block quickly in the case of a re-org.
    /// Additionally, if the slot was skipped, clone the cache. This ensures blocks that are
    /// later than 1 slot still have access to the cache and can be processed quickly.
    pub fn get_state_for_block_processing(
        &mut self,
        block_root: Hash256,
        block_slot: Slot,
        block_delay: Option<Duration>,
        spec: &ChainSpec,
    ) -> Option<(PreProcessingSnapshot<T>, bool)> {
        self.snapshots
            .iter()
            .position(|snapshot| snapshot.beacon_block_root == block_root)
            .map(|i| {
                if let Some(cache) = self.snapshots.get(i) {
                    // Avoid cloning the block during sync (when the `block_delay` is `None`).
                    if let Some(delay) = block_delay {
                        if delay >= minimum_block_delay_for_clone(spec.seconds_per_slot)
                            && delay <= Duration::from_secs(spec.seconds_per_slot) * 4
                            || block_slot > cache.beacon_block.slot() + 1
                        {
                            return (cache.clone_as_pre_state(), true);
                        }
                    }
                }
                (self.snapshots.remove(i).into_pre_state(), false)
            })
    }

    /// If available, obtains a clone of a `BeaconState` that should be used for block production.
    /// The clone will use `CloneConfig::all()`, ensuring any tree-hash cache is cloned too.
    ///
    /// ## Note
    ///
    /// This method clones the `BeaconState` (instead of removing it) since we assume that any block
    /// we produce will soon be pushed to the `BeaconChain` for importing/processing. Keeping a copy
    /// of that `BeaconState` in `self` will greatly help with import times.
    pub fn get_state_for_block_production(
        &self,
        block_root: Hash256,
    ) -> Option<BlockProductionPreState<T>> {
        self.snapshots
            .iter()
            .find(|snapshot| snapshot.beacon_block_root == block_root)
            .map(|snapshot| {
                if let Some(pre_state) = &snapshot.pre_state {
                    BlockProductionPreState {
                        pre_state: pre_state.clone_with(CloneConfig::all()),
                        state_root: None,
                    }
                } else {
                    BlockProductionPreState {
                        pre_state: snapshot.beacon_state.clone_with(CloneConfig::all()),
                        state_root: Some(snapshot.beacon_block.state_root()),
                    }
                }
            })
    }

Optimise tree hash caching for block production (#2106)
## Proposed Changes
`@potuz` on the Eth R&D Discord observed that Lighthouse blocks on Pyrmont were always arriving at other nodes after at least 1 second. Part of this could be due to processing and slow propagation, but metrics also revealed that the Lighthouse nodes were usually taking 400-600ms to even just produce a block before broadcasting it.
I tracked the slowness down to the lack of a pre-built tree hash cache (THC) on the states being used for block production. This was due to using the head state for block production, which lacks a THC in order to keep fork choice fast (cloning a THC takes at least 30ms for 100k validators). This PR modifies block production to clone a state from the snapshot cache rather than the head, which speeds things up by 200-400ms by avoiding the tree hash cache rebuild. In practice this seems to have cut block production time down to 300ms or less. Ideally we could _remove_ the snapshot from the cache (and save the 30ms), but it is required for when we re-process the block after signing it with the validator client.
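As a toy-level sketch of that idea (stand-in types like `ToySnapshotCache` and `ToyState` only; the real logic is the `get_state_for_block_production` method later in this file): block production asks a cache keyed by block root for a clone of a state that still carries its tree-hash cache, instead of taking the cache-less head state and rebuilding the tree-hash cache from scratch.
```
use std::collections::HashMap;

// Toy stand-ins; not Lighthouse types.
#[derive(Clone)]
struct TreeHashCache; // expensive to build from scratch, cheaper to clone

#[derive(Clone)]
struct ToyState {
    // The head state deliberately keeps this as `None` to keep fork choice fast;
    // snapshot-cache states keep it populated.
    tree_hash_cache: Option<TreeHashCache>,
}

struct ToySnapshotCache {
    states: HashMap<u64, ToyState>, // keyed by a stand-in block root
}

impl ToySnapshotCache {
    /// Block production path: clone a state that already carries its tree-hash
    /// cache, avoiding the rebuild that made block production slow.
    fn state_for_block_production(&self, block_root: u64) -> Option<ToyState> {
        self.states.get(&block_root).cloned()
    }
}
```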
## Alternatives
I experimented with 2 alternatives to this approach, before deciding on it:
* Alternative 1: ensure the `head` has a tree hash cache. This is too slow, as it imposes a +30ms hit on fork choice, which currently takes ~5ms (with occasional spikes).
* Alternative 2: use `Arc<BeaconSnapshot>` in the snapshot cache and share snapshots between the cache and the `head`. This made fork choice blazing fast (1ms), and block production the same as in this PR, but had a negative impact on block processing which I don't think is worth it. It ended up being necessary to clone the full state from the snapshot cache during block processing, imposing the +30ms penalty there _as well_ as in block production.
In contrast, the approach in this PR should only impact block production, and it improves it! Yay for Pareto improvements :tada:
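To illustrate the ownership problem with Alternative 2 (toy types such as `ToySnapshot` and `take_or_clone`, not the real snapshot code): once the cache and the head share a snapshot behind an `Arc`, block processing can no longer take the state by value and has to fall back to a full clone.
```
use std::sync::Arc;

#[derive(Clone)]
struct ToySnapshot {
    state: Vec<u8>, // stand-in for a large `BeaconState`
}

/// Take ownership if we hold the only reference; otherwise pay for a full clone
/// (which is where the extra ~30ms would come from).
fn take_or_clone(shared: Arc<ToySnapshot>) -> ToySnapshot {
    Arc::try_unwrap(shared).unwrap_or_else(|still_shared| (*still_shared).clone())
}
```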
## Additional Info
This commit (ac59dfa) is currently running on all the Lighthouse Pyrmont nodes, and I've added a dashboard to the Pyrmont grafana instance with the metrics.
In future work we should optimise the attestation packing, which consumes around 30-60ms and is now a substantial contributor to the total.
2020-12-21 06:29:39 +00:00

    /// If there is a snapshot with `block_root`, clone it and return the clone.
    pub fn get_cloned(
        &self,
        block_root: Hash256,
        clone_config: CloneConfig,
    ) -> Option<BeaconSnapshot<T>> {
        self.snapshots
            .iter()
            .find(|snapshot| snapshot.beacon_block_root == block_root)
            .map(|snapshot| snapshot.clone_to_snapshot_with(clone_config))
    }

    /// If a snapshot exists for `block_root` and has not already been advanced, swap its
    /// `beacon_state` out for a cheap clone (committee caches only) and return the original
    /// state so it can be advanced through the next slot.
    pub fn get_for_state_advance(&mut self, block_root: Hash256) -> StateAdvance<T> {
        if let Some(snapshot) = self
            .snapshots
            .iter_mut()
            .find(|snapshot| snapshot.beacon_block_root == block_root)
        {
            if snapshot.pre_state.is_some() {
                StateAdvance::AlreadyAdvanced
            } else {
                let cloned = snapshot
                    .beacon_state
                    .clone_with(CloneConfig::committee_caches_only());

                StateAdvance::State {
                    state: Box::new(std::mem::replace(&mut snapshot.beacon_state, cloned)),
                    state_root: snapshot.beacon_block.state_root(),
                    block_slot: snapshot.beacon_block.slot(),
                }
            }
        } else {
            StateAdvance::BlockNotFound
        }
    }

    /// Sets the `pre_state` (an advanced state) for the snapshot matching `block_root`.
    ///
    /// Returns `None` if there is no snapshot with a matching block root.
    pub fn update_pre_state(&mut self, block_root: Hash256, state: BeaconState<T>) -> Option<()> {
        self.snapshots
            .iter_mut()
            .find(|snapshot| snapshot.beacon_block_root == block_root)
            .map(|snapshot| {
                snapshot.pre_state = Some(state);
            })
    }

    /// Removes all snapshots from the queue whose slots are at or before the start slot of the
    /// finalized epoch.
    pub fn prune(&mut self, finalized_epoch: Epoch) {
        self.snapshots.retain(|snapshot| {
            snapshot.beacon_state.slot() > finalized_epoch.start_slot(T::slots_per_epoch())
        })
    }

    /// Inform the cache that the head of the beacon chain has changed.
    ///
    /// The snapshot that matches this `head_block_root` will never be ejected from the cache
    /// during `Self::insert`.
    pub fn update_head(&mut self, head_block_root: Hash256) {
        self.head_block_root = head_block_root
    }
}

#[cfg(test)]
mod test {
    use super::*;
    use crate::test_utils::{BeaconChainHarness, EphemeralHarnessType};
    use types::{
        test_utils::generate_deterministic_keypair, BeaconBlock, Epoch, MainnetEthSpec,
        SignedBeaconBlock, Slot,
    };

    fn get_harness() -> BeaconChainHarness<EphemeralHarnessType<MainnetEthSpec>> {
        let harness = BeaconChainHarness::builder(MainnetEthSpec)
            .default_spec()
            .deterministic_keypairs(1)
            .fresh_ephemeral_store()
            .build();

        harness.advance_slot();

        harness
    }

    const CACHE_SIZE: usize = 4;

    fn get_snapshot(i: u64) -> BeaconSnapshot<MainnetEthSpec> {
        let spec = MainnetEthSpec::default_spec();

        let beacon_state = get_harness().chain.head_beacon_state_cloned();

        let signed_beacon_block = SignedBeaconBlock::from_block(
            BeaconBlock::empty(&spec),
            generate_deterministic_keypair(0)
                .sk
                .sign(Hash256::from_low_u64_be(42)),
        );

        BeaconSnapshot {
            beacon_state,
            beacon_block: Arc::new(signed_beacon_block),
            beacon_block_root: Hash256::from_low_u64_be(i),
        }
    }

    #[test]
    fn insert_get_prune_update() {
        let spec = MainnetEthSpec::default_spec();
        let mut cache = SnapshotCache::new(CACHE_SIZE, get_snapshot(0));

        // Insert a bunch of entries in the cache. It should look like this:
        //
        // Index    Root
        // 0        0     <--head
        // 1        1
        // 2        2
        // 3        3
        for i in 1..CACHE_SIZE as u64 {
            let mut snapshot = get_snapshot(i);

            // Each snapshot should be one slot into an epoch, with each snapshot one epoch apart.
            *snapshot.beacon_state.slot_mut() =
                Slot::from(i * MainnetEthSpec::slots_per_epoch() + 1);

            cache.insert(snapshot, None, &spec);

            assert_eq!(
                cache.snapshots.len(),
                i as usize + 1,
                "cache length should be as expected"
            );
            assert_eq!(cache.head_block_root, Hash256::from_low_u64_be(0));
        }

        // Insert a new value in the cache. Afterwards it should look like:
        //
        // Index    Root
        // 0        0     <--head
        // 1        42
        // 2        2
        // 3        3
        assert_eq!(cache.snapshots.len(), CACHE_SIZE);
        cache.insert(get_snapshot(42), None, &spec);
        assert_eq!(cache.snapshots.len(), CACHE_SIZE);

        assert!(
            cache
                .get_state_for_block_processing(
                    Hash256::from_low_u64_be(1),
                    Slot::new(0),
                    None,
                    &spec
                )
                .is_none(),
            "the snapshot with the lowest slot should have been removed during the insert function"
        );

        assert!(cache
            .get_cloned(Hash256::from_low_u64_be(1), CloneConfig::none())
            .is_none());

        assert_eq!(
            cache
                .get_cloned(Hash256::from_low_u64_be(0), CloneConfig::none())
                .expect("the head should still be in the cache")
                .beacon_block_root,
            Hash256::from_low_u64_be(0),
            "get_cloned should get the correct snapshot"
        );
        assert_eq!(
            cache
                .get_state_for_block_processing(
                    Hash256::from_low_u64_be(0),
                    Slot::new(0),
                    None,
                    &spec
                )
                .expect("the head should still be in the cache")
                .0
                .beacon_block_root,
            Hash256::from_low_u64_be(0),
            "get_state_for_block_processing should get the correct snapshot"
        );

        assert_eq!(
            cache.snapshots.len(),
            CACHE_SIZE - 1,
            "get_state_for_block_processing should shorten the cache"
        );

        // Prune the cache. Afterwards it should look like:
        //
        // Index    Root
        // 0        2
        // 1        3
        cache.prune(Epoch::new(2));

        assert_eq!(cache.snapshots.len(), 2);

        cache.update_head(Hash256::from_low_u64_be(2));

        // Over-fill the cache so it needs to eject some old values on insert.
        for i in 0..CACHE_SIZE as u64 {
            cache.insert(get_snapshot(u64::max_value() - i), None, &spec);
        }

        // Ensure that the new head value was not removed from the cache.
        assert_eq!(
            cache
                .get_state_for_block_processing(
                    Hash256::from_low_u64_be(2),
                    Slot::new(0),
                    None,
                    &spec
                )
                .expect("the new head should still be in the cache")
                .0
                .beacon_block_root,
            Hash256::from_low_u64_be(2),
            "get_state_for_block_processing should get the correct snapshot"
        );
    }
}