25f0e261cb
14 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Paul Hauner
|
25f0e261cb |
Don't return errors when fork choice fails (#3370)
## Issue Addressed NA ## Proposed Changes There are scenarios where the only viable head will have an invalid execution payload, in this scenario the `get_head` function on `proto_array` will return an error. We must recover from this scenario by importing blocks from the network. This PR stops `BeaconChain::recompute_head` from returning an error so that we can't accidentally start down-scoring peers or aborting block import just because the current head has an invalid payload. ## Reviewer Notes The following changes are included: 1. Allow `fork_choice.get_head` to fail gracefully in `BeaconChain::process_block` when trying to update the `early_attester_cache`; simply don't add the block to the cache rather than aborting the entire process. 1. Don't return an error from `BeaconChain::recompute_head_at_current_slot` and `BeaconChain::recompute_head` to defensively prevent calling functions from aborting any process just because the fork choice function failed to run. - This should have practically no effect, since most callers were still continuing if recomputing the head failed. - The outlier is that the API will return 200 rather than a 500 when fork choice fails. 1. Add the `ProtoArrayForkChoice::set_all_blocks_to_optimistic` function to recover from the scenario where we've rebooted and the persisted fork choice has an invalid head. |
||
Paul Hauner
|
be4e261e74 |
Use async code when interacting with EL (#3244)
## Overview This rather extensive PR achieves two primary goals: 1. Uses the finalized/justified checkpoints of fork choice (FC), rather than that of the head state. 2. Refactors fork choice, block production and block processing to `async` functions. Additionally, it achieves: - Concurrent forkchoice updates to the EL and cache pruning after a new head is selected. - Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production. - Concurrent per-block-processing and execution payload verification during block processing. - The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?): - I had to do this to deal with sending blocks into spawned tasks. - Previously we were cloning the beacon block at least 2 times during each block processing, these clones are either removed or turned into cheaper `Arc` clones. - We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap. - Avoids cloning *all the blocks* in *every chain segment* during sync. - It also has the potential to clean up our code where we need to pass an *owned* block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough 😅) - The `BeaconChain::HeadSafetyStatus` struct was removed. It was an old relic from prior merge specs. For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273 ## Changes to `canonical_head` and `fork_choice` Previously, the `BeaconChain` had two separate fields: ``` canonical_head: RwLock<Snapshot>, fork_choice: RwLock<BeaconForkChoice> ``` Now, we have grouped these values under a single struct: ``` canonical_head: CanonicalHead { cached_head: RwLock<Arc<Snapshot>>, fork_choice: RwLock<BeaconForkChoice> } ``` Apart from ergonomics, the only *actual* change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously. ## Breaking Changes ### The `state` (root) field in the `finalized_checkpoint` SSE event Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might in the `finalized_checkpoint` SSE event: 1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`. 4. The state root at slot of `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots. Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event]( |
||
Michael Sproul
|
229f883968 |
Avoid parallel fork choice runs during sync (#3217)
## Issue Addressed Fixes an issue that @paulhauner found with the v2.3.0 release candidate whereby the fork choice runs introduced by #3168 tripped over each other during sync: ``` May 24 23:06:40.542 WARN Error signalling fork choice waiter slot: 3884129, error: ForkChoiceSignalOutOfOrder { current: Slot(3884131), latest: Slot(3884129) }, service: beacon ``` This can occur because fork choice is called from the state advance _and_ the per-slot task. When one of these runs takes a long time it can end up finishing after a run from a later slot, tripping the error above. The problem is resolved by not running either of these fork choice calls during sync. Additionally, these parallel fork choice runs were causing issues in the database: ``` May 24 07:49:05.098 WARN Found a chain that should already have been pruned, head_slot: 92925, head_block_root: 0xa76c7bf1b98e54ed4b0d8686efcfdf853484e6c2a4c67e91cbf19e5ad1f96b17, service: beacon May 24 07:49:05.101 WARN Database migration failed error: HotColdDBError(FreezeSlotError { current_split_slot: Slot(92608), proposed_split_slot: Slot(92576) }), service: beacon ``` In this case, two fork choice calls triggering the finalization processing were being processed out of order due to differences in their processing time, causing the background migrator to try to advance finalization _backwards_ 😳. Removing the parallel fork choice runs from sync effectively addresses the issue, because these runs are most likely to have different finalized checkpoints (because of the speed at which fork choice advances during sync). In theory it's still possible to process updates out of order if any other fork choice runs end up completing out of order, but this should be much less common. Fixing out of order fork choice runs in general is difficult as it requires architectural changes like serialising fork choice updates through a single thread, or locking fork choice along with the head when it is mutated (https://github.com/sigp/lighthouse/pull/3175). ## Proposed Changes * Don't run per-slot fork choice during sync (if head is older than 4 slots) * Don't run state-advance fork choice during sync (if head is older than 4 slots) * Check for monotonic finalization updates in the background migrator. This is a good defensive check to have, and I'm not sure why we didn't have it before (we may have had it and wrongly removed it). |
||
Michael Sproul
|
8fa032c8ae |
Run fork choice before block proposal (#3168)
## Issue Addressed Upcoming spec change https://github.com/ethereum/consensus-specs/pull/2878 ## Proposed Changes 1. Run fork choice at the start of every slot, and wait for this run to complete before proposing a block. 2. As an optimisation, also run fork choice 3/4 of the way through the slot (at 9s), _dequeueing attestations for the next slot_. 3. Remove the fork choice run from the state advance timer that occurred before advancing the state. ## Additional Info ### Block Proposal Accuracy This change makes us more likely to propose on top of the correct head in the presence of re-orgs with proposer boost in play. The main scenario that this change is designed to address is described in the linked spec issue. ### Attestation Accuracy This change _also_ makes us more likely to attest to the correct head. Currently in the case of a skipped slot at `slot` we only run fork choice 9s into `slot - 1`. This means the attestations from `slot - 1` aren't taken into consideration, and any boost applied to the block from `slot - 1` is not removed (it should be). In the language of the linked spec issue, this means we are liable to attest to C, even when the majority voting weight has already caused a re-org to B. ### Why remove the call before the state advance? If we've run fork choice at the start of the slot then it has already dequeued all the attestations from the previous slot, which are the only ones eligible to influence the head in the current slot. Running fork choice again is unnecessary (unless we run it for the next slot and try to pre-empt a re-org, but I don't currently think this is a great idea). ### Performance Based on Prater testing this adds about 5-25ms of runtime to block proposal times, which are 500-1000ms on average (and spike to 5s+ sometimes due to state handling issues 😢 ). I believe this is a small enough penalty to enable it by default, with the option to disable it via the new flag `--fork-choice-before-proposal-timeout 0`. Upcoming work on block packing and state representation will also reduce block production times in general, while removing the spikes. ### Implementation Fork choice gets invoked at the start of the slot via the `per_slot_task` function called from the slot timer. It then uses a condition variable to signal to block production that fork choice has been updated. This is a bit funky, but it seems to work. One downside of the timer-based approach is that it doesn't happen automatically in most of the tests. The test added by this PR has to trigger the run manually. |
||
Michael Sproul
|
ae47a93c42 |
Don't panic in forkchoiceUpdated handler (#3165)
## Issue Addressed Fix a panic due to misuse of the Tokio executor when processing a forkchoiceUpdated response. We were previously calling `process_invalid_execution_payload` from the async function `update_execution_engine_forkchoice_async`, which resulted in a panic because `process_invalid_execution_payload` contains a call to fork choice, which ultimately calls `block_on`. An example backtrace can be found here: https://gist.github.com/michaelsproul/ac5da03e203d6ffac672423eaf52fb20 ## Proposed Changes Wrap the call to `process_invalid_execution_payload` in a `spawn_blocking` so that `block_on` is no longer called from an async context. ## Additional Info - I've been thinking about how to catch bugs like this with static analysis (a new Clippy lint). - The payload validation tests have been re-worked to support distinct responses from the mock EE for newPayload and forkchoiceUpdated. Three new tests have been added covering the `Invalid`, `InvalidBlockHash` and `InvalidTerminalBlock` cases. - I think we need a bunch more tests of different legal and illegal variations |
||
realbigsean
|
a80ccc3a33 |
1.57.0 lints (#2850)
## Issue Addressed New rust lints ## Proposed Changes - Boxing some enum variants - removing some unused fields (is the validator lockfile unused? seemed so to me) ## Additional Info - some error fields were marked as dead code but are logged out in areas - left some dead fields in our ef test code because I assume they are useful for debugging? Co-authored-by: realbigsean <seananderson33@gmail.com> |
||
Paul Hauner
|
8efd9fc324 |
Add AttesterCache for attestation production (#2478)
## Issue Addressed - Resolves #2169 ## Proposed Changes Adds the `AttesterCache` to allow validators to produce attestations for older slots. Presently, some arbitrary restrictions can force validators to receive an error when attesting to a slot earlier than the present one. This can cause attestation misses when there is excessive load on the validator client or time sync issues between the VC and BN. ## Additional Info NA |
||
Paul Hauner
|
6e3ca48cb9 |
Cache participating indices for Altair epoch processing (#2416)
## Issue Addressed NA ## Proposed Changes This PR addresses two things: 1. Allows the `ValidatorMonitor` to work with Altair states. 1. Optimizes `altair::process_epoch` (see [code](https://github.com/paulhauner/lighthouse/blob/participation-cache/consensus/state_processing/src/per_epoch_processing/altair/participation_cache.rs) for description) ## Breaking Changes The breaking changes in this PR revolve around one premise: *After the Altair fork, it's not longer possible (given only a `BeaconState`) to identify if a validator had *any* attestation included during some epoch. The best we can do is see if that validator made the "timely" source/target/head flags.* Whilst this seems annoying, it's not actually too bad. Finalization is based upon "timely target" attestations, so that's really the most important thing. Although there's *some* value in knowing if a validator had *any* attestation included, it's far more important to know about "timely target" participation, since this is what affects finality and justification. For simplicity and consistency, I've also removed the ability to determine if *any* attestation was included from metrics and API endpoints. Now, all Altair and non-Altair states will simply report on the head/target attestations. The following section details where we've removed fields and provides replacement values. ### Breaking Changes: Prometheus Metrics Some participation metrics have been removed and replaced. Some were removed since they are no longer relevant to Altair (e.g., total attesting balance) and others replaced with gwei values instead of pre-computed values. This provides more flexibility at display-time (e.g., Grafana). The following metrics were added as replacements: - `beacon_participation_prev_epoch_head_attesting_gwei_total` - `beacon_participation_prev_epoch_target_attesting_gwei_total` - `beacon_participation_prev_epoch_source_attesting_gwei_total` - `beacon_participation_prev_epoch_active_gwei_total` The following metrics were removed: - `beacon_participation_prev_epoch_attester` - instead use `beacon_participation_prev_epoch_source_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`. - `beacon_participation_prev_epoch_target_attester` - instead use `beacon_participation_prev_epoch_target_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`. - `beacon_participation_prev_epoch_head_attester` - instead use `beacon_participation_prev_epoch_head_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`. The `beacon_participation_prev_epoch_attester` endpoint has been removed. Users should instead use the pre-existing `beacon_participation_prev_epoch_target_attester`. ### Breaking Changes: HTTP API The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint loses the following fields: - `current_epoch_attesting_gwei` (use `current_epoch_target_attesting_gwei` instead) - `previous_epoch_attesting_gwei` (use `previous_epoch_target_attesting_gwei` instead) The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint lose the following fields: - `is_current_epoch_attester` (use `is_current_epoch_target_attester` instead) - `is_previous_epoch_attester` (use `is_previous_epoch_target_attester` instead) - `is_active_in_current_epoch` becomes `is_active_unslashed_in_current_epoch`. - `is_active_in_previous_epoch` becomes `is_active_unslashed_in_previous_epoch`. ## Additional Info NA ## TODO - [x] Deal with total balances - [x] Update validator_inclusion API - [ ] Ensure `beacon_participation_prev_epoch_target_attester` and `beacon_participation_prev_epoch_head_attester` work before Altair Co-authored-by: realbigsean <seananderson33@gmail.com> |
||
Michael Sproul
|
b4689e20c6 |
Altair consensus changes and refactors (#2279)
## Proposed Changes Implement the consensus changes necessary for the upcoming Altair hard fork. ## Additional Info This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`. Co-authored-by: realbigsean <seananderson33@gmail.com> |
||
realbigsean
|
b84ff9f793 |
rust 1.53.0 updates (#2411)
## Issue Addressed `make lint` failing on rust 1.53.0. ## Proposed Changes 1.53.0 updates ## Additional Info I haven't figure out why yet, we were now hitting the recursion limit in a few crates. So I had to add `#![recursion_limit = "256"]` in a few places Co-authored-by: realbigsean <seananderson33@gmail.com> Co-authored-by: Michael Sproul <michael@sigmaprime.io> |
||
Paul Hauner
|
015ab7d0a7 |
Optimize validator duties (#2243)
## Issue Addressed Closes #2052 ## Proposed Changes - Refactor the attester/proposer duties endpoints in the BN - Performance improvements - Fixes some potential inconsistencies with the dependent root fields. - Removes `http_api::beacon_proposer_cache` and just uses the one on the `BeaconChain` instead. - Move the code for the proposer/attester duties endpoints into separate files, for readability. - Refactor the `DutiesService` in the VC - Required to reduce the delay on broadcasting new blocks. - Gets rid of the `ValidatorDuty` shim struct that came about when we adopted the standard API. - Separate block/attestation duty tasks so that they don't block each other when one is slow. - In the VC, use `PublicKeyBytes` to represent validators instead of `PublicKey`. `PublicKey` is a legit crypto object whilst `PublicKeyBytes` is just a byte-array, it's much faster to clone/hash `PublicKeyBytes` and this change has had a significant impact on runtimes. - Unfortunately this has created lots of dust changes. - In the BN, store `PublicKeyBytes` in the `beacon_proposer_cache` and allow access to them. The HTTP API always sends `PublicKeyBytes` over the wire and the conversion from `PublicKey` -> `PublickeyBytes` is non-trivial, especially when queries have 100s/1000s of validators (like Pyrmont). - Add the `state_processing::state_advance` mod which dedups a lot of the "apply `n` skip slots to the state" code. - This also fixes a bug with some functions which were failing to include a state root as per [this comment]( |
||
Michael Sproul
|
770a2ca030 |
Fix proposer cache priming upon state advance (#2252)
## Proposed Changes While investigating an incorrect head + target vote for the epoch boundary block 708544, I noticed that the state advance failed to prime the proposer cache, as per these logs: ``` Mar 09 21:42:47.448 DEBG Subscribing to subnet target_slot: 708544, subnet: Y, service: attestation_service Mar 09 21:49:08.063 DEBG Advanced head state one slot current_slot: 708543, state_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance Mar 09 21:49:08.063 DEBG Completed state advance initial_slot: 708543, advanced_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance Mar 09 21:49:14.787 DEBG Proposer shuffling cache miss block_slot: 708544, block_root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, parent_slot: 708543, parent_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: beacon Mar 09 21:49:14.800 DEBG Successfully processed gossip block root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, slot: 708544, graffiti: , service: beacon Mar 09 21:49:14.800 INFO New block received hash: 0x9b14…f9fe, slot: 708544 Mar 09 21:49:14.984 DEBG Head beacon block slot: 708544, root: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, justified_epoch: 22141, justified_root: 0x59db…e451, service: beacon Mar 09 21:49:15.055 INFO Unaggregated attestation validator: XXXXX, src: api, slot: 708544, epoch: 22142, delay_ms: 53, index: Y, head: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: val_mon Mar 09 21:49:17.001 DEBG Slot timer sync_state: Synced, current_slot: 708544, head_slot: 708544, head_block: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, peers: 55, service: slot_notifier ``` The reason for this is that the condition was backwards, so that whole block of code was unreachable. Looking at the attestations for the block included in the block after, we can see that lots of validators missed it. Some of them may be Lighthouse v1.1.1-v1.2.0-rc.0, but it's probable that they would have missed even with the proposer cache primed, given how late the block 708544 arrived (the cache miss occurred 3.787s after the slot start): https://beaconcha.in/block/708545#attestations |
||
Paul Hauner
|
4362ea4f98 |
Fix false positive "State advance too slow" logs (#2218)
## Issue Addressed - Resolves #2214 ## Proposed Changes Fix the false positive warning log described in #2214. ## Additional Info NA |
||
Paul Hauner
|
88cc222204 |
Advance state to next slot after importing block (#2174)
## Issue Addressed NA ## Proposed Changes Add an optimization to perform `per_slot_processing` from the *leading-edge* of block processing to the *trailing-edge*. Ultimately, this allows us to import the block at slot `n` faster because we used the tail-end of slot `n - 1` to perform `per_slot_processing`. Additionally, add a "block proposer cache" which allows us to cache the block proposer for some epoch. Since we're now doing trailing-edge `per_slot_processing`, we can prime this cache with the values for the next epoch before those blocks arrive (assuming those blocks don't have some weird forking). There were several ancillary changes required to achieve this: - Remove the `state_root` field of `BeaconSnapshot`, since there's no need to know it on a `pre_state` and in all other cases we can just read it from `block.state_root()`. - This caused some "dust" changes of `snapshot.beacon_state_root` to `snapshot.beacon_state_root()`, where the `BeaconSnapshot::beacon_state_root()` func just reads the state root from the block. - Rename `types::ShuffingId` to `AttestationShufflingId`. I originally did this because I added a `ProposerShufflingId` struct which turned out to be not so useful. I thought this new name was more descriptive so I kept it. - Address https://github.com/ethereum/eth2.0-specs/pull/2196 - Add a debug log when we get a block with an unknown parent. There was previously no logging around this case. - Add a function to `BeaconState` to compute all proposers for an epoch without re-computing the active indices for each slot. ## Additional Info - ~~Blocked on #2173~~ - ~~Blocked on #2179~~ That PR was wrapped into this PR. - There's potentially some places where we could avoid computing the proposer indices in `per_block_processing` but I haven't done this here. These would be an optimization beyond the issue at hand (improving block propagation times) and I think this PR is already doing enough. We can come back for that later. ## TODO - [x] Tidy, improve comments. - [x] ~~Try avoid computing proposer index in `per_block_processing`?~~ |