Improves some of the functionality around single and parent block lookup.
Gives extra information about whether failures for lookups are related to processing or downloading.
This is entirely untested.
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
Resolves#3314
## Proposed Changes
Add a module to encode/decode u256 types according to the execution layer encoding/decoding standards
https://github.com/ethereum/execution-apis/blob/main/src/engine/specification.md#structures
Updates `JsonExecutionPayloadV1.base_fee_per_gas`, `JsonExecutionPayloadHeaderV1.base_fee_per_gas` and `TransitionConfigurationV1.terminal_total_difficulty` to encode/decode according to standards
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
Duplicate of #3269. Making this since @divagant-martian opened the previous PR and she can't approve her own PR 😄
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
N/A
## Proposed Changes
Since Rust 1.62, we can use `#[derive(Default)]` on enums. ✨https://blog.rust-lang.org/2022/06/30/Rust-1.62.0.html#default-enum-variants
There are no changes to functionality in this PR, just replaced the `Default` trait implementation with `#[derive(Default)]`.
## Issue Addressed
NA
## Proposed Changes
After some discussion in Discord with @mkalinin it was raised that it was not the intention of the engine API to have CLs validate the `latest_valid_hash` (LVH) and all ancestors.
Whilst I believe the engine API is being updated such that the LVH *must* identify a valid hash or be set to some junk value, I'm not confident that we can rely upon the LVH as being valid (at least for now) due to the confusion surrounding it.
Being able to validate blocks via the LVH is a relatively minor optimisation; if the LVH value ends up becoming our head we'll send an fcU and get the VALID status there.
Falsely marking a block as valid has serious consequences and since it's a minor optimisation to use LVH I think that we don't take the risk.
For clarity, we will still *invalidate* the *descendants* of the LVH, we just wont *validate* the *ancestors*.
## Additional Info
NA
## Issue Addressed
Resolves#3176
## Proposed Changes
Continues from PRs by @divagant-martian to gradually remove EL redundancy (see #3284, #3257).
This PR achieves:
- Removes the `broadcast` and `first_success` methods. The functional impact is that every request to the EE will always be tried immediately, regardless of the cached `EngineState` (this resolves#3176). Previously we would check the engine state before issuing requests, this doesn't make sense in a single-EE world; there's only one EE so we might as well try it for every request.
- Runs the upcheck/watchdog routine once per slot rather than thrice. When we had multiple EEs frequent polling was useful to try and detect when the primary EE had come back online and we could switch to it. That's not as relevant now.
- Always creates logs in the `Engines::upcheck` function. Previously we would mute some logs since they could get really noisy when one EE was down but others were functioning fine. Now we only have one EE and are upcheck-ing it less, it makes sense to always produce logs.
This PR purposefully does not achieve:
- Updating all occurances of "engines" to "engine". I'm trying to keep the diff small and manageable. We can come back for this.
## Additional Info
NA
## Issue Addressed
Follow up to https://github.com/sigp/lighthouse/pull/3290 that fixes a caching bug
## Proposed Changes
Build the committee cache for the new `POST /lighthouse/analysis/block_rewards` API. Due to an unusual quirk of the total active balance cache the API endpoint would sometimes fail after loading a state from disk which had a current epoch cache _but not_ a total active balance cache. This PR adds calls to build the caches immediately before they're required, and has been running smoothly with `blockdreamer` the last few days.
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/3270
## Proposed Changes
Optimize the calculation of historic beacon committees in the HTTP API.
This is achieved by allowing committee caches to be constructed for historic epochs, and constructing these committee caches on the fly in the API. This is much faster than reconstructing the state at the requested epoch, which usually takes upwards of 20s, and sometimes minutes with SPRP=8192. The depth of the `randao_mixes` array allows us to look back 64K epochs/0.8 years from a single state, which is pretty awesome!
We always use the `state_id` provided by the caller, but will return a nice 400 error if the epoch requested is out of range for the state requested, e.g.
```bash
# Prater
curl "http://localhost:5052/eth/v1/beacon/states/3170304/committees?epoch=33538"
```
```json
{"code":400,"message":"BAD_REQUEST: epoch out of bounds, try state at slot 1081344","stacktraces":[]}
```
Queries will be fastest when aligned to `slot % SPRP == 0`, so the hint suggests a slot that is 0 mod 8192.
## Overview
This rather extensive PR achieves two primary goals:
1. Uses the finalized/justified checkpoints of fork choice (FC), rather than that of the head state.
2. Refactors fork choice, block production and block processing to `async` functions.
Additionally, it achieves:
- Concurrent forkchoice updates to the EL and cache pruning after a new head is selected.
- Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production.
- Concurrent per-block-processing and execution payload verification during block processing.
- The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?):
- I had to do this to deal with sending blocks into spawned tasks.
- Previously we were cloning the beacon block at least 2 times during each block processing, these clones are either removed or turned into cheaper `Arc` clones.
- We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap.
- Avoids cloning *all the blocks* in *every chain segment* during sync.
- It also has the potential to clean up our code where we need to pass an *owned* block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough 😅)
- The `BeaconChain::HeadSafetyStatus` struct was removed. It was an old relic from prior merge specs.
For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273
## Changes to `canonical_head` and `fork_choice`
Previously, the `BeaconChain` had two separate fields:
```
canonical_head: RwLock<Snapshot>,
fork_choice: RwLock<BeaconForkChoice>
```
Now, we have grouped these values under a single struct:
```
canonical_head: CanonicalHead {
cached_head: RwLock<Arc<Snapshot>>,
fork_choice: RwLock<BeaconForkChoice>
}
```
Apart from ergonomics, the only *actual* change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously.
## Breaking Changes
### The `state` (root) field in the `finalized_checkpoint` SSE event
Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might in the `finalized_checkpoint` SSE event:
1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`.
4. The state root at slot of `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots.
Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event](de2b2801c8/data/beaconrestapi/src/main/java/tech/pegasys/teku/beaconrestapi/handlers/v1/events/EventSubscriptionManager.java (L171-L182)) it uses [`getStateRootFromBlockRoot`](de2b2801c8/data/provider/src/main/java/tech/pegasys/teku/api/ChainDataProvider.java (L336-L341)) which uses (1).
I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku.
## Notes for Reviewers
I've renamed `BeaconChain::fork_choice` to `BeaconChain::recompute_head`. Doing this helped ensure I broke all previous uses of fork choice and I also find it more descriptive. It describes an action and can't be confused with trying to get a reference to the `ForkChoice` struct.
I've changed the ordering of SSE events when a block is received. It used to be `[block, finalized, head]` and now it's `[block, head, finalized]`. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking".
I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to `get_head`. Ensuring `get_head` has been run *at least once* means that the cached values doesn't need to wrapped in an `Option`. This was fairly simple, it just involved passing a `slot` to the constructor so it knows *when* it's being run. When loading a fork choice from the store and a slot clock isn't handy I've just used the `slot` that was saved in the `fork_choice_store`. That seems like it would be a faithful representation of the slot when we saved it.
I added the `genesis_time: u64` to the `BeaconChain`. It's small, constant and nice to have around.
Since we're using FC for the fin/just checkpoints, we no longer get the `0x00..00` roots at genesis. You can see I had to remove a work-around in `ef-tests` here: b56be3bc2. I can't find any reason why this would be an issue, if anything I think it'll be better since the genesis-alias has caught us out a few times (0x00..00 isn't actually a real root). Edit: I did find a case where the `network` expected the 0x00..00 alias and patched it here: 3f26ac3e2.
You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests:
- Changing tests to be `tokio::async` tests.
- Adding `.await` to fork choice, block processing and block production functions.
- Refactor of the `canonical_head` "API" provided by the `BeaconChain`. E.g., `chain.canonical_head.cached_head()` instead of `chain.canonical_head.read()`.
- Wrapping `SignedBeaconBlock` in an `Arc`.
- In the `beacon_chain/tests/block_verification`, we can't use the `lazy_static` `CHAIN_SEGMENT` variable anymore since it's generated with an async function. We just generate it in each test, not so efficient but hopefully insignificant.
I had to disable `rayon` concurrent tests in the `fork_choice` tests. This is because the use of `rayon` and `block_on` was causing a panic.
Co-authored-by: Mac L <mjladson@pm.me>
## Issue Addressed
This PR is a subset of the changes in #3134. Unstable will still not function correctly with the new builder spec once this is merged, #3134 should be used on testnets
## Proposed Changes
- Removes redundancy in "builders" (servers implementing the builder spec)
- Renames `payload-builder` flag to `builder`
- Moves from old builder RPC API to new HTTP API, but does not implement the validator registration API (implemented in https://github.com/sigp/lighthouse/pull/3194)
Co-authored-by: sean <seananderson33@gmail.com>
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Issue Addressed
Lays the groundwork for builder API changes by implementing the beacon-API's new `register_validator` endpoint
## Proposed Changes
- Add a routine in the VC that runs on startup (re-try until success), once per epoch or whenever `suggested_fee_recipient` is updated, signing `ValidatorRegistrationData` and sending it to the BN.
- TODO: `gas_limit` config options https://github.com/ethereum/builder-specs/issues/17
- BN only sends VC registration data to builders on demand, but VC registration data *does update* the BN's prepare proposer cache and send an updated fcU to a local EE. This is necessary for fee recipient consistency between the blinded and full block flow in the event of fallback. Having the BN only send registration data to builders on demand gives feedback directly to the VC about relay status. Also, since the BN has no ability to sign these messages anyways (so couldn't refresh them if it wanted), and validator registration is independent of the BN head, I think this approach makes sense.
- Adds upcoming consensus spec changes for this PR https://github.com/ethereum/consensus-specs/pull/2884
- I initially applied the bit mask based on a configured application domain.. but I ended up just hard coding it here instead because that's how it's spec'd in the builder repo.
- Should application mask appear in the api?
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Issue Addressed
Resolves#3069
## Proposed Changes
Unify the `eth1-endpoints` and `execution-endpoints` flags in a backwards compatible way as described in https://github.com/sigp/lighthouse/issues/3069#issuecomment-1134219221
Users have 2 options:
1. Use multiple non auth execution endpoints for deposit processing pre-merge
2. Use a single jwt authenticated execution endpoint for both execution layer and deposit processing post merge
Related https://github.com/sigp/lighthouse/issues/3118
To enable jwt authenticated deposit processing, this PR removes the calls to `net_version` as the `net` namespace is not exposed in the auth server in execution clients.
Moving away from using `networkId` is a good step in my opinion as it doesn't provide us with any added guarantees over `chainId`. See https://github.com/ethereum/consensus-specs/issues/2163 and https://github.com/sigp/lighthouse/issues/2115
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Proposed Changes
Add a new HTTP endpoint `POST /lighthouse/analysis/block_rewards` which takes a vec of `BeaconBlock`s as input and outputs the `BlockReward`s for them.
Augment the `BlockReward` struct with the attestation data for attestations in the block, which simplifies access to this information from blockprint. Using attestation data I've been able to make blockprint up to 95% accurate across Prysm/Lighthouse/Teku/Nimbus. I hope to go even higher using a bunch of synthetic blocks produced for Prysm/Nimbus/Lodestar, which are underrepresented in the current training data.
## Issue Addressed
Fixes#1864 and a bunch of other closed but unresolved issues.
## Proposed Changes
Allows the deposit caching to recover from `NonConsecutive` deposit errors by resetting the last processed block to the last valid deposit's block number. Still not sure of the underlying cause of this error, but this should recover the cache so we don't need `--eth1-purge-cache` anymore 🎉
A huge thanks to @one-three-three-seven for reproducing the error and providing the data that helped testing out the fix 🙌
Still needs a few more tests.
## Proposed Changes
Expand the set of paths tracked by the HTTP API metrics to include all paths hit by the validator client.
These paths were only partially updated for Altair, so we were missing some of the sync committee and v2 APIs.
## Issue Addressed
NA
## Proposed Changes
I used these logs when debugging a spurious failure with Infura and thought they might be nice to have around permanently.
There's no changes to functionality in this PR, just some additional `debug!` logs.
## Additional Info
NA
## Issue Addressed
Deprecates the step parameter in the blocks by range request
## Proposed Changes
- Modifies the BlocksByRangeRequest type to remove the step parameter and everywhere we took it into account before
- Adds a new type to still handle coding and decoding of requests that use the parameter
## Additional Info
I went with a deprecation over the type itself so that requests received outside `lighthouse_network` don't even need to deal with this parameter. After the deprecation period just removing the Old blocks by range request should be straightforward
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/2944
## Proposed Changes
Remove snapshots from the cache during sync rather than cloning them. This reduces unnecessary cloning and memory fragmentation during sync.
## Additional Info
This PR relies on the fact that the `block_delay` cache is not populated for blocks from sync. Relying on block delay may have the side effect that a change in `block_delay` calculation could lead to: a) more clones, if block delays are added for syncing blocks or b) less clones, if blocks near the head are erroneously provided without a `block_delay`. Case (a) would be a regression to the current status quo, and (b) is low-risk given we know that the snapshot cache is current susceptible to misses (hence `tree-states`).
## Issue Addressed
#2820
## Proposed Changes
The problem is that validator_monitor_prev_epoch metrics are updated only if there is EpochSummary present in summaries map for the previous epoch and it is not the case for the offline validator. Ensure that EpochSummary is inserted into summaries map also for the offline validators.
## Issue Addressed
Partly resolves https://github.com/sigp/lighthouse/issues/3032
## Proposed Changes
Extracts some of the functionality of #3094 into a separate PR as the original PR requires a bit more work.
Do not unnecessarily penalize peers when we fail to validate received execution payloads because our execution layer is offline.
## Issue Addressed
Which issue # does this PR address?
## Proposed Changes
Please list or describe the changes introduced by this PR.
## Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.
## Issue Addressed
NA
## Proposed Changes
- Uses a `Vec` in `SingleEpochParticipationCache` rather than `HashMap` to speed up processing times at the cost of memory usage.
- Cache the result of `integer_sqrt` rather than recomputing for each validator.
- Cache `state.previous_epoch` rather than recomputing it for each validator.
### Benchmarks
Benchmarks on a recent mainnet state using #3252 to get timing.
#### Without this PR
```
lcli skip-slots --state-path /tmp/state-0x3cdc.ssz --partial-state-advance --slots 32 --state-root 0x3cdc33cd02713d8d6cc33a6dbe2d3a5bf9af1d357de0d175a403496486ff845e --runs 10
[2022-06-09T08:21:02Z INFO lcli::skip_slots] Using mainnet spec
[2022-06-09T08:21:02Z INFO lcli::skip_slots] Advancing 32 slots
[2022-06-09T08:21:02Z INFO lcli::skip_slots] Doing 10 runs
[2022-06-09T08:21:02Z INFO lcli::skip_slots] State path: "/tmp/state-0x3cdc.ssz"
SSZ decoding /tmp/state-0x3cdc.ssz: 43ms
[2022-06-09T08:21:03Z INFO lcli::skip_slots] Run 0: 245.718794ms
[2022-06-09T08:21:03Z INFO lcli::skip_slots] Run 1: 245.364782ms
[2022-06-09T08:21:03Z INFO lcli::skip_slots] Run 2: 255.866179ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 3: 243.838909ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 4: 250.431425ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 5: 248.68765ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 6: 262.051113ms
[2022-06-09T08:21:05Z INFO lcli::skip_slots] Run 7: 264.293967ms
[2022-06-09T08:21:05Z INFO lcli::skip_slots] Run 8: 293.202007ms
[2022-06-09T08:21:05Z INFO lcli::skip_slots] Run 9: 264.552017ms
```
#### With this PR:
```
lcli skip-slots --state-path /tmp/state-0x3cdc.ssz --partial-state-advance --slots 32 --state-root 0x3cdc33cd02713d8d6cc33a6dbe2d3a5bf9af1d357de0d175a403496486ff845e --runs 10
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 0: 73.898678ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 1: 75.536978ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 2: 75.176104ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 3: 76.460828ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 4: 75.904195ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 5: 75.53077ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 6: 74.745572ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 7: 75.823489ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 8: 74.892055ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 9: 76.333569ms
```
## Additional Info
NA
## Issue Addressed
currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails, and heavily penalize peers, when we can simply retry. Inspired by a proposal I made to #3094
## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt
Also removes some annoying logs
## Additional Info
We still get a counter on ignored attempts.. just in case
## Issue Addressed
na
## Proposed Changes
Updates libp2p to https://github.com/libp2p/rust-libp2p/pull/2662
## Additional Info
From comments on the relevant PRs listed, we should pay attention at peer management consistency, but I don't think anything weird will happen.
This is running in prater tok and sin
## Issue Addressed
Fixes a timing issue that results in spurious fork choice notifier failures:
```
WARN Error signalling fork choice waiter slot: 3962270, error: ForkChoiceSignalOutOfOrder { current: Slot(3962271), latest: Slot(3962270) }, service: beacon
```
There’s a fork choice run that is scheduled to run at the start of every slot by the `timer`, which creates a 12s interval timer when the beacon node starts up. The problem is that if there’s a bit of clock drift that gets corrected via NTP (or a leap second for that matter) then these 12s intervals will cease to line up with the start of the slot. This then creates the mismatch in slot number that we see above.
Lighthouse also runs fork choice 500ms before the slot begins, and these runs are what is conflicting with the start-of-slot runs. This means that the warning in current versions of Lighthouse is mostly cosmetic because fork choice is up to date with all but the most recent 500ms of attestations (which usually isn’t many).
## Proposed Changes
Fix the per-slot timer so that it continually re-calculates the duration to the start of the next slot and waits for that.
A side-effect of this change is that we may skip slots if the per-slot task takes >12s to run, but I think this is an unlikely scenario and an acceptable compromise.
## Issue Addressed
Reduces the effect of late blocks on overall node buildup
## Proposed Changes
change the capacity of the channels used to send work for reprocessing in the beacon processor, and to send back to the main processor task, to be 75% of the capacity of the channel for receiving new events
## Additional Info
The issues we've seen suggest we should still evaluate node performance under stress, with late blocks being a big factor.
Other changes that could help:
1. right now we have a cap for queued attestations for reprocessing that applies to the sum of aggregated and unaggregated attestations. We could consider adding a separate cap that favors aggregated ones.
2. solving #2848
## Issue Addressed
Fix for the eth1 cache sync issue observed on Ropsten.
## Proposed Changes
Ropsten blocks are so infrequent that they broke our algorithm for downloading eth1 blocks. We currently try to download forwards from the last block in our cache to the block with block number [`remote_highest_block - FOLLOW_DISTANCE + FOLLOW_DISTANCE / ETH1_BLOCK_TIME_TOLERANCE_FACTOR`](6f732986f1/beacon_node/eth1/src/service.rs (L489-L492)). With the tolerance set to 4 this is insufficient because we lag by 1536 blocks, which is more like ~14 hours on Ropsten. This results in us having an incomplete eth1 cache, because we should cache all blocks between -16h and -8h. Even if we were to set the tolerance to 2 for the largest allowance, we would only look back 1024 blocks which is still more than 8 hours.
For example consider this block https://ropsten.etherscan.io/block/12321390. The block from 1536 blocks earlier is 14 hours and 20 minutes before it: https://ropsten.etherscan.io/block/12319854. The block from 1024 blocks earlier is https://ropsten.etherscan.io/block/12320366, 8 hours and 48 minutes before.
- This PR introduces a new CLI flag called `--eth1-cache-follow-distance` which can be used to set the distance manually.
- A new dynamic catchup mechanism is added which detects when the cache is lagging the true eth1 chain and tries to download more blocks within the follow distance in order to catch up.
## Issue Addressed
#3156
## Proposed Changes
Emit a `WARN` log whenever the value of `fee_recipient` as returned from the EE is different from the value of `suggested_fee_recipient` as set on the BN, for example by the `--suggested-fee-recipient` CLI flag.
## Additional Info
I have set the log level to `WARN` since it is legal behaviour (meaning it isn't really an error but is important to know when it is occurring).
If we feel like this behaviour is almost always undesired (caused by a misconfiguration or malicious EE) then an `ERRO` log would be more appropriate. Happy to change it in that case.
## Issue Addressed
We were logging `out_finalized_epoch` instead of `our_finalized_epoch`. I noticed this ages ago but only just got around to fixing it.
## Additional Info
I also reformatted the log line to respect the line length limit (`rustfmt` won't do it because it gets confused by the `;` in slog's log macros).
## Issue Addressed
NA
## Proposed Changes
Please list or describe the changes introduced by this PR.
## Additional Info
- Pending testing on our infra. **Please do not merge**
## Issue Addressed
#3212
## Proposed Changes
Move chain segments coming from back-fill syncing from highest priority to lowest
## Additional Info
If this does not solve the issue, next steps would be lowering the batch size for back-fill sync, and as last resort throttling the processing of these chain segments
## Issue Addressed
Fixes an issue that @paulhauner found with the v2.3.0 release candidate whereby the fork choice runs introduced by #3168 tripped over each other during sync:
```
May 24 23:06:40.542 WARN Error signalling fork choice waiter slot: 3884129, error: ForkChoiceSignalOutOfOrder { current: Slot(3884131), latest: Slot(3884129) }, service: beacon
```
This can occur because fork choice is called from the state advance _and_ the per-slot task. When one of these runs takes a long time it can end up finishing after a run from a later slot, tripping the error above. The problem is resolved by not running either of these fork choice calls during sync.
Additionally, these parallel fork choice runs were causing issues in the database:
```
May 24 07:49:05.098 WARN Found a chain that should already have been pruned, head_slot: 92925, head_block_root: 0xa76c7bf1b98e54ed4b0d8686efcfdf853484e6c2a4c67e91cbf19e5ad1f96b17, service: beacon
May 24 07:49:05.101 WARN Database migration failed error: HotColdDBError(FreezeSlotError { current_split_slot: Slot(92608), proposed_split_slot: Slot(92576) }), service: beacon
```
In this case, two fork choice calls triggering the finalization processing were being processed out of order due to differences in their processing time, causing the background migrator to try to advance finalization _backwards_ 😳. Removing the parallel fork choice runs from sync effectively addresses the issue, because these runs are most likely to have different finalized checkpoints (because of the speed at which fork choice advances during sync). In theory it's still possible to process updates out of order if any other fork choice runs end up completing out of order, but this should be much less common. Fixing out of order fork choice runs in general is difficult as it requires architectural changes like serialising fork choice updates through a single thread, or locking fork choice along with the head when it is mutated (https://github.com/sigp/lighthouse/pull/3175).
## Proposed Changes
* Don't run per-slot fork choice during sync (if head is older than 4 slots)
* Don't run state-advance fork choice during sync (if head is older than 4 slots)
* Check for monotonic finalization updates in the background migrator. This is a good defensive check to have, and I'm not sure why we didn't have it before (we may have had it and wrongly removed it).
*This PR was adapted from @pawanjay176's work in #3197.*
## Issue Addressed
Fixes a regression in https://github.com/sigp/lighthouse/pull/3168
## Proposed Changes
https://github.com/sigp/lighthouse/pull/3168 added calls to `fork_choice` in `BeaconChain::per_slot_task` function. This leads to a panic as `per_slot_task` is called from an async context which calls fork choice, which then calls `block_on`.
This PR changes the timer to call the `per_slot_task` function in a blocking thread.
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
## Issue Addressed
This fixes the low-hanging Clippy lints introduced in Rust 1.61 (due any hour now). It _ignores_ one lint, because fixing it requires a structural refactor of the validator client that needs to be done delicately. I've started on that refactor and will create another PR that can be reviewed in more depth in the coming days. I think we should merge this PR in the meantime to unblock CI.
## Issue Addressed
Upcoming spec change https://github.com/ethereum/consensus-specs/pull/2878
## Proposed Changes
1. Run fork choice at the start of every slot, and wait for this run to complete before proposing a block.
2. As an optimisation, also run fork choice 3/4 of the way through the slot (at 9s), _dequeueing attestations for the next slot_.
3. Remove the fork choice run from the state advance timer that occurred before advancing the state.
## Additional Info
### Block Proposal Accuracy
This change makes us more likely to propose on top of the correct head in the presence of re-orgs with proposer boost in play. The main scenario that this change is designed to address is described in the linked spec issue.
### Attestation Accuracy
This change _also_ makes us more likely to attest to the correct head. Currently in the case of a skipped slot at `slot` we only run fork choice 9s into `slot - 1`. This means the attestations from `slot - 1` aren't taken into consideration, and any boost applied to the block from `slot - 1` is not removed (it should be). In the language of the linked spec issue, this means we are liable to attest to C, even when the majority voting weight has already caused a re-org to B.
### Why remove the call before the state advance?
If we've run fork choice at the start of the slot then it has already dequeued all the attestations from the previous slot, which are the only ones eligible to influence the head in the current slot. Running fork choice again is unnecessary (unless we run it for the next slot and try to pre-empt a re-org, but I don't currently think this is a great idea).
### Performance
Based on Prater testing this adds about 5-25ms of runtime to block proposal times, which are 500-1000ms on average (and spike to 5s+ sometimes due to state handling issues 😢 ). I believe this is a small enough penalty to enable it by default, with the option to disable it via the new flag `--fork-choice-before-proposal-timeout 0`. Upcoming work on block packing and state representation will also reduce block production times in general, while removing the spikes.
### Implementation
Fork choice gets invoked at the start of the slot via the `per_slot_task` function called from the slot timer. It then uses a condition variable to signal to block production that fork choice has been updated. This is a bit funky, but it seems to work. One downside of the timer-based approach is that it doesn't happen automatically in most of the tests. The test added by this PR has to trigger the run manually.
## Issue Addressed
@z3n-chada is currently getting a `PayloadIdUnavailable` error when connecting lighthouse to Erigon and it's difficult to discern why so this just logs out the response status from the EE when we hit an `PayloadIdUnavailable` error
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Proposed Changes
Remove support for DB migrations that support upgrading from schema's below version 5. This is mostly for cosmetic/code quality reasons as in most circumstances upgrading from versions of Lighthouse this old will almost always require a re-sync.
## Additional Info
The minimum supported database schema is now version 5.