## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/2944
## Proposed Changes
Remove snapshots from the cache during sync rather than cloning them. This reduces unnecessary cloning and memory fragmentation during sync.
## Additional Info
This PR relies on the fact that the `block_delay` cache is not populated for blocks from sync. Relying on block delay may have the side effect that a change in `block_delay` calculation could lead to: a) more clones, if block delays are added for syncing blocks or b) less clones, if blocks near the head are erroneously provided without a `block_delay`. Case (a) would be a regression to the current status quo, and (b) is low-risk given we know that the snapshot cache is current susceptible to misses (hence `tree-states`).
## Issue Addressed
#2820
## Proposed Changes
The problem is that validator_monitor_prev_epoch metrics are updated only if there is EpochSummary present in summaries map for the previous epoch and it is not the case for the offline validator. Ensure that EpochSummary is inserted into summaries map also for the offline validators.
## Issue Addressed
Partly resolves https://github.com/sigp/lighthouse/issues/3032
## Proposed Changes
Extracts some of the functionality of #3094 into a separate PR as the original PR requires a bit more work.
Do not unnecessarily penalize peers when we fail to validate received execution payloads because our execution layer is offline.
## Issue Addressed
Which issue # does this PR address?
## Proposed Changes
Please list or describe the changes introduced by this PR.
## Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.
## Issue Addressed
NA
## Proposed Changes
- Uses a `Vec` in `SingleEpochParticipationCache` rather than `HashMap` to speed up processing times at the cost of memory usage.
- Cache the result of `integer_sqrt` rather than recomputing for each validator.
- Cache `state.previous_epoch` rather than recomputing it for each validator.
### Benchmarks
Benchmarks on a recent mainnet state using #3252 to get timing.
#### Without this PR
```
lcli skip-slots --state-path /tmp/state-0x3cdc.ssz --partial-state-advance --slots 32 --state-root 0x3cdc33cd02713d8d6cc33a6dbe2d3a5bf9af1d357de0d175a403496486ff845e --runs 10
[2022-06-09T08:21:02Z INFO lcli::skip_slots] Using mainnet spec
[2022-06-09T08:21:02Z INFO lcli::skip_slots] Advancing 32 slots
[2022-06-09T08:21:02Z INFO lcli::skip_slots] Doing 10 runs
[2022-06-09T08:21:02Z INFO lcli::skip_slots] State path: "/tmp/state-0x3cdc.ssz"
SSZ decoding /tmp/state-0x3cdc.ssz: 43ms
[2022-06-09T08:21:03Z INFO lcli::skip_slots] Run 0: 245.718794ms
[2022-06-09T08:21:03Z INFO lcli::skip_slots] Run 1: 245.364782ms
[2022-06-09T08:21:03Z INFO lcli::skip_slots] Run 2: 255.866179ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 3: 243.838909ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 4: 250.431425ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 5: 248.68765ms
[2022-06-09T08:21:04Z INFO lcli::skip_slots] Run 6: 262.051113ms
[2022-06-09T08:21:05Z INFO lcli::skip_slots] Run 7: 264.293967ms
[2022-06-09T08:21:05Z INFO lcli::skip_slots] Run 8: 293.202007ms
[2022-06-09T08:21:05Z INFO lcli::skip_slots] Run 9: 264.552017ms
```
#### With this PR:
```
lcli skip-slots --state-path /tmp/state-0x3cdc.ssz --partial-state-advance --slots 32 --state-root 0x3cdc33cd02713d8d6cc33a6dbe2d3a5bf9af1d357de0d175a403496486ff845e --runs 10
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 0: 73.898678ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 1: 75.536978ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 2: 75.176104ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 3: 76.460828ms
[2022-06-09T08:57:59Z INFO lcli::skip_slots] Run 4: 75.904195ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 5: 75.53077ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 6: 74.745572ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 7: 75.823489ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 8: 74.892055ms
[2022-06-09T08:58:00Z INFO lcli::skip_slots] Run 9: 76.333569ms
```
## Additional Info
NA
## Issue Addressed
currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails, and heavily penalize peers, when we can simply retry. Inspired by a proposal I made to #3094
## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt
Also removes some annoying logs
## Additional Info
We still get a counter on ignored attempts.. just in case
## Issue Addressed
na
## Proposed Changes
Updates libp2p to https://github.com/libp2p/rust-libp2p/pull/2662
## Additional Info
From comments on the relevant PRs listed, we should pay attention at peer management consistency, but I don't think anything weird will happen.
This is running in prater tok and sin
## Issue Addressed
Fixes a timing issue that results in spurious fork choice notifier failures:
```
WARN Error signalling fork choice waiter slot: 3962270, error: ForkChoiceSignalOutOfOrder { current: Slot(3962271), latest: Slot(3962270) }, service: beacon
```
There’s a fork choice run that is scheduled to run at the start of every slot by the `timer`, which creates a 12s interval timer when the beacon node starts up. The problem is that if there’s a bit of clock drift that gets corrected via NTP (or a leap second for that matter) then these 12s intervals will cease to line up with the start of the slot. This then creates the mismatch in slot number that we see above.
Lighthouse also runs fork choice 500ms before the slot begins, and these runs are what is conflicting with the start-of-slot runs. This means that the warning in current versions of Lighthouse is mostly cosmetic because fork choice is up to date with all but the most recent 500ms of attestations (which usually isn’t many).
## Proposed Changes
Fix the per-slot timer so that it continually re-calculates the duration to the start of the next slot and waits for that.
A side-effect of this change is that we may skip slots if the per-slot task takes >12s to run, but I think this is an unlikely scenario and an acceptable compromise.
## Issue Addressed
Reduces the effect of late blocks on overall node buildup
## Proposed Changes
change the capacity of the channels used to send work for reprocessing in the beacon processor, and to send back to the main processor task, to be 75% of the capacity of the channel for receiving new events
## Additional Info
The issues we've seen suggest we should still evaluate node performance under stress, with late blocks being a big factor.
Other changes that could help:
1. right now we have a cap for queued attestations for reprocessing that applies to the sum of aggregated and unaggregated attestations. We could consider adding a separate cap that favors aggregated ones.
2. solving #2848
## Issue Addressed
Fix for the eth1 cache sync issue observed on Ropsten.
## Proposed Changes
Ropsten blocks are so infrequent that they broke our algorithm for downloading eth1 blocks. We currently try to download forwards from the last block in our cache to the block with block number [`remote_highest_block - FOLLOW_DISTANCE + FOLLOW_DISTANCE / ETH1_BLOCK_TIME_TOLERANCE_FACTOR`](6f732986f1/beacon_node/eth1/src/service.rs (L489-L492)). With the tolerance set to 4 this is insufficient because we lag by 1536 blocks, which is more like ~14 hours on Ropsten. This results in us having an incomplete eth1 cache, because we should cache all blocks between -16h and -8h. Even if we were to set the tolerance to 2 for the largest allowance, we would only look back 1024 blocks which is still more than 8 hours.
For example consider this block https://ropsten.etherscan.io/block/12321390. The block from 1536 blocks earlier is 14 hours and 20 minutes before it: https://ropsten.etherscan.io/block/12319854. The block from 1024 blocks earlier is https://ropsten.etherscan.io/block/12320366, 8 hours and 48 minutes before.
- This PR introduces a new CLI flag called `--eth1-cache-follow-distance` which can be used to set the distance manually.
- A new dynamic catchup mechanism is added which detects when the cache is lagging the true eth1 chain and tries to download more blocks within the follow distance in order to catch up.
## Issue Addressed
#3156
## Proposed Changes
Emit a `WARN` log whenever the value of `fee_recipient` as returned from the EE is different from the value of `suggested_fee_recipient` as set on the BN, for example by the `--suggested-fee-recipient` CLI flag.
## Additional Info
I have set the log level to `WARN` since it is legal behaviour (meaning it isn't really an error but is important to know when it is occurring).
If we feel like this behaviour is almost always undesired (caused by a misconfiguration or malicious EE) then an `ERRO` log would be more appropriate. Happy to change it in that case.
## Issue Addressed
We were logging `out_finalized_epoch` instead of `our_finalized_epoch`. I noticed this ages ago but only just got around to fixing it.
## Additional Info
I also reformatted the log line to respect the line length limit (`rustfmt` won't do it because it gets confused by the `;` in slog's log macros).
## Issue Addressed
NA
## Proposed Changes
Please list or describe the changes introduced by this PR.
## Additional Info
- Pending testing on our infra. **Please do not merge**
## Issue Addressed
#3212
## Proposed Changes
Move chain segments coming from back-fill syncing from highest priority to lowest
## Additional Info
If this does not solve the issue, next steps would be lowering the batch size for back-fill sync, and as last resort throttling the processing of these chain segments
## Issue Addressed
Fixes an issue that @paulhauner found with the v2.3.0 release candidate whereby the fork choice runs introduced by #3168 tripped over each other during sync:
```
May 24 23:06:40.542 WARN Error signalling fork choice waiter slot: 3884129, error: ForkChoiceSignalOutOfOrder { current: Slot(3884131), latest: Slot(3884129) }, service: beacon
```
This can occur because fork choice is called from the state advance _and_ the per-slot task. When one of these runs takes a long time it can end up finishing after a run from a later slot, tripping the error above. The problem is resolved by not running either of these fork choice calls during sync.
Additionally, these parallel fork choice runs were causing issues in the database:
```
May 24 07:49:05.098 WARN Found a chain that should already have been pruned, head_slot: 92925, head_block_root: 0xa76c7bf1b98e54ed4b0d8686efcfdf853484e6c2a4c67e91cbf19e5ad1f96b17, service: beacon
May 24 07:49:05.101 WARN Database migration failed error: HotColdDBError(FreezeSlotError { current_split_slot: Slot(92608), proposed_split_slot: Slot(92576) }), service: beacon
```
In this case, two fork choice calls triggering the finalization processing were being processed out of order due to differences in their processing time, causing the background migrator to try to advance finalization _backwards_ 😳. Removing the parallel fork choice runs from sync effectively addresses the issue, because these runs are most likely to have different finalized checkpoints (because of the speed at which fork choice advances during sync). In theory it's still possible to process updates out of order if any other fork choice runs end up completing out of order, but this should be much less common. Fixing out of order fork choice runs in general is difficult as it requires architectural changes like serialising fork choice updates through a single thread, or locking fork choice along with the head when it is mutated (https://github.com/sigp/lighthouse/pull/3175).
## Proposed Changes
* Don't run per-slot fork choice during sync (if head is older than 4 slots)
* Don't run state-advance fork choice during sync (if head is older than 4 slots)
* Check for monotonic finalization updates in the background migrator. This is a good defensive check to have, and I'm not sure why we didn't have it before (we may have had it and wrongly removed it).
*This PR was adapted from @pawanjay176's work in #3197.*
## Issue Addressed
Fixes a regression in https://github.com/sigp/lighthouse/pull/3168
## Proposed Changes
https://github.com/sigp/lighthouse/pull/3168 added calls to `fork_choice` in `BeaconChain::per_slot_task` function. This leads to a panic as `per_slot_task` is called from an async context which calls fork choice, which then calls `block_on`.
This PR changes the timer to call the `per_slot_task` function in a blocking thread.
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
## Issue Addressed
This fixes the low-hanging Clippy lints introduced in Rust 1.61 (due any hour now). It _ignores_ one lint, because fixing it requires a structural refactor of the validator client that needs to be done delicately. I've started on that refactor and will create another PR that can be reviewed in more depth in the coming days. I think we should merge this PR in the meantime to unblock CI.
## Issue Addressed
Upcoming spec change https://github.com/ethereum/consensus-specs/pull/2878
## Proposed Changes
1. Run fork choice at the start of every slot, and wait for this run to complete before proposing a block.
2. As an optimisation, also run fork choice 3/4 of the way through the slot (at 9s), _dequeueing attestations for the next slot_.
3. Remove the fork choice run from the state advance timer that occurred before advancing the state.
## Additional Info
### Block Proposal Accuracy
This change makes us more likely to propose on top of the correct head in the presence of re-orgs with proposer boost in play. The main scenario that this change is designed to address is described in the linked spec issue.
### Attestation Accuracy
This change _also_ makes us more likely to attest to the correct head. Currently in the case of a skipped slot at `slot` we only run fork choice 9s into `slot - 1`. This means the attestations from `slot - 1` aren't taken into consideration, and any boost applied to the block from `slot - 1` is not removed (it should be). In the language of the linked spec issue, this means we are liable to attest to C, even when the majority voting weight has already caused a re-org to B.
### Why remove the call before the state advance?
If we've run fork choice at the start of the slot then it has already dequeued all the attestations from the previous slot, which are the only ones eligible to influence the head in the current slot. Running fork choice again is unnecessary (unless we run it for the next slot and try to pre-empt a re-org, but I don't currently think this is a great idea).
### Performance
Based on Prater testing this adds about 5-25ms of runtime to block proposal times, which are 500-1000ms on average (and spike to 5s+ sometimes due to state handling issues 😢 ). I believe this is a small enough penalty to enable it by default, with the option to disable it via the new flag `--fork-choice-before-proposal-timeout 0`. Upcoming work on block packing and state representation will also reduce block production times in general, while removing the spikes.
### Implementation
Fork choice gets invoked at the start of the slot via the `per_slot_task` function called from the slot timer. It then uses a condition variable to signal to block production that fork choice has been updated. This is a bit funky, but it seems to work. One downside of the timer-based approach is that it doesn't happen automatically in most of the tests. The test added by this PR has to trigger the run manually.
## Issue Addressed
@z3n-chada is currently getting a `PayloadIdUnavailable` error when connecting lighthouse to Erigon and it's difficult to discern why so this just logs out the response status from the EE when we hit an `PayloadIdUnavailable` error
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Proposed Changes
Remove support for DB migrations that support upgrading from schema's below version 5. This is mostly for cosmetic/code quality reasons as in most circumstances upgrading from versions of Lighthouse this old will almost always require a re-sync.
## Additional Info
The minimum supported database schema is now version 5.
## Issue Addressed
N/A
## Proposed Changes
Prevents the early attester cache from producing attestations to future blocks. This bug could result in a missed head vote if the BN was requested to produce an attestation for an earlier slot than the head block during the (usually) short window of time between verifying a block and setting it as the head.
This bug was noticed in an [Antithesis](https://andreagrieser.com/) test and diagnosed by @realbigsean.
## Additional Info
NA
# Description
Since the `TaskExecutor` currently requires a `Weak<Runtime>`, it's impossible to use it in an async test where the `Runtime` is created outside our scope. Whilst we *could* create a new `Runtime` instance inside the async test, dropping that `Runtime` would cause a panic (you can't drop a `Runtime` in an async context).
To address this issue, this PR creates the `enum Handle`, which supports either:
- A `Weak<Runtime>` (for use in our production code)
- A `Handle` to a runtime (for use in testing)
In theory, there should be no change to the behaviour of our production code (beyond some slightly different descriptions in HTTP 500 errors), or even our tests. If there is no change, you might ask *"why bother?"*. There are two PRs (#3070 and #3175) that are waiting on these fixes to introduce some new tests. Since we've added the EL to the `BeaconChain` (for the merge), we are now doing more async stuff in tests.
I've also added a `RuntimeExecutor` to the `BeaconChainTestHarness`. Whilst that's not immediately useful, it will become useful in the near future with all the new async testing.
Code simplifications using `Option`/`Result` combinators to make pattern-matches a tad simpler.
Opinions on these loosely held, happy to adjust in review.
Tool-aided by [comby-rust](https://github.com/huitseeker/comby-rust).
## Proposed Changes
Reduce post-merge disk usage by not storing finalized execution payloads in Lighthouse's database.
⚠️ **This is achieved in a backwards-incompatible way for networks that have already merged** ⚠️. Kiln users and shadow fork enjoyers will be unable to downgrade after running the code from this PR. The upgrade migration may take several minutes to run, and can't be aborted after it begins.
The main changes are:
- New column in the database called `ExecPayload`, keyed by beacon block root.
- The `BeaconBlock` column now stores blinded blocks only.
- Lots of places that previously used full blocks now use blinded blocks, e.g. analytics APIs, block replay in the DB, etc.
- On finalization:
- `prune_abanonded_forks` deletes non-canonical payloads whilst deleting non-canonical blocks.
- `migrate_db` deletes finalized canonical payloads whilst deleting finalized states.
- Conversions between blinded and full blocks are implemented in a compositional way, duplicating some work from Sean's PR #3134.
- The execution layer has a new `get_payload_by_block_hash` method that reconstructs a payload using the EE's `eth_getBlockByHash` call.
- I've tested manually that it works on Kiln, using Geth and Nethermind.
- This isn't necessarily the most efficient method, and new engine APIs are being discussed to improve this: https://github.com/ethereum/execution-apis/pull/146.
- We're depending on the `ethers` master branch, due to lots of recent changes. We're also using a workaround for https://github.com/gakonst/ethers-rs/issues/1134.
- Payload reconstruction is used in the HTTP API via `BeaconChain::get_block`, which is now `async`. Due to the `async` fn, the `blocking_json` wrapper has been removed.
- Payload reconstruction is used in network RPC to serve blocks-by-{root,range} responses. Here the `async` adjustment is messier, although I think I've managed to come up with a reasonable compromise: the handlers take the `SendOnDrop` by value so that they can drop it on _task completion_ (after the `fn` returns). Still, this is introducing disk reads onto core executor threads, which may have a negative performance impact (thoughts appreciated).
## Additional Info
- [x] For performance it would be great to remove the cloning of full blocks when converting them to blinded blocks to write to disk. I'm going to experiment with a `put_block` API that takes the block by value, breaks it into a blinded block and a payload, stores the blinded block, and then re-assembles the full block for the caller.
- [x] We should measure the latency of blocks-by-root and blocks-by-range responses.
- [x] We should add integration tests that stress the payload reconstruction (basic tests done, issue for more extensive tests: https://github.com/sigp/lighthouse/issues/3159)
- [x] We should (manually) test the schema v9 migration from several prior versions, particularly as blocks have changed on disk and some migrations rely on being able to load blocks.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Fix a panic due to misuse of the Tokio executor when processing a forkchoiceUpdated response. We were previously calling `process_invalid_execution_payload` from the async function `update_execution_engine_forkchoice_async`, which resulted in a panic because `process_invalid_execution_payload` contains a call to fork choice, which ultimately calls `block_on`.
An example backtrace can be found here: https://gist.github.com/michaelsproul/ac5da03e203d6ffac672423eaf52fb20
## Proposed Changes
Wrap the call to `process_invalid_execution_payload` in a `spawn_blocking` so that `block_on` is no longer called from an async context.
## Additional Info
- I've been thinking about how to catch bugs like this with static analysis (a new Clippy lint).
- The payload validation tests have been re-worked to support distinct responses from the mock EE for newPayload and forkchoiceUpdated. Three new tests have been added covering the `Invalid`, `InvalidBlockHash` and `InvalidTerminalBlock` cases.
- I think we need a bunch more tests of different legal and illegal variations
## Issue Addressed
N/A
## Proposed Changes
Previously, we were using `Sleep::is_elapsed()` to check if the shutdown timeout had triggered without polling the sleep. This PR polls the sleep timer.
## Issue Addressed
We still ping peers that are considered in a disconnecting state
## Proposed Changes
Do not ping peers once we decide they are disconnecting
Upgrade logs about ignored rpc messages
## Additional Info
--
## Issue Addressed
NA
## Proposed Changes
Disallow the production of attestations and retrieval of unaggregated attestations when they reference an optimistic head. Add tests to this end.
I also moved `BeaconChain::produce_unaggregated_attestation_for_block` to the `BeaconChainHarness`. It was only being used during tests, so it's nice to stop pretending it's production code. I also needed something that could produce attestations to optimistic blocks in order to simulate scenarios where the justified checkpoint is determined invalid (if no one would attest to an optimistic block, we could never justify it and then flip it to invalid).
## Additional Info
- ~~Blocked on #3126~~
## Issue Addressed
In very rare occasions we've seen most if not all our peers in a chain with which we don't agree. Purging these peers can take a very long time: number of retries of the chain. Meanwhile sync is caught in a loop trying the chain again and again. This makes it so that we fast track purging peers via registering the failed chain to prevent retrying for some time (30 seconds). Longer times could be dangerous since a chain can fail if a batch fails to download for example. In this case, I think it's still acceptable to fast track purging peers since they are nor providing the required info anyway
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
## Issue Addressed
Addresses sync stalls on v2.2.0 (i.e. https://github.com/sigp/lighthouse/issues/3147).
## Additional Info
I've avoided doing a full `cargo update` because I noticed there's a new patch version of libp2p and thought it could do with some more testing.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
NA
## Proposed Changes
Fixes an issue introduced in #3088 which was causing unnecessary `crit` logs on networks without Bellatrix enabled.
## Additional Info
NA
## Issue Addressed
N/A
## Proposed Changes
https://github.com/sigp/lighthouse/pull/3133 changed the rpc type limits to be fork aware i.e. if our current fork based on wall clock slot is Altair, then we apply only altair rpc type limits. This is a bug because phase0 blocks can still be sent over rpc and phase 0 block minimum size is smaller than altair block minimum size. So a phase0 block with `size < SIGNED_BEACON_BLOCK_ALTAIR_MIN` will return an `InvalidData` error as it doesn't pass the rpc types bound check.
This error can be seen when we try syncing pre-altair blocks with size smaller than `SIGNED_BEACON_BLOCK_ALTAIR_MIN`.
This PR fixes the issue by also accounting for forks earlier than current_fork in the rpc limits calculation in the `rpc_block_limits_by_fork` function. I decided to hardcode the limits in the function because that seemed simpler than calculating previous forks based on current fork and doing a min across forks. Adding a new fork variant is simple and can the limits can be easily checked in a review.
Adds unit tests and modifies the syncing simulator to check the syncing from across fork boundaries.
The syncing simulator's block 1 would always be of phase 0 minimum size (404 bytes) which is smaller than altair min block size (since block 1 contains no attestations).
## Issue Addressed
NA
## Proposed Changes
Ensures that a `VALID` response from a `forkchoiceUpdate` call will update that block in `ProtoArray`.
I also had to modify the mock execution engine so it wouldn't return valid when all payloads were supposed to be some other static value.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
- Adds more checks to prevent importing blocks atop parent with invalid execution payloads.
- Adds a test for these conditions.
## Additional Info
NA
## Proposed Changes
Cut release v2.2.0 including proposer boost.
## Additional Info
I also updated the clippy lints for the imminent release of Rust 1.60, although LH v2.2.0 will continue to compile using Rust 1.58 (our MSRV).
## Proposed Changes
I did some gardening 🌳 in our dependency tree:
- Remove duplicate versions of `warp` (git vs patch)
- Remove duplicate versions of lots of small deps: `cpufeatures`, `ethabi`, `ethereum-types`, `bitvec`, `nix`, `libsecp256k1`.
- Update MDBX (should resolve#3028). I tested and Lighthouse compiles on Windows 11 now.
- Restore `psutil` back to upstream
- Make some progress updating everything to rand 0.8. There are a few crates stuck on 0.7.
Hopefully this puts us on a better footing for future `cargo audit` issues, and improves compile times slightly.
## Additional Info
Some crates are held back by issues with `zeroize`. libp2p-noise depends on [`chacha20poly1305`](https://crates.io/crates/chacha20poly1305) which depends on zeroize < v1.5, and we can only have one version of zeroize because it's post 1.0 (see https://github.com/rust-lang/cargo/issues/6584). The latest version of `zeroize` is v1.5.4, which is used by the new versions of many other crates (e.g. `num-bigint-dig`). Once a new version of chacha20poly1305 is released we can update libp2p-noise and upgrade everything to the latest `zeroize` version.
I've also opened a PR to `blst` related to zeroize: https://github.com/supranational/blst/pull/111
## Issue Addressed
N/A
## Proposed Changes
Fix the upper bound for blocks by root responses to be equal to the max merge block size instead of altair.
Further make the rpc response limits fork aware.
## Proposed Changes
Increase the default `--slots-per-restore-point` to 8192 for a 4x reduction in freezer DB disk usage.
Existing nodes that use the previous default of 2048 will be left unchanged. Newly synced nodes (with or without checkpoint sync) will use the new 8192 default.
Long-term we could do away with the freezer DB entirely for validator-only nodes, but this change is much simpler and grants us some extra space in the short term. We can also roll it out gradually across our nodes by purging databases one by one, while keeping the Ansible config the same.
## Additional Info
We ignore a change from 2048 to 8192 if the user hasn't set the 8192 explicitly. We fire a debug log in the case where we do ignore:
```
DEBG Ignoring slots-per-restore-point config in favour of on-disk value, on_disk: 2048, config: 8192
```
## Proposed Changes
Add a `lighthouse db` command with three initial subcommands:
- `lighthouse db version`: print the database schema version.
- `lighthouse db migrate --to N`: manually upgrade (or downgrade!) the database to a different version.
- `lighthouse db inspect --column C`: log the key and size in bytes of every value in a given `DBColumn`.
This PR lays the groundwork for other changes, namely:
- Mark's fast-deposit sync (https://github.com/sigp/lighthouse/pull/2915), for which I think we should implement a database downgrade (from v9 to v8).
- My `tree-states` work, which already implements a downgrade (v10 to v8).
- Standalone purge commands like `lighthouse db purge-dht` per https://github.com/sigp/lighthouse/issues/2824.
## Additional Info
I updated the `strum` crate to 0.24.0, which necessitated some changes in the network code to remove calls to deprecated methods.
Thanks to @winksaville for the motivation, and implementation work that I used as a source of inspiration (https://github.com/sigp/lighthouse/pull/2685).
## Issue Addressed
MEV boost compatibility
## Proposed Changes
See #2987
## Additional Info
This is blocked on the stabilization of a couple specs, [here](https://github.com/ethereum/beacon-APIs/pull/194) and [here](https://github.com/flashbots/mev-boost/pull/20).
Additional TODO's and outstanding questions
- [ ] MEV boost JWT Auth
- [ ] Will `builder_proposeBlindedBlock` return the revealed payload for the BN to propogate
- [ ] Should we remove `private-tx-proposals` flag and communicate BN <> VC with blinded blocks by default once these endpoints enter the beacon-API's repo? This simplifies merge transition logic.
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Issue Addressed
N/A
## Proposed Changes
- Update the JSON-RPC id field for both our request and response objects to be a `serde_json::Value` rather than a `u32`. This field could be a string or a number according to the JSON-RPC 2.0 spec. We only ever set it to a number, but if, for example, we get a response that wraps this number in quotes, we would fail to deserialize it. I think because we're not doing any validation around this id otherwise, we should be less strict with it in this regard.
## Additional Info
Co-authored-by: realbigsean <sean@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
In the first Goerli shadow-fork, Lighthouse was getting timeouts from Geth which prevented the LH+Geth pair from progressing.
There's not a whole lot of information I can use to set these timeouts. The most interesting pieces of information I have are quotes from Marius from Geth:
- *"Fcu also needs to construct the block which can take 2sec"* ([Discord](https://discord.com/channels/595666850260713488/910910348922589184/958006487052066836))
- *"2 sec should be ok for new payload, weird that it times out"* ([Discord](https://discord.com/channels/595666850260713488/910910348922589184/958006487052066836))
I don't think we should be so worried about getting these timeouts correct now. No one really knows how long the various EEs are going to take, it's a bit too early in development. With these changes I'm giving some headroom so that we don't fail just because EEs are quite optimized enough. I've set the value to 6s (half a mainnet slot), since I think anything beyond 6s is an interesting problem that we want to know about sooner rather than later.
## Additional Info
NA
## Issue Addressed
N/A
## Proposed Changes
Set the engine state to `EngineState::Offline` if the engine api call fails during broadcast. This caused issues while pausing sync when the execution engine is offline because `EngineState` always returned `Synced`.
## Proposed Changes
Allow Lighthouse to speculatively create blocks via the `/eth/v1/validators/blocks` endpoint by optionally skipping the RANDAO verification that we introduced in #2740. When `verify_randao=false` is passed as a query parameter the `randao_reveal` is not required to be present, and if present will only be lightly checked (must be a valid BLS sig). If `verify_randao` is omitted it defaults to true and Lighthouse behaves exactly as it did previously, hence this PR is backwards-compatible.
I'd like to get this change into `unstable` pretty soon as I've got 3 projects building on top of it:
- [`blockdreamer`](https://github.com/michaelsproul/blockdreamer), which mocks block production every slot in order to fingerprint clients
- analysis of Lighthouse's block packing _optimality_, which uses `blockdreamer` to extract interesting instances of the attestation packing problem
- analysis of Lighthouse's block packing _performance_ (as in speed) on the `tree-states` branch
## Additional Info
Having tested `blockdreamer` with Prysm, Nimbus and Teku I noticed that none of them verify the randao signature on `/eth/v1/validator/blocks`. I plan to open a PR to the `beacon-APIs` repo anyway so that this parameter can be standardised in case the other clients add RANDAO verification by default in future.
## Issue Addressed
No issue, just updating merge ASCII art.
## Proposed Changes
Updating ASCII art for merge.
## Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.
## Issue Addressed
#3103
## Proposed Changes
Parse `http-address` and `metrics-address` as `IpAddr` for both the beacon node and validator client to support IPv6 addresses.
Also adjusts parsing of CORS origins to allow for IPv6 addresses.
## Usage
You can now set `http-address` and/or `metrics-address` flags to IPv6 addresses.
For example, the following:
`lighthouse bn --http --http-address :: --metrics --metrics-address ::1`
will expose the beacon node HTTP server on `[::]` (equivalent of `0.0.0.0` in IPv4) and the metrics HTTP server on `localhost` (the equivalent of `127.0.0.1` in IPv4)
The beacon node API can then be accessed by:
`curl "http://[server-ipv6-address]:5052/eth/v1/some_endpoint"`
And the metrics server api can be accessed by:
`curl "http://localhost:5054/metrics"` or by `curl "http://[::1]:5054/metrics"`
## Additional Info
On most Linux distributions the `v6only` flag is set to `false` by default (see the section for the `IPV6_V6ONLY` flag in https://www.man7.org/linux/man-pages/man7/ipv6.7.html) which means IPv4 connections will continue to function on a IPv6 address (providing it is appropriately mapped). This means that even if the Lighthouse API is running on `::` it is also possible to accept IPv4 connections.
However on Windows, this is not the case. The `v6only` flag is set to `true` so binding to `::` will only allow IPv6 connections.
## Issue Addressed
Removes the await points in sync waiting for a processor response for rpc block processing. Built on top of #3029
This also handles a couple of bugs in the previous code and adds a relatively comprehensive test suite.
## Issue Addressed
NA
## Proposed Changes
- Bump version to `v2.1.4`
- Run `cargo update`
## Additional Info
I think this release should be published around the 15th of March.
Presently `blocked` for testing on our infrastructure.
## Issue Addressed
This address an issue which was preventing checkpoint-sync.
When the node starts from checkpoint sync, the head block and the finalized block are the same value. We did not respect this when sending a `forkchoiceUpdated` (fcU) call to the EL and were expecting fork choice to hold the *finalized ancestor of the head* and returning an error when it didn't.
This PR uses *only fork choice* for sending fcU updates. This is actually quite nice and avoids some atomicity issues between `chain.canonical_head` and `chain.fork_choice`. Now, whenever `chain.fork_choice.get_head` returns a value we also cache the values required for the next fcU call.
## TODO
- [x] ~~Blocked on #3043~~
- [x] Ensure there isn't a warn message at startup.
## Issue Addressed
Don't send an fcU message at startup if it's pre-genesis. The startup fcU message is not critical, not required by the spec, so it's fine to avoid it for networks that start post-Bellatrix fork.
## Issue Addressed
Resolves#2936
## Proposed Changes
Adds functionality for calling [`validator/prepare_beacon_proposer`](https://ethereum.github.io/beacon-APIs/?urls.primaryName=dev#/Validator/prepareBeaconProposer) in advance.
There is a `BeaconChain::prepare_beacon_proposer` method which, which called, computes the proposer for the next slot. If that proposer has been registered via the `validator/prepare_beacon_proposer` API method, then the `beacon_chain.execution_layer` will be provided the `PayloadAttributes` for us in all future forkchoiceUpdated calls. An artificial forkchoiceUpdated call will be created 4s before each slot, when the head updates and when a validator updates their information.
Additionally, I added strict ordering for calls from the `BeaconChain` to the `ExecutionLayer`. I'm not certain the `ExecutionLayer` will always maintain this ordering, but it's a good start to have consistency from the `BeaconChain`. There are some deadlock opportunities introduced, they are documented in the code.
## Additional Info
- ~~Blocked on #2837~~
Co-authored-by: realbigsean <seananderson33@GMAIL.com>
## Issue Addressed
Resolves#3015
## Proposed Changes
Add JWT token based authentication to engine api requests. The jwt secret key is read from the provided file and is used to sign tokens that are used for authenticated communication with the EL node.
- [x] Interop with geth (synced `merge-devnet-4` with the `merge-kiln-v2` branch on geth)
- [x] Interop with other EL clients (nethermind on `merge-devnet-4`)
- [x] ~Implement `zeroize` for jwt secrets~
- [x] Add auth server tests with `mock_execution_layer`
- [x] Get auth working with the `execution_engine_integration` tests
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
#3010
## Proposed Changes
- move log debounce time latch to `./common/logging`
- add timelatch to limit logging for `attestations_delay_queue` and `queued_block_roots`
## Additional Info
- Is a separate crate for the time latch preferred?
- `elapsed()` could take `LOG_DEBOUNCE_INTERVAL ` as an argument to allow for different granularity.
## Issue Addressed
Addresses spec changes from v1.1.0:
- https://github.com/ethereum/consensus-specs/pull/2830
- https://github.com/ethereum/consensus-specs/pull/2846
## Proposed Changes
* Downgrade the REJECT for `HeadBlockFinalized` to an IGNORE. This applies to both unaggregated and aggregated attestations.
## Additional Info
I thought about also changing the penalty for `UnknownTargetRoot` but I don't think it's reachable in practice.
## Issue Addressed
As discussed on last-night's consensus call, the testnets next week will target the [Kiln Spec v2](https://hackmd.io/@n0ble/kiln-spec).
Presently, we support Kiln V1. V2 is backwards compatible, except for renaming `random` to `prev_randao` in:
- https://github.com/ethereum/execution-apis/pull/180
- https://github.com/ethereum/consensus-specs/pull/2835
With this PR we'll no longer be compatible with the existing Kintsugi and Kiln testnets, however we'll be ready for the testnets next week. I raised this breaking change in the call last night, we are all keen to move forward and break things.
We now target the [`merge-kiln-v2`](https://github.com/MariusVanDerWijden/go-ethereum/tree/merge-kiln-v2) branch for interop with Geth. This required adding the `--http.aauthport` to the tester to avoid a port conflict at startup.
### Changes to exec integration tests
There's some change in the `merge-kiln-v2` version of Geth that means it can't compile on a vanilla Github runner. Bumping the `go` version on the runner solved this issue.
Whilst addressing this, I refactored the `testing/execution_integration` crate to be a *binary* rather than a *library* with tests. This means that we don't need to run the `build.rs` and build Geth whenever someone runs `make lint` or `make test-release`. This is nice for everyday users, but it's also nice for CI so that we can have a specific runner for these tests and we don't need to ensure *all* runners support everything required to build all execution clients.
## More Info
- [x] ~~EF tests are failing since the rename has broken some tests that reference the old field name. I have been told there will be new tests released in the coming days (25/02/22 or 26/02/22).~~
## Issue Addressed
#3006
## Proposed Changes
This PR changes the default behaviour of lighthouse to ignore discovered IPs that are not globally routable. It adds a CLI flag, --enable-local-discovery to permit the non-global IPs in discovery.
NOTE: We should take care in merging this as I will break current set-ups that rely on local IP discovery. I made this the non-default behaviour because we dont really want to be wasting resources attempting to connect to non-routable addresses and we dont want to propagate these to others (on the chance we can connect to one of these local nodes), improving discoveries efficiency.
## Description
This PR adds a single, trivial commit (f5d2b27d78349d5a675a2615eba42cc9ae708094) atop #2986 to resolve a tests compile error. The original author (@ethDreamer) is AFK so I'm getting this one merged ☺️
Please see #2986 for more information about the other, significant changes in this PR.
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
## Issue Addressed
NA
## Proposed Changes
Adds the functionality to allow blocks to be validated/invalidated after their import as per the [optimistic sync spec](https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#how-to-optimistically-import-blocks). This means:
- Updating `ProtoArray` to allow flipping the `execution_status` of ancestors/descendants based on payload validity updates.
- Creating separation between `execution_layer` and the `beacon_chain` by creating a `PayloadStatus` struct.
- Refactoring how the `execution_layer` selects a `PayloadStatus` from the multiple statuses returned from multiple EEs.
- Adding testing framework for optimistic imports.
- Add `ExecutionBlockHash(Hash256)` new-type struct to avoid confusion between *beacon block roots* and *execution payload hashes*.
- Add `merge` to [`FORKS`](c3a793fd73/Makefile (L17)) in the `Makefile` to ensure we test the beacon chain with merge settings.
- Fix some tests here that were failing due to a missing execution layer.
## TODO
- [ ] Balance tests
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
## Proposed Changes
Lots of lint updates related to `flat_map`, `unwrap_or_else` and string patterns. I did a little more creative refactoring in the op pool, but otherwise followed Clippy's suggestions.
## Additional Info
We need this PR to unblock CI.
## Issue Addressed
N/A
## Proposed Changes
Add a HTTP API which can be used to compute the block packing data for all blocks over a discrete range of epochs.
## Usage
### Request
```
curl "http:localhost:5052/lighthouse/analysis/block_packing_efficiency?start_epoch=57730&end_epoch=57732"
```
### Response
```
[
{
"slot": "1847360",
"block_hash": "0xa7dc230659802df2f99ea3798faede2e75942bb5735d56e6bfdc2df335dcd61f",
"proposer_info": {
"validator_index": 1686,
"graffiti": ""
},
"available_attestations": 7096,
"included_attestations": 6459,
"prior_skip_slots": 0
},
...
]
```
## Additional Info
This is notably different to the existing lcli code:
- Uses `BlockReplayer` #2863 and as such runs significantly faster than the previous method.
- Corrects the off-by-one #2878
- Removes the `offline` validators component. This was only a "best guess" and simply was used as a way to determine an estimate of the "true" packing efficiency and was generally not helpful in terms of direct comparisons between different packing methods. As such it has been removed from the API and any future estimates of "offline" validators would be better suited in a separate/more targeted API or as part of 'beacon watch': #2873
- Includes `prior_skip_slots`.
## Issue Addressed
Closes#2880
## Proposed Changes
Support requests to the next epoch in proposer_duties api.
## Additional Info
Implemented with skipping proposer cache for this case because the cache for the future epoch will be missed every new slot as dependent_root is changed and we don't want to "wash it out" by saving additional values.
## Issue Addressed
I noticed in some logs some excess and unecessary discovery queries. What was happening was we were pruning our peers down to our outbound target and having some disconnect. When we are below this threshold we try to find more peers (even if we are at our peer limit). The request becomes futile because we have no more peer slots.
This PR corrects this issue and advances the pruning mechanism to favour subnet peers.
An overview the new logic added is:
- We prune peers down to a target outbound peer count which is higher than the minimum outbound peer count.
- We only search for more peers if there is room to do so, and we are below the minimum outbound peer count not the target. So this gives us some buffer for peers to disconnect. The buffer is currently 10%
The modified pruning logic is documented in the code but for reference it should do the following:
- Prune peers with bad scores first
- If we need to prune more peers, then prune peers that are subscribed to a long-lived subnet
- If we still need to prune peers, the prune peers that we have a higher density of on any given subnet which should drive for uniform peers across all subnets.
This will need a bit of testing as it modifies some significant peer management behaviours in lighthouse.
## Issue Addressed
NA
## Proposed Changes
This PR extends #3018 to address my review comments there and add automated integration tests with Geth (and other implementations, in the future).
I've also de-duplicated the "unused port" logic by creating an `common/unused_port` crate.
## Additional Info
I'm not sure if we want to merge this PR, or update #3018 and merge that. I don't mind, I'm primarily opening this PR to make sure CI works.
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
## Issue Addressed
Closes#3014
## Proposed Changes
- Rename `receipt_root` to `receipts_root`
- Rename `execute_payload` to `notify_new_payload`
- This is slightly weird since we modify everything except the actual HTTP call to the engine API. That change is expected to be implemented in #2985 (cc @ethDreamer)
- Enable "random" tests for Bellatrix.
## Notes
This will break *partially* compatibility with Kintusgi testnets in order to gain compatibility with [Kiln](https://hackmd.io/@n0ble/kiln-spec) testnets. I think it will only break the BN APIs due to the `receipts_root` change, however it might have some other effects too.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
Lighthouse gossiping late messages
## Proposed Changes
Point LH to our fork using tokio interval, which 1) works as expected 2) is more performant than the previous version that actually worked as expected
Upgrade libp2p
## Additional Info
https://github.com/libp2p/rust-libp2p/issues/2497
## Issue Addressed
#2883
## Proposed Changes
* Added `suggested-fee-recipient` & `suggested-fee-recipient-file` flags to validator client (similar to graffiti / graffiti-file implementation).
* Added proposer preparation service to VC, which sends the fee-recipient of all known validators to the BN via [/eth/v1/validator/prepare_beacon_proposer](https://github.com/ethereum/beacon-APIs/pull/178) api once per slot
* Added [/eth/v1/validator/prepare_beacon_proposer](https://github.com/ethereum/beacon-APIs/pull/178) api endpoint and preparation data caching
* Added cleanup routine to remove cached proposer preparations when not updated for 2 epochs
## Additional Info
Changed the Implementation following the discussion in #2883.
Co-authored-by: pk910 <philipp@pk910.de>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Philipp K <philipp@pk910.de>
## Issue Addressed
On a network with few nodes, it is possible that the same node can be found from a subnet discovery and a normal peer discovery at the same time.
The network behaviour loads these peers into events and processes them when it has the chance. It can happen that the same peer can enter the event queue more than once and then attempt to be dialed twice.
This PR shifts the registration of nodes in the peerdb as being dialed before they enter the NetworkBehaviour queue, preventing multiple attempts of the same peer being entered into the queue and avoiding the race condition.
## Issue Addressed
#2947
## Proposed Changes
Store messages that fail to be published due to insufficient peers for retry later. Messages expire after half an epoch and are retried if gossipsub informs us that an useful peer has connected. Currently running in Atlanta
## Additional Info
If on retry sending the messages fails they will not be tried again
## Issue Addressed
NA
## Proposed Changes
Checks to see if attestations or sync messages are still valid before "accepting" them for propagation.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Adds `STRICT_LATE_MESSAGE_PENALTIES: bool` which allows for toggling penalties for late sync/attn messages.
`STRICT_LATE_MESSAGE_PENALTIES` is set to `false`, since we're seeing a lot of late messages on the network which are causing peer drops. We can toggle the bool during testing to try and figure out what/who is the cause of these late messages.
In effect, this PR *relaxes* peer downscoring for late attns and sync committee messages.
## Additional Info
- ~~Blocked on #2974~~
The gossipsub history was increased to a good portion of a slot from 2.1 seconds in the last release.
Although it shouldn't cause too much issue, it could be related to recieving later messages than usual and interacting with our scoring system penalizing peers. For consistency, this PR reduces the time we gossip messages back to the same values of the previous release.
It also adjusts the gossipsub heartbeat time for testing purposes with a developer flag but this should not effect end users.
## Issue Addressed
This PR fixes the unnecessary `WARN Single block lookup failed` messages described here:
https://github.com/sigp/lighthouse/pull/2866#issuecomment-1008442640
## Proposed Changes
Add a new cache to the `BeaconChain` that tracks the block roots of blocks from before finalization. These could be blocks from the canonical chain (which might need to be read from disk), or old pre-finalization blocks that have been forked out.
The cache also stores a set of block roots for in-progress single block lookups, which duplicates some of the information from sync's `single_block_lookups` hashmap:
a836e180f9/beacon_node/network/src/sync/manager.rs (L192-L196)
On a live node you can confirm that the cache is working by grepping logs for the message: `Rejected attestation to finalized block`.
## Issue Addressed
N/A
## Proposed Changes
Add a HTTP API which can be used to compute the attestation performances of a validator (or all validators) over a discrete range of epochs.
Performances can be computed for a single validator, or for the global validator set.
## Usage
### Request
The API can be used as follows:
```
curl "http://localhost:5052/lighthouse/analysis/attestation_performance/{validator_index}?start_epoch=57730&end_epoch=57732"
```
Alternatively, to compute performances for the global validator set:
```
curl "http://localhost:5052/lighthouse/analysis/attestation_performance/global?start_epoch=57730&end_epoch=57732"
```
### Response
The response is JSON formatted as follows:
```
[
{
"index": 72,
"epochs": {
"57730": {
"active": true,
"head": false,
"target": false,
"source": false
},
"57731": {
"active": true,
"head": true,
"target": true,
"source": true,
"delay": 1
},
"57732": {
"active": true,
"head": true,
"target": true,
"source": true,
"delay": 1
},
}
}
]
```
> Note that the `"epochs"` are not guaranteed to be in ascending order.
## Additional Info
- This API is intended to be used in our upcoming validator analysis tooling (#2873) and will likely not be very useful for regular users. Some advanced users or block explorers may find this API useful however.
- The request range is limited to 100 epochs (since the range is inclusive and it also computes the `end_epoch` it's actually 101 epochs) to prevent Lighthouse using exceptionally large amounts of memory.