## Issue Addressed
N/A
## Proposed Changes
Uses the `penalize_peer` function added in #3350 in sync methods as well. The existing code in sync methods missed the `ExecutionPayloadError::UnverifiedNonOptimisticCandidate` case.
## Issue Addressed
NA
## Proposed Changes
There are scenarios where the only viable head will have an invalid execution payload, in this scenario the `get_head` function on `proto_array` will return an error. We must recover from this scenario by importing blocks from the network.
This PR stops `BeaconChain::recompute_head` from returning an error so that we can't accidentally start down-scoring peers or aborting block import just because the current head has an invalid payload.
## Reviewer Notes
The following changes are included:
1. Allow `fork_choice.get_head` to fail gracefully in `BeaconChain::process_block` when trying to update the `early_attester_cache`; simply don't add the block to the cache rather than aborting the entire process.
1. Don't return an error from `BeaconChain::recompute_head_at_current_slot` and `BeaconChain::recompute_head` to defensively prevent calling functions from aborting any process just because the fork choice function failed to run.
- This should have practically no effect, since most callers were still continuing if recomputing the head failed.
- The outlier is that the API will return 200 rather than a 500 when fork choice fails.
1. Add the `ProtoArrayForkChoice::set_all_blocks_to_optimistic` function to recover from the scenario where we've rebooted and the persisted fork choice has an invalid head.
## Issue Addressed
Resolves#3351
## Proposed Changes
Returns a `ResourceUnavailable` rpc error if we are unable to serve full payloads to blocks by root and range requests because the execution layer is not synced.
## Additional Info
This PR also changes the penalties such that a `ResourceUnavailable` error is only penalized if it is an outgoing request. If we are syncing and aren't getting full block responses, then we don't have use for the peer. However, this might not be true for the incoming request case. We let the peer decide in this case if we are still useful or if we should be banned.
cc @divagant-martian please let me know if i'm missing something here.
## Issue Addressed
Add a flag that optionally enables unrealized vote tracking. Would like to test out on testnets and benchmark differences in methods of vote tracking. This PR includes a DB schema upgrade to enable to new vote tracking style.
Co-authored-by: realbigsean <sean@sigmaprime.io>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: sean <seananderson33@gmail.com>
Co-authored-by: Mac L <mjladson@pm.me>
## Issue Addressed
* #3344
## Proposed Changes
There are a number of cases during block processing where we might get an `ExecutionPayloadError` but we shouldn't penalize peers. We were forgetting to enumerate all of the non-penalizing errors in every single match statement where we are making that decision. I created a function to make it explicit when we should and should not penalize peers and I used that function in all places where this logic is needed. This way we won't make the same mistake if we add another variant of `ExecutionPayloadError` in the future.
## Issue Addressed
Duplicate of #3269. Making this since @divagant-martian opened the previous PR and she can't approve her own PR 😄
Co-authored-by: Diva M <divma@protonmail.com>
## Overview
This rather extensive PR achieves two primary goals:
1. Uses the finalized/justified checkpoints of fork choice (FC), rather than that of the head state.
2. Refactors fork choice, block production and block processing to `async` functions.
Additionally, it achieves:
- Concurrent forkchoice updates to the EL and cache pruning after a new head is selected.
- Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production.
- Concurrent per-block-processing and execution payload verification during block processing.
- The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?):
- I had to do this to deal with sending blocks into spawned tasks.
- Previously we were cloning the beacon block at least 2 times during each block processing, these clones are either removed or turned into cheaper `Arc` clones.
- We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap.
- Avoids cloning *all the blocks* in *every chain segment* during sync.
- It also has the potential to clean up our code where we need to pass an *owned* block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough 😅)
- The `BeaconChain::HeadSafetyStatus` struct was removed. It was an old relic from prior merge specs.
For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273
## Changes to `canonical_head` and `fork_choice`
Previously, the `BeaconChain` had two separate fields:
```
canonical_head: RwLock<Snapshot>,
fork_choice: RwLock<BeaconForkChoice>
```
Now, we have grouped these values under a single struct:
```
canonical_head: CanonicalHead {
cached_head: RwLock<Arc<Snapshot>>,
fork_choice: RwLock<BeaconForkChoice>
}
```
Apart from ergonomics, the only *actual* change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously.
## Breaking Changes
### The `state` (root) field in the `finalized_checkpoint` SSE event
Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might in the `finalized_checkpoint` SSE event:
1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`.
4. The state root at slot of `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots.
Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event](de2b2801c8/data/beaconrestapi/src/main/java/tech/pegasys/teku/beaconrestapi/handlers/v1/events/EventSubscriptionManager.java (L171-L182)) it uses [`getStateRootFromBlockRoot`](de2b2801c8/data/provider/src/main/java/tech/pegasys/teku/api/ChainDataProvider.java (L336-L341)) which uses (1).
I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku.
## Notes for Reviewers
I've renamed `BeaconChain::fork_choice` to `BeaconChain::recompute_head`. Doing this helped ensure I broke all previous uses of fork choice and I also find it more descriptive. It describes an action and can't be confused with trying to get a reference to the `ForkChoice` struct.
I've changed the ordering of SSE events when a block is received. It used to be `[block, finalized, head]` and now it's `[block, head, finalized]`. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking".
I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to `get_head`. Ensuring `get_head` has been run *at least once* means that the cached values doesn't need to wrapped in an `Option`. This was fairly simple, it just involved passing a `slot` to the constructor so it knows *when* it's being run. When loading a fork choice from the store and a slot clock isn't handy I've just used the `slot` that was saved in the `fork_choice_store`. That seems like it would be a faithful representation of the slot when we saved it.
I added the `genesis_time: u64` to the `BeaconChain`. It's small, constant and nice to have around.
Since we're using FC for the fin/just checkpoints, we no longer get the `0x00..00` roots at genesis. You can see I had to remove a work-around in `ef-tests` here: b56be3bc2. I can't find any reason why this would be an issue, if anything I think it'll be better since the genesis-alias has caught us out a few times (0x00..00 isn't actually a real root). Edit: I did find a case where the `network` expected the 0x00..00 alias and patched it here: 3f26ac3e2.
You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests:
- Changing tests to be `tokio::async` tests.
- Adding `.await` to fork choice, block processing and block production functions.
- Refactor of the `canonical_head` "API" provided by the `BeaconChain`. E.g., `chain.canonical_head.cached_head()` instead of `chain.canonical_head.read()`.
- Wrapping `SignedBeaconBlock` in an `Arc`.
- In the `beacon_chain/tests/block_verification`, we can't use the `lazy_static` `CHAIN_SEGMENT` variable anymore since it's generated with an async function. We just generate it in each test, not so efficient but hopefully insignificant.
I had to disable `rayon` concurrent tests in the `fork_choice` tests. This is because the use of `rayon` and `block_on` was causing a panic.
Co-authored-by: Mac L <mjladson@pm.me>
## Issue Addressed
Deprecates the step parameter in the blocks by range request
## Proposed Changes
- Modifies the BlocksByRangeRequest type to remove the step parameter and everywhere we took it into account before
- Adds a new type to still handle coding and decoding of requests that use the parameter
## Additional Info
I went with a deprecation over the type itself so that requests received outside `lighthouse_network` don't even need to deal with this parameter. After the deprecation period just removing the Old blocks by range request should be straightforward
## Issue Addressed
Partly resolves https://github.com/sigp/lighthouse/issues/3032
## Proposed Changes
Extracts some of the functionality of #3094 into a separate PR as the original PR requires a bit more work.
Do not unnecessarily penalize peers when we fail to validate received execution payloads because our execution layer is offline.
## Issue Addressed
Reduces the effect of late blocks on overall node buildup
## Proposed Changes
change the capacity of the channels used to send work for reprocessing in the beacon processor, and to send back to the main processor task, to be 75% of the capacity of the channel for receiving new events
## Additional Info
The issues we've seen suggest we should still evaluate node performance under stress, with late blocks being a big factor.
Other changes that could help:
1. right now we have a cap for queued attestations for reprocessing that applies to the sum of aggregated and unaggregated attestations. We could consider adding a separate cap that favors aggregated ones.
2. solving #2848
## Issue Addressed
#3212
## Proposed Changes
Move chain segments coming from back-fill syncing from highest priority to lowest
## Additional Info
If this does not solve the issue, next steps would be lowering the batch size for back-fill sync, and as last resort throttling the processing of these chain segments
## Issue Addressed
This fixes the low-hanging Clippy lints introduced in Rust 1.61 (due any hour now). It _ignores_ one lint, because fixing it requires a structural refactor of the validator client that needs to be done delicately. I've started on that refactor and will create another PR that can be reviewed in more depth in the coming days. I think we should merge this PR in the meantime to unblock CI.
# Description
Since the `TaskExecutor` currently requires a `Weak<Runtime>`, it's impossible to use it in an async test where the `Runtime` is created outside our scope. Whilst we *could* create a new `Runtime` instance inside the async test, dropping that `Runtime` would cause a panic (you can't drop a `Runtime` in an async context).
To address this issue, this PR creates the `enum Handle`, which supports either:
- A `Weak<Runtime>` (for use in our production code)
- A `Handle` to a runtime (for use in testing)
In theory, there should be no change to the behaviour of our production code (beyond some slightly different descriptions in HTTP 500 errors), or even our tests. If there is no change, you might ask *"why bother?"*. There are two PRs (#3070 and #3175) that are waiting on these fixes to introduce some new tests. Since we've added the EL to the `BeaconChain` (for the merge), we are now doing more async stuff in tests.
I've also added a `RuntimeExecutor` to the `BeaconChainTestHarness`. Whilst that's not immediately useful, it will become useful in the near future with all the new async testing.
## Proposed Changes
Reduce post-merge disk usage by not storing finalized execution payloads in Lighthouse's database.
⚠️ **This is achieved in a backwards-incompatible way for networks that have already merged** ⚠️. Kiln users and shadow fork enjoyers will be unable to downgrade after running the code from this PR. The upgrade migration may take several minutes to run, and can't be aborted after it begins.
The main changes are:
- New column in the database called `ExecPayload`, keyed by beacon block root.
- The `BeaconBlock` column now stores blinded blocks only.
- Lots of places that previously used full blocks now use blinded blocks, e.g. analytics APIs, block replay in the DB, etc.
- On finalization:
- `prune_abanonded_forks` deletes non-canonical payloads whilst deleting non-canonical blocks.
- `migrate_db` deletes finalized canonical payloads whilst deleting finalized states.
- Conversions between blinded and full blocks are implemented in a compositional way, duplicating some work from Sean's PR #3134.
- The execution layer has a new `get_payload_by_block_hash` method that reconstructs a payload using the EE's `eth_getBlockByHash` call.
- I've tested manually that it works on Kiln, using Geth and Nethermind.
- This isn't necessarily the most efficient method, and new engine APIs are being discussed to improve this: https://github.com/ethereum/execution-apis/pull/146.
- We're depending on the `ethers` master branch, due to lots of recent changes. We're also using a workaround for https://github.com/gakonst/ethers-rs/issues/1134.
- Payload reconstruction is used in the HTTP API via `BeaconChain::get_block`, which is now `async`. Due to the `async` fn, the `blocking_json` wrapper has been removed.
- Payload reconstruction is used in network RPC to serve blocks-by-{root,range} responses. Here the `async` adjustment is messier, although I think I've managed to come up with a reasonable compromise: the handlers take the `SendOnDrop` by value so that they can drop it on _task completion_ (after the `fn` returns). Still, this is introducing disk reads onto core executor threads, which may have a negative performance impact (thoughts appreciated).
## Additional Info
- [x] For performance it would be great to remove the cloning of full blocks when converting them to blinded blocks to write to disk. I'm going to experiment with a `put_block` API that takes the block by value, breaks it into a blinded block and a payload, stores the blinded block, and then re-assembles the full block for the caller.
- [x] We should measure the latency of blocks-by-root and blocks-by-range responses.
- [x] We should add integration tests that stress the payload reconstruction (basic tests done, issue for more extensive tests: https://github.com/sigp/lighthouse/issues/3159)
- [x] We should (manually) test the schema v9 migration from several prior versions, particularly as blocks have changed on disk and some migrations rely on being able to load blocks.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Removes the await points in sync waiting for a processor response for rpc block processing. Built on top of #3029
This also handles a couple of bugs in the previous code and adds a relatively comprehensive test suite.
## Issue Addressed
#3010
## Proposed Changes
- move log debounce time latch to `./common/logging`
- add timelatch to limit logging for `attestations_delay_queue` and `queued_block_roots`
## Additional Info
- Is a separate crate for the time latch preferred?
- `elapsed()` could take `LOG_DEBOUNCE_INTERVAL ` as an argument to allow for different granularity.
## Issue Addressed
Addresses spec changes from v1.1.0:
- https://github.com/ethereum/consensus-specs/pull/2830
- https://github.com/ethereum/consensus-specs/pull/2846
## Proposed Changes
* Downgrade the REJECT for `HeadBlockFinalized` to an IGNORE. This applies to both unaggregated and aggregated attestations.
## Additional Info
I thought about also changing the penalty for `UnknownTargetRoot` but I don't think it's reachable in practice.
## Description
This PR adds a single, trivial commit (f5d2b27d78349d5a675a2615eba42cc9ae708094) atop #2986 to resolve a tests compile error. The original author (@ethDreamer) is AFK so I'm getting this one merged ☺️
Please see #2986 for more information about the other, significant changes in this PR.
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
## Issue Addressed
NA
## Proposed Changes
Checks to see if attestations or sync messages are still valid before "accepting" them for propagation.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Adds `STRICT_LATE_MESSAGE_PENALTIES: bool` which allows for toggling penalties for late sync/attn messages.
`STRICT_LATE_MESSAGE_PENALTIES` is set to `false`, since we're seeing a lot of late messages on the network which are causing peer drops. We can toggle the bool during testing to try and figure out what/who is the cause of these late messages.
In effect, this PR *relaxes* peer downscoring for late attns and sync committee messages.
## Additional Info
- ~~Blocked on #2974~~
## Issue Addressed
This PR fixes the unnecessary `WARN Single block lookup failed` messages described here:
https://github.com/sigp/lighthouse/pull/2866#issuecomment-1008442640
## Proposed Changes
Add a new cache to the `BeaconChain` that tracks the block roots of blocks from before finalization. These could be blocks from the canonical chain (which might need to be read from disk), or old pre-finalization blocks that have been forked out.
The cache also stores a set of block roots for in-progress single block lookups, which duplicates some of the information from sync's `single_block_lookups` hashmap:
a836e180f9/beacon_node/network/src/sync/manager.rs (L192-L196)
On a live node you can confirm that the cache is working by grepping logs for the message: `Rejected attestation to finalized block`.
## Issue Addressed
- Resolves https://github.com/sigp/lighthouse/issues/2902
## Proposed Changes
As documented in https://github.com/sigp/lighthouse/issues/2902, there are some cases where we will score peers very harshly for sending attestations to an unknown head.
This PR removes the penalty when an attestation for an unknown head is received, queued for block look-up, then popped from the queue without the head block being known. This prevents peers from being penalized for an unknown block when that peer was never actually asked for the block.
Peer penalties should still be applied to the peers who *do* get the request for the block and fail to respond with a valid block. As such, peers who send us attestations to non-existent heads should eventually be booted.
## Additional Info
- [ ] Need to confirm that a timeout for a bbroot request will incur a penalty.
## Issue Addressed
NA
## Proposed Changes
We have observed occasions were under-resourced nodes will receive messages that were valid *at the time*, but later become invalidated due to long waits for a `BeaconProcessor` worker.
In this PR, we will check to see if the message was valid *at the time of receipt*. If it was initially valid but invalid now, we just ignore the message without penalizing the peer.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
I've observed some Prater nodes (and potentially some mainnet nodes) banning peers due to validator pubkey cache lock timeouts. For the `BeaconChainError`-type of errors, they're caused by internal faults and we can't necessarily tell if the peer is bad or not. I think this is causing us to ban peers unnecessarily when running on under-resourced machines.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Introduces a cache to attestation to produce atop blocks which will become the head, but are not fully imported (e.g., not inserted into the database).
Whilst attesting to a block before it's imported is rather easy, if we're going to produce that attestation then we also need to be able to:
1. Verify that attestation.
1. Respond to RPC requests for the `beacon_block_root`.
Attestation verification (1) is *partially* covered. Since we prime the shuffling cache before we insert the block into the early attester cache, we should be fine for all typical use-cases. However, it is possible that the cache is washed out before we've managed to insert the state into the database and then attestation verification will fail with a "missing beacon state"-type error.
Providing the block via RPC (2) is also partially covered, since we'll check the database *and* the early attester cache when responding a blocks-by-root request. However, we'll still omit the block from blocks-by-range requests (until the block lands in the DB). I *think* this is fine, since there's no guarantee that we return all blocks for those responses.
Another important consideration is whether or not the *parent* of the early attester block is available in the databse. If it were not, we might fail to respond to blocks-by-root request that are iterating backwards to collect a chain of blocks. I argue that *we will always have the parent of the early attester block in the database.* This is because we are holding the fork-choice write-lock when inserting the block into the early attester cache and we do not drop that until the block is in the database.
## Issue Addressed
#2834
## Proposed Changes
Change log message severity from error to debug in attestation verification when attestation state is finalized.
## Issue Addressed
New rust lints
## Proposed Changes
- Boxing some enum variants
- removing some unused fields (is the validator lockfile unused? seemed so to me)
## Additional Info
- some error fields were marked as dead code but are logged out in areas
- left some dead fields in our ef test code because I assume they are useful for debugging?
Co-authored-by: realbigsean <seananderson33@gmail.com>
* Add payload verification status to fork choice
* Pass payload verification status to import_block
* Add valid back-propagation
* Add head safety status latch to API
* Remove ExecutionLayerStatus
* Add execution info to client notifier
* Update notifier logs
* Change use of "hash" to refer to beacon block
* Shutdown on invalid finalized block
* Tidy, add comments
* Fix failing FC tests
* Allow blocks with unsafe head
* Fix forkchoiceUpdate call on startup
Added Execution Payload from Rayonism Fork
Updated new Containers to match Merge Spec
Updated BeaconBlockBody for Merge Spec
Completed updating BeaconState and BeaconBlockBody
Modified ExecutionPayload<T> to use Transaction<T>
Mostly Finished Changes for beacon-chain.md
Added some things for fork-choice.md
Update to match new fork-choice.md/fork.md changes
ran cargo fmt
Added Missing Pieces in eth2_libp2p for Merge
fix ef test
Various Changes to Conform Closer to Merge Spec
RPC Responses are for some reason not removing their timeout when they are completing.
As an example:
```
Nov 09 01:18:20.256 DEBG Received BlocksByRange Request step: 1, start_slot: 728465, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:20.263 DEBG Received BlocksByRange Request step: 1, start_slot: 728593, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:20.483 DEBG BlocksByRange Response sent returned: 63, requested: 64, current_slot: 2466389, start_slot: 728465, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:20.500 DEBG BlocksByRange Response sent returned: 64, requested: 64, current_slot: 2466389, start_slot: 728593, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:21.068 DEBG Received BlocksByRange Request step: 1, start_slot: 728529, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:21.272 DEBG BlocksByRange Response sent returned: 63, requested: 64, current_slot: 2466389, start_slot: 728529, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:23.434 DEBG Received BlocksByRange Request step: 1, start_slot: 728657, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:23.665 DEBG BlocksByRange Response sent returned: 64, requested: 64, current_slot: 2466390, start_slot: 728657, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:25.851 DEBG Received BlocksByRange Request step: 1, start_slot: 728337, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:25.851 DEBG Received BlocksByRange Request step: 1, start_slot: 728401, count: 64, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:26.094 DEBG BlocksByRange Response sent returned: 62, requested: 64, current_slot: 2466390, start_slot: 728401, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:26.100 DEBG BlocksByRange Response sent returned: 63, requested: 64, current_slot: 2466390, start_slot: 728337, msg: Failed to return all requested blocks, peer: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw
Nov 09 01:18:31.070 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:31.070 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:31.085 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:31.085 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:31.459 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:31.459 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:34.129 DEBG RPC Error direction: Incoming, score: 0, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, client: Prysm: version: a80b1c252a9b4773493b41999769bf3134ac373f, os_version: unknown, err: Stream Timeout, protocol: beacon_blocks_by_range, service: libp2p
Nov 09 01:18:34.130 WARN Timed out to a peer's request. Likely insufficient resources, reduce peer count, service: libp2p
Nov 09 01:18:35.686 DEBG Peer Manager disconnecting peer reason: Too many peers, peer_id: 16Uiu2HAmEmBURejquBUMgKAqxViNoPnSptTWLA2CfgSPnnKENBNw, service: libp2p
```
This PR is to investigate and correct the issue.
~~My current thoughts are that for some reason we are not closing the streams correctly, or fast enough, or the executor is not registering the closes and waking up.~~ - Pretty sure this is not the case, see message below for a more accurate reason.
~~I've currently added a timeout to stream closures in an attempt to force streams to close and the future to always complete.~~ I removed this
## Issue Addressed
Getting too many peers kicked due to slightly late sync committee messages as tested on.. under-performant hardware.
## Proposed Changes
Only penalize if the message is more than one slot late. Still ignore the message-
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
## Issue Addressed
Resolves#2611
## Proposed Changes
Adds a duplicate block root cache to the `BeaconProcessor`. Adds the block root to the cache before calling `process_gossip_block` and `process_rpc_block`. Since `process_rpc_block` is called only for single block lookups, we don't have to worry about batched block imports.
The block is imported from the source(gossip/rpc) that arrives first. The block that arrives second is not imported to avoid the db access issue.
There are 2 cases:
1. Block that arrives second is from rpc: In this case, we return an optimistic `BlockError::BlockIsAlreadyKnown` to sync.
2. Block that arrives second is from gossip: In this case, we only do gossip verification and forwarding but don't import the block into the the beacon chain.
## Additional info
Splits up `process_gossip_block` function to `process_gossip_unverified_block` and `process_gossip_verified_block`.
## Description
The `eth2_libp2p` crate was originally named and designed to incorporate a simple libp2p integration into lighthouse. Since its origins the crates purpose has expanded dramatically. It now houses a lot more sophistication that is specific to lighthouse and no longer just a libp2p integration.
As of this writing it currently houses the following high-level lighthouse-specific logic:
- Lighthouse's implementation of the eth2 RPC protocol and specific encodings/decodings
- Integration and handling of ENRs with respect to libp2p and eth2
- Lighthouse's discovery logic, its integration with discv5 and logic about searching and handling peers.
- Lighthouse's peer manager - This is a large module handling various aspects of Lighthouse's network, such as peer scoring, handling pings and metadata, connection maintenance and recording, etc.
- Lighthouse's peer database - This is a collection of information stored for each individual peer which is specific to lighthouse. We store connection state, sync state, last seen ips and scores etc. The data stored for each peer is designed for various elements of the lighthouse code base such as syncing and the http api.
- Gossipsub scoring - This stores a collection of gossipsub 1.1 scoring mechanisms that are continuously analyssed and updated based on the ethereum 2 networks and how Lighthouse performs on these networks.
- Lighthouse specific types for managing gossipsub topics, sync status and ENR fields
- Lighthouse's network HTTP API metrics - A collection of metrics for lighthouse network monitoring
- Lighthouse's custom configuration of all networking protocols, RPC, gossipsub, discovery, identify and libp2p.
Therefore it makes sense to rename the crate to be more akin to its current purposes, simply that it manages the majority of Lighthouse's network stack. This PR renames this crate to `lighthouse_network`
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
NA
## Proposed Changes
This PR is near-identical to https://github.com/sigp/lighthouse/pull/2652, however it is to be merged into `unstable` instead of `merge-f2f`. Please see that PR for reasoning.
I'm making this duplicate PR to merge to `unstable` in an effort to shrink the diff between `unstable` and `merge-f2f` by doing smaller, lead-up PRs.
## Additional Info
NA
## Issue Addressed
Closes#2528
## Proposed Changes
- Add `BlockTimesCache` to provide block timing information to `BeaconChain`. This allows additional metrics to be calculated for blocks that are set as head too late.
- Thread the `seen_timestamp` of blocks received from RPC responses (except blocks from syncing) through to the sync manager, similar to what is done for blocks from gossip.
## Additional Info
This provides the following additional metrics:
- `BEACON_BLOCK_OBSERVED_SLOT_START_DELAY_TIME`
- The delay between the start of the slot and when the block was first observed.
- `BEACON_BLOCK_IMPORTED_OBSERVED_DELAY_TIME`
- The delay between when the block was first observed and when the block was imported.
- `BEACON_BLOCK_HEAD_IMPORTED_DELAY_TIME`
- The delay between when the block was imported and when the block was set as head.
The metric `BEACON_BLOCK_IMPORTED_SLOT_START_DELAY_TIME` was removed.
A log is produced when a block is set as head too late, e.g.:
```
Aug 27 03:46:39.006 DEBG Delayed head block set_as_head_delay: Some(21.731066ms), imported_delay: Some(119.929934ms), observed_delay: Some(3.864596988s), block_delay: 4.006257988s, slot: 1931331, proposer_index: 24294, block_root: 0x937602c89d3143afa89088a44bdf4b4d0d760dad082abacb229495c048648a9e, service: beacon
```
## Issue Addressed
NA
## Proposed Changes
Adds the ability to verify batches of aggregated/unaggregated attestations from the network.
When the `BeaconProcessor` finds there are messages in the aggregated or unaggregated attestation queues, it will first check the length of the queue:
- `== 1` verify the attestation individually.
- `>= 2` take up to 64 of those attestations and verify them in a batch.
Notably, we only perform batch verification if the queue has a backlog. We don't apply any artificial delays to attestations to try and force them into batches.
### Batching Details
To assist with implementing batches we modify `beacon_chain::attestation_verification` to have two distinct categories for attestations:
- *Indexed* attestations: those which have passed initial validation and were valid enough for us to derive an `IndexedAttestation`.
- *Verified* attestations: those attestations which were indexed *and also* passed signature verification. These are well-formed, interesting messages which were signed by validators.
The batching functions accept `n` attestations and then return `n` attestation verification `Result`s, where those `Result`s can be any combination of `Ok` or `Err`. In other words, we attempt to verify as many attestations as possible and return specific per-attestation results so peer scores can be updated, if required.
When we batch verify attestations, we first try to map all those attestations to *indexed* attestations. If any of those attestations were able to be indexed, we then perform batch BLS verification on those indexed attestations. If the batch verification succeeds, we convert them into *verified* attestations, disabling individual signature checking. If the batch fails, we convert to verified attestations with individual signature checking enabled.
Ultimately, we optimistically try to do a batch verification of attestation signatures and fall-back to individual verification if it fails. This opens an attach vector for "poisoning" the attestations and causing us to waste a batch verification. I argue that peer scoring should do a good-enough job of defending against this and the typical-case gains massively outweigh the worst-case losses.
## Additional Info
Before this PR, attestation verification took the attestations by value (instead of by reference). It turns out that this was unnecessary and, in my opinion, resulted in some undesirable ergonomics (e.g., we had to pass the attestation back in the `Err` variant to avoid clones). In this PR I've modified attestation verification so that it now takes a reference.
I refactored the `beacon_chain/tests/attestation_verification.rs` tests so they use a builder-esque "tester" struct instead of a weird macro. It made it easier for me to test individual/batch with the same set of tests and I think it was a nice tidy-up. Notably, I did this last to try and make sure my new refactors to *actual* production code would pass under the existing test suite.
## Issue Addressed
Closes#1891Closes#1784
## Proposed Changes
Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint.
## Additional Info
- [x] Return unavailable status for out-of-range blocks requested by peers (#2561)
- [x] Implement sync daemon for fetching historical blocks (#2561)
- [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module)
- [x] Consistency check for initial block + state
- [x] Fetch the initial state and block from a beacon node HTTP endpoint
- [x] Don't crash fetching beacon states by slot from the API
- [x] Background service for state reconstruction, triggered by CLI flag or API call.
Considered out of scope for this PR:
- Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification)
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
N/A
## Proposed Changes
Add functionality in the validator monitor to provide sync committee related metrics for monitored validators.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
A Discord user presented logs which indicated a drop in their peer count caused by a variety of peers sending attestations where we'd already seen an attestation for that validator. It's presently unclear how this case came about, but during our investigation I noticed that we are down-voting peers for sending such attestations.
There are three scenarios where we may receive duplicate unagg. attestations from the same validator:
1. The validator is committing a slashable offense.
2. The gossipsub message-deduping functionality is not working as expected.
3. We received the message via the HTTP prior to seeing it via gossip.
Scenario (1) would be so costly for an attacker that I don't think we need to add DoS protection for it.
Scenario (2) seems feasible. Our "seen message" caches in gossipsub might fill up/expire and let through these duplicates. There are also cases involving message ID mismatches with the other peers. In both these cases, I don't think we should be doing 1 attestation == -1 point down-voting.
Scenario (3) is not necessarily a fault of the peer and we shouldn't down-score them for it.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
- Version bump
- Increase queue sizes for aggregated attestations and re-queued attestations.
## Additional Info
NA
## Issue Addressed
Which issue # does this PR address?
## Proposed Changes
- Add a counter metric to log when a block is received late from gossip.
- Also push a `DEBG` log for the above condition.
- Use Debug (`?`) instead of Display (`%`) for a bunch of logs in the beacon processor, so we don't have to deal with concatenated block roots.
- Add new ERRO and CRIT to HTTP API to alert users when they're publishing late blocks.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
- Bump to `v1.5.0-rc.0`.
- Increase attestation reprocessing queue size (I saw this filling up on Prater).
- Reduce error log for full attn reprocessing queue to warn.
## TODO
- [x] Manual testing
- [x] Resolve https://github.com/sigp/lighthouse/pull/2493
- [x] Include https://github.com/sigp/lighthouse/pull/2501
## Proposed Changes
* Implement the validator client and HTTP API changes necessary to support Altair
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>