## Issue Addressed
NA
## Proposed Changes
Our `ERRO` stream has been rather noisy since the merge due to some unexpected behaviours of builders and EEs. Now that we've been running post-merge for a while, I think we can drop some of these `ERRO` to `WARN` so we're not "crying wolf".
The modified logs are:
#### `ERRO Execution engine call failed`
I'm seeing this quite frequently on Geth nodes. They seem to timeout when they're busy and it rarely indicates a serious issue. We also have logging across block import, fork choice updating and payload production that raise `ERRO` or `CRIT` when the EE times out, so I think we're not at risk of silencing actual issues.
#### `ERRO "Builder failed to reveal payload"`
In #3775 we reduced this log from `CRIT` to `ERRO` since it's common for builders to fail to reveal the block to the producer directly whilst still broadcasting it to the networ. I think it's worth dropping this to `WARN` since it's rarely interesting.
I elected to stay with `WARN` since I really do wish builders would fulfill their API promises by returning the block to us. Perhaps I'm just being pedantic here, I could be convinced otherwise.
#### `ERRO "Relay error when registering validator(s)"`
It seems like builders and/or mev-boost struggle to handle heavy loads of validator registrations. I haven't observed issues with validators not actually being registered, but I see timeouts on these endpoints many times a day. It doesn't seem like this `ERRO` is worth it.
#### `ERRO Error fetching block for peer ExecutionLayerErrorPayloadReconstruction`
This means we failed to respond to a peer on the P2P network with a block they requested because of an error in the `execution_layer`. It's very common to see timeouts or incomplete responses on this endpoint whilst the EE is busy and I don't think it's important enough for an `ERRO`. As long as the peer count stays high, I don't think the user needs to be actively concerned about how we're responding to peers.
## Additional Info
NA
* Add first efforts at broadcast
* Tidy
* Move broadcast code to client
* Progress with broadcast impl
* Rename to address change
* Fix compile errors
* Use `while` loop
* Tidy
* Flip broadcast condition
* Switch to forgetting individual indices
* Always broadcast when the node starts
* Refactor into two functions
* Add testing
* Add another test
* Tidy, add more testing
* Tidy
* Add test, rename enum
* Rename enum again
* Tidy
* Break loop early
* Add V15 schema migration
* Bump schema version
* Progress with migration
* Update beacon_node/client/src/address_change_broadcast.rs
Co-authored-by: Michael Sproul <micsproul@gmail.com>
* Fix typo in function name
---------
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
Currently there is a race between receiving blocks and receiving light client optimistic updates (in unstable), which results in processing errors. This is a continuation of PR #3693 and seeks to progress on issue #3651
## Proposed Changes
Add the parent_root to ReprocessQueueMessage::BlockImported so we can remove blocks from queue when a block arrives that has the same parent root. We use the parent root as opposed to the block_root because the LightClientOptimisticUpdate does not contain the block_root.
If light_client_optimistic_update.attested_header.canonical_root() != head_block.message().parent_root() then we queue the update. Otherwise we process immediately.
## Additional Info
michaelsproul came up with this idea.
The code was heavily based off of the attestation reprocessing.
I have not properly tested this to see if it works as intended.
* Import BLS to execution changes before Capella
* Test for BLS to execution change HTTP API
* Pack BLS to execution changes in LIFO order
* Remove unused var
* Clippy
We recently ran a large-block experiment on the testnet and plan to do a further experiment on mainnet.
Although the metrics recovered from lighthouse nodes were quite useful, I think we could do with greater resolution in the block delay metrics and get some specific values for each block (currently these can be lost to large exponential histogram buckets).
This PR increases the resolution of the block delay histogram buckets, but also introduces a new metric which records the last block delay. Depending on the polling resolution of the metric server, we can lose some block delay information, however it will always give us a specific value and we will not lose exact data based on poor resolution histogram buckets.
## Issue Addressed
While testing withdrawals with @ethDreamer we noticed lighthouse is sending empty batches when an error occurs. As LH peer receiving this, we would consider this a low tolerance action because the peer is claiming the batch is right and is empty.
## Proposed Changes
If any kind of error occurs, send a error response instead
## Additional Info
Right now we don't handle such thing as a partial batch with an error. If an error is received, the whole batch is discarded. Because of this it makes little sense to send partial batches that end with an error, so it's better to do the proposed solution instead of sending empty batches.
## Issue Addressed
Implementing the light_client_gossip topics but I'm not there yet.
Which issue # does this PR address?
Partially #3651
## Proposed Changes
Add light client gossip topics.
Please list or describe the changes introduced by this PR.
I'm going to Implement light_client_finality_update and light_client_optimistic_update gossip topics. Currently I've attempted the former and I'm seeking feedback.
## Additional Info
I've only implemented the light_client_finality_update topic because I wanted to make sure I was on the correct path. Also checking that the gossiped LightClientFinalityUpdate is the same as the locally constructed one is not implemented because caching the updates will make this much easier. Could someone give me some feedback on this please?
Please provide any additional information. For example, future considerations
or information useful for reviewers.
Co-authored-by: GeemoCandama <104614073+GeemoCandama@users.noreply.github.com>
## Issue Addressed
#3704
## Proposed Changes
Adds is_syncing_finalized: bool parameter for block verification functions. Sets the payload_verification_status to Optimistic if is_syncing_finalized is true. Uses SyncState in NetworkGlobals in BeaconProcessor to retrieve the syncing status.
## Additional Info
I could implement FinalizedSignatureVerifiedBlock if you think it would be nicer.
## Issue Addressed
Partially addresses #3651
## Proposed Changes
Adds server-side support for light_client_bootstrap_v1 topic
## Additional Info
This PR, creates each time a bootstrap without using cache, I do not know how necessary a cache is in this case as this topic is not supposed to be called frequently and IMHO we can just prevent abuse by using the limiter, but let me know what you think or if there is any caveat to this, or if it is necessary only for the sake of good practice.
Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
* add capella gossip boiler plate
* get everything compiling
Co-authored-by: realbigsean <sean@sigmaprime.io
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
* small cleanup
* small cleanup
* cargo fix + some test cleanup
* improve block production
* add fixme for potential panic
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
This PR removes duplicated block root computation.
Computing the `SignedBeaconBlock::canonical_root` has become more expensive since the merge as we need to compute the merke root of each transaction inside an `ExecutionPayload`.
Computing the root for [a mainnet block](https://beaconcha.in/slot/4704236) is taking ~10ms on my i7-8700K CPU @ 3.70GHz (no sha extensions). Given that our median seen-to-imported time for blocks is presently 300-400ms, removing a few duplicated block roots (~30ms) could represent an easy 10% improvement. When we consider that the seen-to-imported times include operations *after* the block has been placed in the early attester cache, we could expect the 30ms to be more significant WRT our seen-to-attestable times.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
I have observed scenarios on Goerli where Lighthouse was receiving attestations which reference the same, un-cached shuffling on multiple threads at the same time. Lighthouse was then loading the same state from database and determining the shuffling on multiple threads at the same time. This is unnecessary load on the disk and RAM.
This PR modifies the shuffling cache so that each entry can be either:
- A committee
- A promise for a committee (i.e., a `crossbeam_channel::Receiver`)
Now, in the scenario where we have thread A and thread B simultaneously requesting the same un-cached shuffling, we will have the following:
1. Thread A will take the write-lock on the shuffling cache, find that there's no cached committee and then create a "promise" (a `crossbeam_channel::Sender`) for a committee before dropping the write-lock.
1. Thread B will then be allowed to take the write-lock for the shuffling cache and find the promise created by thread A. It will block the current thread waiting for thread A to fulfill that promise.
1. Thread A will load the state from disk, obtain the shuffling, send it down the channel, insert the entry into the cache and then continue to verify the attestation.
1. Thread B will then receive the shuffling from the receiver, be un-blocked and then continue to verify the attestation.
In the case where thread A fails to generate the shuffling and drops the sender, the next time that specific shuffling is requested we will detect that the channel is disconnected and return a `None` entry for that shuffling. This will cause the shuffling to be re-calculated.
## Additional Info
NA
## Proposed Changes
This PR has two aims: to speed up attestation packing in the op pool, and to fix bugs in the verification of attester slashings, proposer slashings and voluntary exits. The changes are bundled into a single database schema upgrade (v12).
Attestation packing is sped up by removing several inefficiencies:
- No more recalculation of `attesting_indices` during packing.
- No (unnecessary) examination of the `ParticipationFlags`: a bitfield suffices. See `RewardCache`.
- No re-checking of attestation validity during packing: the `AttestationMap` provides attestations which are "correct by construction" (I have checked this using Hydra).
- No SSZ re-serialization for the clunky `AttestationId` type (it can be removed in a future release).
So far the speed-up seems to be roughly 2-10x, from 500ms down to 50-100ms.
Verification of attester slashings, proposer slashings and voluntary exits is fixed by:
- Tracking the `ForkVersion`s that were used to verify each message inside the `SigVerifiedOp`. This allows us to quickly re-verify that they match the head state's opinion of what the `ForkVersion` should be at the epoch(s) relevant to the message.
- Storing the `SigVerifiedOp` on disk rather than the raw operation. This allows us to continue track the fork versions after a reboot.
This is mostly contained in this commit 52bb1840ae5c4356a8fc3a51e5df23ed65ed2c7f.
## Additional Info
The schema upgrade uses the justified state to re-verify attestations and compute `attesting_indices` for them. It will drop any attestations that fail to verify, by the logic that attestations are most valuable in the few slots after they're observed, and are probably stale and useless by the time a node restarts. Exits and proposer slashings and similarly re-verified to obtain `SigVerifiedOp`s.
This PR contains a runtime killswitch `--paranoid-block-proposal` which opts out of all the optimisations in favour of closely verifying every included message. Although I'm quite sure that the optimisations are correct this flag could be useful in the event of an unforeseen emergency.
Finally, you might notice that the `RewardCache` appears quite useless in its current form because it is only updated on the hot-path immediately before proposal. My hope is that in future we can shift calls to `RewardCache::update` into the background, e.g. while performing the state advance. It is also forward-looking to `tree-states` compatibility, where iterating and indexing `state.{previous,current}_epoch_participation` is expensive and needs to be minimised.
## Issue Addressed
Solves #3390
So after checking some logs @pawanjay176 got, we conclude that this happened because we blacklisted a chain after trying it "too much". Now here, in all occurrences it seems that "too much" means we got too many download failures. This happened very slowly, exactly because the batch is allowed to stay alive for very long times after not counting penalties when the ee is offline. The error here then was not that the batch failed because of offline ee errors, but that we blacklisted a chain because of download errors, which we can't pin on the chain but on the peer. This PR fixes that.
## Proposed Changes
Adds a missing piece of logic so that if a chain fails for errors that can't be attributed to an objectively bad behavior from the peer, it is not blacklisted. The issue at hand occurred when new peers arrived claiming a head that had wrongfully blacklisted, even if the original peers participating in the chain were not penalized.
Another notable change is that we need to consider a batch invalid if it processed correctly but its next non empty batch fails processing. Now since a batch can fail processing in non empty ways, there is no need to mark as invalid previous batches.
Improves some logging as well.
## Additional Info
We should do this regardless of pausing sync on ee offline/unsynced state. This is because I think it's almost impossible to ensure a processing result will reach in a predictable order with a synced notification from the ee. Doing this handles what I think are inevitable data races when we actually pause sync
This also fixes a return that reports which batch failed and caused us some confusion checking the logs