Commit Graph

696 Commits

Author SHA1 Message Date
Michael Sproul
775d222299 Enable proposer boost re-orging (#2860)
## Proposed Changes

With proposer boosting implemented (#2822) we have an opportunity to re-org out late blocks.

This PR adds three flags to the BN to control this behaviour:

* `--disable-proposer-reorgs`: turn aggressive re-orging off (it's on by default).
* `--proposer-reorg-threshold N`: attempt to orphan blocks with less than N% of the committee vote. If this parameter isn't set then N defaults to 20% when the feature is enabled.
* `--proposer-reorg-epochs-since-finalization N`: only attempt to re-org late blocks when the number of epochs since finalization is less than or equal to N. The default is 2 epochs, meaning re-orgs will only be attempted when the chain is finalizing optimally.

For safety Lighthouse will only attempt a re-org under very specific conditions:

1. The block being proposed is 1 slot after the canonical head, and the canonical head is 1 slot after its parent. i.e. at slot `n + 1` rather than building on the block from slot `n` we build on the block from slot `n - 1`.
2. The current canonical head received less than N% of the committee vote. N should be set depending on the proposer boost fraction itself, the fraction of the network that is believed to be applying it, and the size of the largest entity that could be hoarding votes.
3. The current canonical head arrived after the attestation deadline from our perspective. This condition was only added to support suppression of forkchoiceUpdated messages, but makes intuitive sense.
4. The block is being proposed in the first 2 seconds of the slot. This gives it time to propagate and receive the proposer boost.
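
The conditions above can be sketched as a single predicate. This is a minimal illustration, not the Lighthouse implementation: the `ReorgInputs` and `ReorgConfig` types and all field names are hypothetical, and the defaults simply mirror the flag defaults described in this PR.

```rust
/// Hypothetical snapshot of the values needed for the decision (illustrative names).
#[derive(Clone)]
pub struct ReorgInputs {
    pub proposal_slot: u64,
    pub head_slot: u64,
    pub parent_slot: u64,
    /// Percentage (0-100) of the committee vote received by the canonical head.
    pub head_vote_percent: u64,
    pub head_arrived_after_attestation_deadline: bool,
    /// Milliseconds elapsed since the start of the proposal slot.
    pub ms_into_slot: u64,
    pub epochs_since_finalization: u64,
}

/// Hypothetical config mirroring the three CLI flags.
pub struct ReorgConfig {
    pub disabled: bool,                     // --disable-proposer-reorgs
    pub threshold_percent: u64,             // --proposer-reorg-threshold
    pub max_epochs_since_finalization: u64, // --proposer-reorg-epochs-since-finalization
}

impl Default for ReorgConfig {
    fn default() -> Self {
        Self { disabled: false, threshold_percent: 20, max_epochs_since_finalization: 2 }
    }
}

pub fn should_attempt_reorg(i: &ReorgInputs, c: &ReorgConfig) -> bool {
    if c.disabled {
        return false;
    }
    // 1. Single-slot re-org: propose at `n + 1`, head at `n`, parent at `n - 1`.
    let single_slot = i.proposal_slot == i.head_slot + 1 && i.head_slot == i.parent_slot + 1;
    // 2. The head received less than N% of the committee vote.
    let head_is_weak = i.head_vote_percent < c.threshold_percent;
    // 3. The head arrived after the attestation deadline from our perspective.
    let head_is_late = i.head_arrived_after_attestation_deadline;
    // 4. We are proposing within the first 2 seconds of the slot.
    let proposing_early = i.ms_into_slot < 2_000;
    // Flag-controlled guard: the chain must be finalizing well.
    let finalizing = i.epochs_since_finalization <= c.max_epochs_since_finalization;
    single_slot && head_is_weak && head_is_late && proposing_early && finalizing
}
```

All conditions are combined with a logical AND, so failing any single check falls back to building on the canonical head.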


## Additional Info

For the initial idea and background, see: https://github.com/ethereum/consensus-specs/pull/2353#issuecomment-950238004

There is also a specification for this feature here: https://github.com/ethereum/consensus-specs/pull/3034

Co-authored-by: Michael Sproul <micsproul@gmail.com>
Co-authored-by: pawan <pawandhananjay@gmail.com>
2022-12-13 09:57:26 +00:00
Paul Hauner
6f79263a21 Make all validator monitor logs INFO (#3727)
## Issue Addressed

NA

## Proposed Changes

This is a *potentially* contentious change, but I find it annoying that the validator monitor logs `WARN` and `ERRO` for imperfect attestations. Perfect attestation performance is unachievable (don't believe those photo-shopped beauty magazines!) since missed and poorly-packed blocks by other validators will reduce your performance.

When the validator monitor is on with tens of validators or more, I find the logs are washed out with ERROs that are not worth investigating. I suspect that users who really want to know if validators are missing attestations can do so by matching the content of the log, rather than the log level.

I'm open to feedback about this, especially from anyone who is relying on the current log levels.

## Additional Info

NA

## Breaking Changes Notes

The validator monitor will no longer emit `WARN` and `ERRO` logs for sub-optimal attestation performance. The logs will now be emitted at `INFO` level. This change was introduced to avoid cluttering the `WARN` and `ERRO` logs with alerts that are frequently triggered by the actions of other network participants (e.g., a missed block) and require no action from the user.
2022-12-13 06:24:52 +00:00
GeemoCandama
1b28ef8a8d Adding light_client gossip topics (#3693)
## Issue Addressed
Implementing the light_client_gossip topics but I'm not there yet.

Partially #3651

## Proposed Changes
Add light client gossip topics.
I'm going to implement the `light_client_finality_update` and `light_client_optimistic_update` gossip topics. So far I've attempted the former and I'm seeking feedback.

## Additional Info
I've only implemented the light_client_finality_update topic because I wanted to make sure I was on the correct path. Also checking that the gossiped LightClientFinalityUpdate is the same as the locally constructed one is not implemented because caching the updates will make this much easier. Could someone give me some feedback on this please? 


Co-authored-by: GeemoCandama <104614073+GeemoCandama@users.noreply.github.com>
2022-12-13 06:24:51 +00:00
Mac L
8cb9b5e126 Expose certain validator_monitor metrics to the HTTP API (#3760)
## Issue Addressed

#3724 

## Proposed Changes

Exposes certain `validator_monitor` metrics as an endpoint on the HTTP API. The endpoint only returns metrics for validators which are actively being monitored.

### Usage

```bash
curl -X GET "http://localhost:5052/lighthouse/ui/validator_metrics" -H "accept: application/json" | jq
```

```json
{
  "data": {
    "validators": {
      "12345": {
        "attestation_hits": 10,
        "attestation_misses": 0,
        "attestation_hit_percentage": 100,
        "attestation_head_hits": 10,
        "attestation_head_misses": 0,
        "attestation_head_hit_percentage": 100,
        "attestation_target_hits": 5,
        "attestation_target_misses": 5,
        "attestation_target_hit_percentage": 50 
      }
    }
  }
}
```
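
The percentage fields in the sample response are consistent with `hits / (hits + misses)`. A minimal sketch of that arithmetic follows; the function name and the zero-total behaviour are assumptions for illustration, not taken from the endpoint's implementation.

```rust
/// Hit percentage derived from hit and miss counters (illustrative helper).
/// Returns 0.0 when nothing has been recorded yet; that choice is an assumption.
pub fn hit_percentage(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        0.0
    } else {
        100.0 * hits as f64 / total as f64
    }
}
```

With the sample data, 10 hits and 0 misses give 100, and 5 hits and 5 misses give 50.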

## Additional Info

Based on #3756 which should be merged first.
2022-12-09 06:39:19 +00:00
Michael Sproul
84392d63fa Delete DB schema migrations for v11 and earlier (#3761)
## Proposed Changes

Now that the Gnosis merge is scheduled, all users should have upgraded beyond Lighthouse v3.0.0. Accordingly we can delete schema migrations for versions prior to v3.0.0.

## Additional Info

I also deleted the state cache stuff I added in #3714 as it turned out to be useless for the light client proofs due to the one-slot offset.
2022-12-02 00:07:43 +00:00
Michael Sproul
22115049ee Prioritise important parts of block processing (#3696)
## Issue Addressed

Closes https://github.com/sigp/lighthouse/issues/2327

## Proposed Changes

This is an extension of some ideas I implemented while working on `tree-states`:

- Cache the indexed attestations from blocks in the `ConsensusContext`. Previously we were re-computing them 3-4 times over.
- Clean up `import_block` by splitting each part into `import_block_XXX`.
- Move some stuff off hot paths, specifically:
  - Relocate non-essential tasks that were running between receiving the payload verification status and priming the early attester cache. These tasks are moved after the cache priming:
    - Attestation observation
    - Validator monitor updates
    - Slasher updates
    - Updating the shuffling cache
  - Fork choice attestation observation now happens at the end of block verification in parallel with payload verification (this seems to save 5-10ms).
  - Payload verification now happens _before_ advancing the pre-state and writing it to disk! States were previously being written eagerly and adding ~20-30ms in front of verifying the execution payload. State catchup also sometimes takes ~500ms if we get a cache miss and need to rebuild the tree hash cache.

The remaining task that's taking substantial time (~20ms) is importing the block to fork choice. I _think_ this is because of pull-tips, and we should be able to optimise it out with a clever total active balance cache in the state (which would be computed in parallel with payload verification). I've decided to leave that for future work though. For now it can be observed via the new `beacon_block_processing_post_exec_pre_attestable_seconds` metric.


Co-authored-by: Michael Sproul <micsproul@gmail.com>
2022-11-30 05:22:58 +00:00
GeemoCandama
3534c85e30 Optimize finalized chain sync by skipping newPayload messages (#3738)
## Issue Addressed

#3704 

## Proposed Changes
Adds an `is_syncing_finalized: bool` parameter to the block verification functions. Sets the `payload_verification_status` to `Optimistic` if `is_syncing_finalized` is true. Uses the `SyncState` in `NetworkGlobals` in the `BeaconProcessor` to retrieve the syncing status.
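
A rough sketch of the idea, assuming a simplified two-variant status enum and a closure standing in for the `newPayload` call to the execution engine (both are illustrative, not the actual Lighthouse types):

```rust
/// Simplified stand-in for the payload verification status.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PayloadVerificationStatus {
    Verified,
    Optimistic,
}

/// Skips the (hypothetical) `notify_new_payload` call while syncing the
/// finalized chain and marks the block as optimistically imported instead.
pub fn payload_status<F>(is_syncing_finalized: bool, notify_new_payload: F) -> PayloadVerificationStatus
where
    F: FnOnce() -> PayloadVerificationStatus,
{
    if is_syncing_finalized {
        PayloadVerificationStatus::Optimistic
    } else {
        notify_new_payload()
    }
}
```

The point of the shortcut is that the engine call is never made during finalized sync, so the closure is not invoked at all on that path.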

## Additional Info
I could implement FinalizedSignatureVerifiedBlock if you think it would be nicer.
2022-11-29 08:19:27 +00:00
Giulio rebuffo
d5a2de759b Added LightClientBootstrap V1 (#3711)
## Issue Addressed

Partially addresses #3651

## Proposed Changes

Adds server-side support for light_client_bootstrap_v1 topic

## Additional Info

This PR creates a new bootstrap for each request, without using a cache. I do not know how necessary a cache is in this case, as this topic is not supposed to be called frequently and IMHO we can just prevent abuse by using the limiter. Let me know what you think, whether there is any caveat to this, or whether a cache is only necessary for the sake of good practice.


Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
2022-11-25 05:19:00 +00:00
Michael Sproul
3be41006a6 Add --light-client-server flag and state cache utils (#3714)
## Issue Addressed

Part of https://github.com/sigp/lighthouse/issues/3651.

## Proposed Changes

Add a flag for enabling the light client server, which should be checked before gossip/RPC traffic is processed (e.g. https://github.com/sigp/lighthouse/pull/3693, https://github.com/sigp/lighthouse/pull/3711). The flag is available at runtime from `beacon_chain.config.enable_light_client_server`.

Additionally, a new method `BeaconChain::with_mutable_state_for_block` is added which I envisage being used for computing light client updates. Unfortunately its performance will be quite poor on average because it will only run quickly with access to the tree hash cache. Each slot the tree hash cache is only available for a brief window of time between the head block being processed and the state advance at 9s in the slot. When the state advance happens the cache is moved and mutated to get ready for the next slot, which makes it no longer useful for merkle proofs related to the head block. Rather than spend more time trying to optimise this I think we should continue prototyping with this code, and I'll make sure `tree-states` is ready to ship before we enable the light client server in prod (cf. https://github.com/sigp/lighthouse/pull/3206).

## Additional Info

I also fixed a bug in the implementation of `BeaconState::compute_merkle_proof` whereby the tree hash cache was moved with `.take()` but never put back with `.restore()`.
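
The class of bug described here can be shown with a toy type. This is only a sketch: `ToyState`, its byte-vector "cache" and both method names are hypothetical stand-ins, and the real `.restore()` call is modelled by assigning the `Option` back.

```rust
/// Toy state whose cache is temporarily moved out while computing a proof.
pub struct ToyState {
    tree_hash_cache: Option<Vec<u8>>,
}

impl ToyState {
    pub fn with_cache(cache: Vec<u8>) -> Self {
        Self { tree_hash_cache: Some(cache) }
    }

    pub fn has_cache(&self) -> bool {
        self.tree_hash_cache.is_some()
    }

    /// Buggy pattern: the cache is taken and silently dropped at the end of the call.
    pub fn compute_proof_buggy(&mut self) -> usize {
        let cache = self.tree_hash_cache.take();
        cache.map_or(0, |c| c.len())
    }

    /// Fixed pattern: the cache is put back once the computation is done.
    pub fn compute_proof_fixed(&mut self) -> usize {
        let cache = self.tree_hash_cache.take();
        let result = cache.as_ref().map_or(0, |c| c.len());
        self.tree_hash_cache = cache; // stand-in for `.restore()`
        result
    }
}
```

Both methods return the same result; the difference only shows up on the next call, when the buggy variant finds no cache and would have to rebuild it.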
2022-11-11 11:03:18 +00:00
GeemoCandama
c591fcd201 add checkpoint-sync-url-timeout flag (#3710)
## Issue Addressed
#3702 
## Proposed Changes
Added a `checkpoint-sync-url-timeout` flag to the CLI. Added a timeout field to `ClientGenesis::CheckpointSyncUrl` so that the configured timeout is used.

## Additional Info



Co-authored-by: GeemoCandama <104614073+GeemoCandama@users.noreply.github.com>
Co-authored-by: Michael Sproul <micsproul@gmail.com>
2022-11-11 00:38:28 +00:00
tim gretler
266d765285 Register blocks in validator monitor (#3635)
## Issue Addressed

Closes #3460

## Proposed Changes

`blocks` and `block_min_delay` are never updated in the epoch summary



Co-authored-by: Michael Sproul <micsproul@gmail.com>
2022-11-09 05:37:09 +00:00
Divma
8600645f65 Fix rust 1.65 lints (#3682)
## Issue Addressed

New lints for rust 1.65

## Proposed Changes

The notable change is the identification of parameters that are only used in recursion.

## Additional Info
na
2022-11-04 07:43:43 +00:00
ethDreamer
e8604757a2 Deposit Cache Finalization & Fast WS Sync (#2915)
## Summary

The deposit cache now has the ability to finalize deposits. This will cause it to drop unneeded deposit logs and hashes in the deposit Merkle tree that are no longer required to construct deposit proofs. The cache is finalized whenever the latest finalized checkpoint has a new `Eth1Data` with all deposits imported.

This has three benefits:

1. Improves the speed of constructing Merkle proofs for deposits as we can just replay deposits since the last finalized checkpoint instead of all historical deposits when re-constructing the Merkle tree.
2. Significantly faster weak subjectivity sync as the deposit cache can be transferred to the newly syncing node in compressed form. The Merkle tree that stores `N` finalized deposits requires a maximum of `log2(N)` hashes. The newly syncing node then only needs to download deposits since the last finalized checkpoint to have a full tree.
3. Future proofing in preparation for [EIP-4444](https://eips.ethereum.org/EIPS/eip-4444) as execution nodes will no longer be required to store logs permanently so we won't always have all historical logs available to us.

## More Details

Image to illustrate how the deposit contract merkle tree evolves and finalizes along with the resulting `DepositTreeSnapshot`
![image](https://user-images.githubusercontent.com/37123614/151465302-5fc56284-8a69-4998-b20e-45db3934ac70.png)

## Other Considerations

I've changed the structure of the `SszDepositCache` so once you load & save your database from this version of lighthouse, you will no longer be able to load it from older versions.

Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
2022-10-30 04:04:24 +00:00
Michael Sproul
77eabc5401 Revert "Optimise HTTP validator lookups" (#3658)
## Issue Addressed

This reverts commit ca9dc8e094 (PR #3559) with some modifications.

## Proposed Changes

Unfortunately that PR introduced a performance regression in fork choice. The optimisation _intended_ to build the exit and pubkey caches on the head state _only if_ they were not already built. However, due to the head state always being cloned without these caches, we ended up building them every time the head changed, leading to a ~70ms+ penalty on mainnet.

fcfd02aeec/beacon_node/beacon_chain/src/canonical_head.rs (L633-L636)

I believe this is a severe enough regression to justify immediately releasing v3.2.1 with this change.

## Additional Info

I didn't fully revert #3559, because there were some unrelated deletions of dead code in that PR which I figured we may as well keep.

An alternative would be to clone the extra caches, but this likely still imposes some cost, so in the interest of applying a conservative fix quickly, I think reversion is the best approach. The optimisation from #3559 was not even optimising a particularly significant path, it was mostly for VCs running larger numbers of inactive keys. We can re-do it in the `tree-states` world where cache clones are cheap.
2022-10-26 06:50:04 +00:00
Michael Sproul
edf23bb40e Fix attestation shuffling filter (#3629)
## Issue Addressed

Fix a bug in block production that results in blocks with 0 attestations during the first slot of an epoch.

The bug is marked by debug logs of the form:

> DEBG Discarding attestation because of missing ancestor, block_root: 0x3cc00d9c9e0883b2d0db8606278f2b8423d4902f9a1ee619258b5b60590e64f8, pivot_slot: 4042591

It occurs when trying to look up the shuffling decision root for an attestation from a slot which is prior to fork choice's finalized block. This happens frequently when proposing in the first slot of the epoch where we have:

- `current_epoch == n`
- `attestation.data.target.epoch == n - 1`
- attestation shuffling epoch `== n - 3` (decision block being the last block of `n - 3`)
- `state.finalized_checkpoint.epoch == n - 2` (first block of `n - 2` is finalized)

Hence the shuffling decision slot is out of range of the fork choice backwards iterator _by a single slot_.

Unfortunately this bug was hidden when we weren't pruning fork choice, and then reintroduced in v2.5.1 when we fixed the pruning (https://github.com/sigp/lighthouse/releases/tag/v2.5.1). There's no way to turn that off or disable the filtering in our current release, so we need a new release to fix this issue.

Fortunately, it also does not occur on every epoch boundary because of the gradual pruning of fork choice every 256 blocks (~8 epochs):

01e84b71f5/consensus/proto_array/src/proto_array_fork_choice.rs (L16)

01e84b71f5/consensus/proto_array/src/proto_array.rs (L713-L716)

So the probability of proposing a 0-attestation block given a proposal assignment is approximately `1/32 * 1/8 = 0.39%`.
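
The estimate can be reproduced with a few lines of arithmetic. The function below is a sketch with hypothetical names; it assumes 32 slots per epoch and treats the two factors as independent, as the estimate above does.

```rust
/// Approximate probability of proposing a 0-attestation block given a proposal:
/// the chance of proposing in the first slot of an epoch, multiplied by the
/// chance that this epoch boundary coincides with a fork choice prune.
pub fn zero_attestation_probability(slots_per_epoch: u64, prune_interval_blocks: u64) -> f64 {
    let epochs_between_prunes = prune_interval_blocks as f64 / slots_per_epoch as f64;
    (1.0 / slots_per_epoch as f64) * (1.0 / epochs_between_prunes)
}
```

For 32 slots per epoch and a prune every 256 blocks this evaluates to `1/256`, i.e. roughly 0.39%.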

## Proposed Changes

- Load the block's shuffling ID from fork choice and verify it against the expected shuffling ID of the head state. This code was initially written before we had settled on a representation of shuffling IDs, so I think it's a nice simplification to make use of them here rather than more ad-hoc logic that fundamentally does the same thing.

## Additional Info

Thanks to @moshe-blox for noticing this issue and bringing it to our attention.
2022-10-18 04:02:06 +00:00
Michael Sproul
59ec6b71b8 Consensus context with proposer index caching (#3604)
## Issue Addressed

Closes https://github.com/sigp/lighthouse/issues/2371

## Proposed Changes

Backport some changes from `tree-states` that remove duplicated calculations of the `proposer_index`.

With this change the proposer index should be calculated only once for each block, and then plumbed through to every place it is required.
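
A minimal sketch of the caching idea, assuming a context object that is threaded through block processing. The struct below is a simplified, hypothetical version; the closure stands in for the expensive computation from the state.

```rust
/// Simplified per-block context that memoises the proposer index.
pub struct ConsensusContext {
    slot: u64,
    proposer_index: Option<u64>,
}

impl ConsensusContext {
    pub fn new(slot: u64) -> Self {
        Self { slot, proposer_index: None }
    }

    /// Returns the cached proposer index, computing it at most once.
    /// `compute` is a stand-in for deriving the index from the beacon state.
    pub fn get_proposer_index<F>(&mut self, compute: F) -> u64
    where
        F: FnOnce(u64) -> u64,
    {
        if let Some(index) = self.proposer_index {
            return index;
        }
        let index = compute(self.slot);
        self.proposer_index = Some(index);
        index
    }
}
```

Every caller that receives the context gets the same value, and only the first call pays for the computation.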

## Additional Info

In future I hope to add more data to the consensus context that is cached on a per-epoch basis, like the effective balances of validators and the base rewards.

There are some other changes to remove indexing in tests that were also useful for `tree-states` (the `tree-states` types don't implement `Index`).
2022-10-15 22:25:54 +00:00
Michael Sproul
ca9dc8e094 Optimise HTTP validator lookups (#3559)
## Issue Addressed

While digging around in some logs I noticed that queries for validators by pubkey were taking 10ms+, which seemed too long. This was due to a loop through the entire validator registry for each lookup.

## Proposed Changes

Rather than using a loop through the registry, this PR utilises the pubkey cache which is usually initialised at the head*. In case the cache isn't built, we fall back to the previous loop logic. In the vast majority of cases I expect the cache will be built, as the validator client queries at the `head` where all caches should be built.
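
The lookup strategy can be sketched as follows. `RegistryState` and its fields are hypothetical and much simpler than a real beacon state; the sketch only shows the "use the cache if built, otherwise scan" shape.

```rust
use std::collections::HashMap;

pub type Pubkey = [u8; 48];

/// Toy state holding a validator registry and an optional pubkey cache.
pub struct RegistryState {
    pub validators: Vec<Pubkey>,
    pub pubkey_cache: Option<HashMap<Pubkey, usize>>,
}

impl RegistryState {
    /// Builds the pubkey -> validator index map.
    pub fn build_pubkey_cache(&mut self) {
        let cache = self
            .validators
            .iter()
            .enumerate()
            .map(|(index, pubkey)| (*pubkey, index))
            .collect();
        self.pubkey_cache = Some(cache);
    }

    /// O(1) lookup when the cache is built, linear scan of the registry otherwise.
    pub fn validator_index(&self, pubkey: &Pubkey) -> Option<usize> {
        match &self.pubkey_cache {
            Some(cache) => cache.get(pubkey).copied(),
            None => self.validators.iter().position(|pk| pk == pubkey),
        }
    }
}
```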

## Additional Info

*I had to modify the cache build that runs after fork choice to build the pubkey cache. I think it had been optimised out, perhaps accidentally. I think it's preferable to have the pubkey cache and the exit cache built on the head state, as they are required for verifying deposits and exits respectively, and we may as well build them off the hot path of block processing. Previously they'd get built the first time a deposit or exit needed to be verified.

I've deleted the unused `map_state` function which was obsoleted by `map_state_and_execution_optimistic`.
2022-10-15 22:25:51 +00:00
Pawan Dhananjay
8728c40102 Remove fallback support from eth1 service (#3594)
## Issue Addressed

N/A

## Proposed Changes

With https://github.com/sigp/lighthouse/pull/3214 we made it such that you can either have 1 auth endpoint or multiple non auth endpoints. Now that we are post merge on all networks (testnets and mainnet), we cannot progress a chain without a dedicated auth execution layer connection so there is no point in having a non-auth eth1-endpoint for syncing deposit cache. 

This code removes all fallback related code in the eth1 service. We still keep the single non-auth endpoint since it's useful for testing.

## Additional Info

This removes all eth1 fallback related metrics that were relevant for the monitoring service, so we might need to change the api upstream.
2022-10-04 08:33:39 +00:00
Age Manning
01b6bf7a2d Improve logging a little (#3619)
Some of the logs in combination with others could be improved. 

It will save some time debugging by improving the wording slightly.
2022-09-29 01:50:12 +00:00
Divma
bd873e7162 New rust lints for rustc 1.64.0 (#3602)
## Issue Addressed
fixes lints from the last rust release

## Proposed Changes
Fix the lints. Most of the lints raised by `clippy::question-mark` are false positives of the form described in https://github.com/rust-lang/rust-clippy/issues/9518, so that lint is allowed for now.

## Additional Info
2022-09-23 03:52:46 +00:00
Paul Hauner
fa6ad1a11a Deduplicate block root computation (#3590)
## Issue Addressed

NA

## Proposed Changes

This PR removes duplicated block root computation.

Computing the `SignedBeaconBlock::canonical_root` has become more expensive since the merge as we need to compute the Merkle root of each transaction inside an `ExecutionPayload`.

Computing the root for [a mainnet block](https://beaconcha.in/slot/4704236) is taking ~10ms on my i7-8700K CPU @ 3.70GHz (no sha extensions). Given that our median seen-to-imported time for blocks is presently 300-400ms, removing a few duplicated block roots (~30ms) could represent an easy 10% improvement. When we consider that the seen-to-imported times include operations *after* the block has been placed in the early attester cache, we could expect the 30ms to be more significant WRT our seen-to-attestable times.

## Additional Info

NA
2022-09-23 03:52:42 +00:00
Paul Hauner
96692b8e43 Impl oneshot_broadcast for committee promises (#3595)
## Issue Addressed

NA

## Proposed Changes

Fixes an issue introduced in #3574 where I erroneously assumed that a `crossbeam_channel` multiple receiver queue was a *broadcast* queue. This is incorrect; each message will be received by *only one* receiver. The effect of this mistake is these logs:

```
Sep 20 06:56:17.001 INFO Synced                                  slot: 4736079, block: 0xaa8a…180d, epoch: 148002, finalized_epoch: 148000, finalized_root: 0x2775…47f2, exec_hash: 0x2ca5…ffde (verified), peers: 6, service: slot_notifier
Sep 20 06:56:23.237 ERRO Unable to validate attestation          error: CommitteeCacheWait(RecvError), peer_id: 16Uiu2HAm2Jnnj8868tb7hCta1rmkXUf5YjqUH1YPj35DCwNyeEzs, type: "aggregated", slot: Slot(4736047), beacon_block_root: 0x88d318534b1010e0ebd79aed60b6b6da1d70357d72b271c01adf55c2b46206c1
```
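
To illustrate the difference, here is a minimal one-shot *broadcast* primitive built only on the standard library, where every receiver observes the single value. It is a sketch of the general technique under my own naming, not the `oneshot_broadcast` implementation added by this PR.

```rust
use std::sync::{Arc, Condvar, Mutex};

enum Slot<T> {
    Pending,
    Ready(T),
    SenderDropped,
}

struct Shared<T> {
    slot: Mutex<Slot<T>>,
    cond: Condvar,
}

pub struct Sender<T>(Arc<Shared<T>>);
pub struct Receiver<T>(Arc<Shared<T>>);

pub fn oneshot_broadcast<T>() -> (Sender<T>, Receiver<T>) {
    let shared = Arc::new(Shared { slot: Mutex::new(Slot::Pending), cond: Condvar::new() });
    (Sender(shared.clone()), Receiver(shared))
}

impl<T> Sender<T> {
    /// Publishes the value and wakes every waiting receiver.
    pub fn send(self, value: T) {
        *self.0.slot.lock().unwrap() = Slot::Ready(value);
        self.0.cond.notify_all();
    }
}

impl<T> Drop for Sender<T> {
    /// If the sender goes away without sending, unblock receivers with an error state.
    fn drop(&mut self) {
        let mut slot = self.0.slot.lock().unwrap();
        if matches!(*slot, Slot::Pending) {
            *slot = Slot::SenderDropped;
            self.0.cond.notify_all();
        }
    }
}

impl<T> Clone for Receiver<T> {
    fn clone(&self) -> Self {
        Receiver(self.0.clone())
    }
}

impl<T: Clone> Receiver<T> {
    /// Blocks until the value is sent; returns `None` if the sender was dropped first.
    pub fn recv(&self) -> Option<T> {
        let mut slot = self.0.slot.lock().unwrap();
        loop {
            match &*slot {
                Slot::Ready(value) => return Some(value.clone()),
                Slot::SenderDropped => return None,
                Slot::Pending => {}
            }
            slot = self.0.cond.wait(slot).unwrap();
        }
    }
}
```

Because the value stays in the shared slot and each receiver clones it, any number of waiting threads receive it, which is the behaviour a multi-consumer queue does not provide.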

## Additional Info

NA
2022-09-21 01:01:50 +00:00
Paul Hauner
a95bcba2ab Avoid holding write-lock whilst waiting on shuffling cache promise (#3589)
## Issue Addressed

NA

## Proposed Changes

Fixes a bug which hogged the write-lock for the `shuffling_cache`.

## Additional Info

NA
2022-09-19 07:58:50 +00:00
Michael Sproul
507bb9dad4 Refined payload pruning (#3587)
## Proposed Changes

Improve the payload pruning feature in several ways:

- Payload pruning is now entirely optional. It is enabled by default but can be disabled with `--prune-payloads false`. The previous `--prune-payloads-on-startup` flag from #3565 is removed.
- Initial payload pruning on startup now runs in a background thread. This thread will always load the split state, which is a small fraction of its total work (up to ~300ms) and then backtrack from that state. This pruning process ran in 2m5s on one Prater node with good I/O and 16m on a node with slower I/O.
- To work with the optional payload pruning the database function `try_load_full_block` will now attempt to load execution payloads for finalized slots _if_ pruning is currently disabled. This gives users an opt-out for the extensive traffic between the CL and EL for reconstructing payloads.

## Additional Info

If the `prune-payloads` flag is toggled on and off then the on-startup check may not see any payloads to delete and fail to clean them up. In this case the `lighthouse db prune_payloads` command should be used to force a manual sweep of the database.
2022-09-19 07:58:49 +00:00
Michael Sproul
ca42ef2e5a Prune finalized execution payloads (#3565)
## Issue Addressed

Closes https://github.com/sigp/lighthouse/issues/3556

## Proposed Changes

Delete finalized execution payloads from the database in two places:

1. When running the finalization migration in `migrate_database`. We delete the finalized payloads between the last split point and the new updated split point. _If_ payloads are already pruned prior to this then this is sufficient to prune _all_ payloads as non-canonical payloads are already deleted by the head pruner, and all canonical payloads prior to the previous split will already have been pruned.
2. To address the fact that users will update to this code _after_ the merge on mainnet (and testnets), we need a one-off scan to delete the finalized payloads from the canonical chain. This is implemented in `try_prune_execution_payloads` which runs on startup and scans the chain back to the Bellatrix fork or the anchor slot (if checkpoint synced after Bellatrix). In the case where payloads are already pruned this check only imposes a single state load for the split state, which shouldn't be _too slow_. Even so, a flag `--prune-payloads-on-startup=false` is provided to turn this off after it has run the first time, which provides faster start-up times.

There is also a new `lighthouse db prune_payloads` subcommand for users who prefer to run the pruning manually.

## Additional Info

The tests have been updated to not rely on finalized payloads in the database, instead using the `MockExecutionLayer` to reconstruct them. Additionally a check was added to `check_chain_dump` which asserts the non-existence or existence of payloads on disk depending on their slot.
2022-09-17 02:27:01 +00:00
Paul Hauner
2cd3e3a768 Avoid duplicate committee cache loads (#3574)
## Issue Addressed

NA

## Proposed Changes

I have observed scenarios on Goerli where Lighthouse was receiving attestations which reference the same, un-cached shuffling on multiple threads at the same time. Lighthouse was then loading the same state from database and determining the shuffling on multiple threads at the same time. This is unnecessary load on the disk and RAM.

This PR modifies the shuffling cache so that each entry can be either:

- A committee
- A promise for a committee (i.e., a `crossbeam_channel::Receiver`)

Now, in the scenario where we have thread A and thread B simultaneously requesting the same un-cached shuffling, we will have the following:

1. Thread A will take the write-lock on the shuffling cache, find that there's no cached committee and then create a "promise" (a `crossbeam_channel::Sender`) for a committee before dropping the write-lock.
1. Thread B will then be allowed to take the write-lock for the shuffling cache and find the promise created by thread A. It will block the current thread waiting for thread A to fulfill that promise.
1. Thread A will load the state from disk, obtain the shuffling, send it down the channel, insert the entry into the cache and then continue to verify the attestation.
1. Thread B will then receive the shuffling from the receiver, be un-blocked and then continue to verify the attestation.

In the case where thread A fails to generate the shuffling and drops the sender, the next time that specific shuffling is requested we will detect that the channel is disconnected and return a `None` entry for that shuffling. This will cause the shuffling to be re-calculated.
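
The two kinds of cache entry and the disconnected-sender case can be sketched like this. The sketch uses `std::sync::mpsc` as a stand-in for `crossbeam_channel` and a plain `Vec<u64>` as a stand-in for a committee; `CacheItem` and `resolve` are illustrative names.

```rust
use std::sync::mpsc::Receiver;
use std::sync::Arc;

pub type Committee = Arc<Vec<u64>>;

/// A cache entry is either a finished committee or a promise for one.
pub enum CacheItem {
    Committee(Committee),
    Promise(Receiver<Committee>),
}

/// Resolves an entry, blocking on a promise until it is fulfilled.
/// Returns `None` if the promising thread dropped its sender without sending,
/// which signals the caller to compute the shuffling itself.
pub fn resolve(item: CacheItem) -> Option<Committee> {
    match item {
        CacheItem::Committee(committee) => Some(committee),
        CacheItem::Promise(receiver) => receiver.recv().ok(),
    }
}
```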

## Additional Info

NA
2022-09-16 08:54:03 +00:00
Paul Hauner
7d3948c8fe Add metric for re-org distance (#3566)
## Issue Addressed

NA

## Proposed Changes

Add a metric to track the re-org distance.

## Additional Info

NA
2022-09-13 17:19:27 +00:00
tim gretler
98815516a1 Support histogram buckets (#3391)
## Issue Addressed

#3285

## Proposed Changes

Adds support for specifying histogram with buckets and adds new metric buckets for metrics mentioned in issue.

## Additional Info

Need some help for the buckets.


Co-authored-by: Michael Sproul <micsproul@gmail.com>
2022-09-13 01:57:44 +00:00
Nils Effinghausen
f682df51a1 fix description for BALANCES_CACHE_MISSES metric (#3545)
## Issue Addressed

fixes metric description


Co-authored-by: Nils Effinghausen <nils.effinghausen@t-systems.com>
2022-09-10 01:35:10 +00:00
realbigsean
cae40731a2 Strict count unrealized (#3522)
## Issue Addressed

Add a flag that can increase count unrealized strictness, defaults to false

## Proposed Changes


## Additional Info



Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: sean <seananderson33@gmail.com>
2022-09-05 04:50:47 +00:00
Paul Hauner
1a833ecc17 Add more logging for invalid payloads (#3515)
## Issue Addressed

NA

## Proposed Changes

Adds more `debug` logging to help troubleshoot invalid execution payload blocks. I was doing some of this recently and found it to be challenging.

With this PR we should be able to grep `Invalid execution payload` and get one-liners that will show the block, slot and details about the proposer.

I also changed the log in `process_invalid_execution_payload` since it was a little misleading; the `block_root` wasn't necessarily the block which had an invalid payload.

## Additional Info

NA
2022-08-29 14:34:42 +00:00
Paul Hauner
8609cced0e Reset payload statuses when resuming fork choice (#3498)
## Issue Addressed

NA

## Proposed Changes

This PR is motivated by a recent consensus failure in Geth where it returned `INVALID` for a `VALID` block. Without this PR, the only way to recover is by re-syncing Lighthouse. Whilst ELs "shouldn't have consensus failures", in reality it's something that we can expect from time to time due to the complex nature of Ethereum. Being able to recover easily will help the network recover and EL devs to troubleshoot.

The risk introduced with this PR is that genuinely INVALID payloads get a "second chance" at being imported. I believe the DoS risk here is negligible since LH needs to be restarted in order to re-process the payload. Furthermore, there's no reason to think that a well-performing EL will accept a truly invalid payload the second-time-around.

## Additional Info

This implementation has the following intricacies:

1. Instead of just resetting *invalid* payloads to optimistic, we'll also reset *valid* payloads. This is an artifact of our existing implementation.
1. We will only reset payload statuses when we detect an invalid payload present in `proto_array`
    - This helps save us from forgetting that all our blocks are valid in the "best case scenario" where there are no invalid blocks.
1. If we fail to revert the payload statuses we'll log a `CRIT` and just continue with a `proto_array` that *does not* have reverted payload statuses.
    - The code to revert statuses needs to deal with balances and proposer-boost, so it's a failure point. This is a defensive measure to avoid introducing new show-stopping bugs to LH.
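
The first two intricacies can be sketched over a flat list of statuses. The enum and function are simplified, hypothetical stand-ins for the `proto_array` types, and the sketch ignores the balance and proposer-boost handling mentioned above.

```rust
/// Simplified execution status of a fork choice node.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExecutionStatus {
    Valid,
    Invalid,
    Optimistic,
    /// Node without an execution payload; left untouched in this sketch.
    Irrelevant,
}

/// Resets both valid and invalid statuses to optimistic, but only when at
/// least one invalid payload is present. Returns whether a reset happened.
pub fn reset_payload_statuses(nodes: &mut [ExecutionStatus]) -> bool {
    if !nodes.iter().any(|status| *status == ExecutionStatus::Invalid) {
        // Best case: nothing is invalid, so keep everything we already know.
        return false;
    }
    for status in nodes.iter_mut() {
        if matches!(*status, ExecutionStatus::Valid | ExecutionStatus::Invalid) {
            *status = ExecutionStatus::Optimistic;
        }
    }
    true
}
```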
2022-08-29 14:34:41 +00:00
Michael Sproul
66eca1a882 Refactor op pool for speed and correctness (#3312)
## Proposed Changes

This PR has two aims: to speed up attestation packing in the op pool, and to fix bugs in the verification of attester slashings, proposer slashings and voluntary exits. The changes are bundled into a single database schema upgrade (v12).

Attestation packing is sped up by removing several inefficiencies: 

- No more recalculation of `attesting_indices` during packing.
- No (unnecessary) examination of the `ParticipationFlags`: a bitfield suffices. See `RewardCache`.
- No re-checking of attestation validity during packing: the `AttestationMap` provides attestations which are "correct by construction" (I have checked this using Hydra).
- No SSZ re-serialization for the clunky `AttestationId` type (it can be removed in a future release).

So far the speed-up seems to be roughly 2-10x, from 500ms down to 50-100ms.

Verification of attester slashings, proposer slashings and voluntary exits is fixed by:

- Tracking the `ForkVersion`s that were used to verify each message inside the `SigVerifiedOp`. This allows us to quickly re-verify that they match the head state's opinion of what the `ForkVersion` should be at the epoch(s) relevant to the message.
- Storing the `SigVerifiedOp` on disk rather than the raw operation. This allows us to continue tracking the fork versions after a reboot.
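
A rough sketch of the fork-version tracking idea. The struct layout, the `verified_against` field and the closure-based check are assumptions for illustration; the real `SigVerifiedOp` in the op pool may be shaped differently.

```rust
pub type ForkVersion = [u8; 4];

/// An operation together with the fork versions its signature was verified against.
pub struct SigVerifiedOp<T> {
    pub op: T,
    /// `(epoch, fork_version)` pairs recorded at verification time.
    pub verified_against: Vec<(u64, ForkVersion)>,
}

impl<T> SigVerifiedOp<T> {
    /// Cheap re-check: the signature remains valid only if the head state still
    /// maps every relevant epoch to the fork version used during verification.
    /// `fork_version_at_epoch` stands in for a lookup against the head state.
    pub fn signature_is_still_valid<F>(&self, fork_version_at_epoch: F) -> bool
    where
        F: Fn(u64) -> ForkVersion,
    {
        self.verified_against
            .iter()
            .all(|(epoch, version)| fork_version_at_epoch(*epoch) == *version)
    }
}
```

Comparing a handful of 4-byte versions avoids re-running signature verification for every message considered during block production.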

This is mostly contained in this commit 52bb1840ae5c4356a8fc3a51e5df23ed65ed2c7f.

## Additional Info

The schema upgrade uses the justified state to re-verify attestations and compute `attesting_indices` for them. It will drop any attestations that fail to verify, by the logic that attestations are most valuable in the few slots after they're observed, and are probably stale and useless by the time a node restarts. Exits and proposer slashings are similarly re-verified to obtain `SigVerifiedOp`s.

This PR contains a runtime killswitch `--paranoid-block-proposal` which opts out of all the optimisations in favour of closely verifying every included message. Although I'm quite sure that the optimisations are correct this flag could be useful in the event of an unforeseen emergency.

Finally, you might notice that the `RewardCache` appears quite useless in its current form because it is only updated on the hot-path immediately before proposal. My hope is that in future we can shift calls to `RewardCache::update` into the background, e.g. while performing the state advance. It is also forward-looking to `tree-states` compatibility, where iterating and indexing `state.{previous,current}_epoch_participation` is expensive and needs to be minimised.
2022-08-29 09:10:26 +00:00
Divma
8c69d57c2c Pause sync when EE is offline (#3428)
## Issue Addressed

#3032

## Proposed Changes

Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html) which ensures only the last value is in the receiver channel. This way we don't need to worry about message order etc.
- Sync waits for state changes concurrently with normal messages.
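
A minimal sketch of the "only notify on change" behaviour, under some loud assumptions: it swaps `tokio::sync::watch` for `std::sync::mpsc` so the example has no dependencies, it leaves out the `RwLock`, and the `EngineState` variants and `update` method are names made up for illustration. Unlike `watch`, an `mpsc` channel queues every message instead of keeping only the latest one.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Simplified engine state; the real one has more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum EngineState {
    Online,
    Offline,
}

/// Wraps the engine state so that every change goes through `update`.
struct State {
    current: EngineState,
    notifier: Sender<EngineState>,
}

impl State {
    fn new(initial: EngineState) -> (Self, Receiver<EngineState>) {
        let (notifier, receiver) = channel();
        (State { current: initial, notifier }, receiver)
    }

    /// Sets the state and notifies listeners only if the value actually changed.
    fn update(&mut self, new_state: EngineState) {
        if self.current != new_state {
            self.current = new_state;
            // A send error only means the receiver was dropped; nothing to do.
            let _ = self.notifier.send(new_state);
        }
    }
}
```

Funnelling every write through one method is what lets the wrapper guarantee that no state change goes unnotified.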

#### Pause Sync
Sync has four components, pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving it access only if the ee is responsive. I find this the simplest and most powerful approach since we don't need to deal with new sync states and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among those. Another advantage of passive pause vs active pause is that we can still keep track of active advertised chain segments so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pause range does is not sending batches for processing, resume makes all chains that are holding ready-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.
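
The passive pause and active resume for range sync can be sketched roughly as follows. This is a toy model, not the real sync code: `RangeChain`, `batch_ready` and `set_engine_online` are hypothetical names, batches are plain `u64` ids, and a `Vec` stands in for the channel to the beacon processor.

```rust
use std::collections::VecDeque;

/// Toy model of a syncing chain that can only reach the processor while the EE is online.
struct RangeChain {
    engine_online: bool,
    /// Batches that are ready for processing but are being held back.
    ready: VecDeque<u64>,
    /// Stand-in for the channel to the beacon processor.
    sent: Vec<u64>,
}

impl RangeChain {
    fn new() -> Self {
        RangeChain { engine_online: true, ready: VecDeque::new(), sent: Vec::new() }
    }

    /// Passive pause: a ready batch is forwarded only if the EE is responsive,
    /// otherwise it is held without changing any sync state.
    fn batch_ready(&mut self, batch_id: u64) {
        if self.engine_online {
            self.sent.push(batch_id);
        } else {
            self.ready.push_back(batch_id);
        }
    }

    /// Active resume: when the EE comes back, flush every held batch in order.
    fn set_engine_online(&mut self, online: bool) {
        self.engine_online = online;
        if online {
            while let Some(batch_id) = self.ready.pop_front() {
                self.sent.push(batch_id);
            }
        }
    }
}
```

Nothing about the chain's peers or targets changes while paused, which is why resuming does not require re-evaluating them.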

## Additional Info

**QUESTION**: Originally I made this to notify and change on synced state, but @pawanjay176, after talking with @paulhauner, concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up to date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review if the extra check can be removed.

Next gen of #3094

Will work best with #3439 

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
2022-08-24 23:34:56 +00:00
Paul Hauner
931153885c Run per-slot fork choice at a further distance from the head (#3487)
## Issue Addressed

NA

## Proposed Changes

Run fork choice when the head is 256 slots from the wall-clock slot, rather than 4.

The reason we don't *always* run FC is so that it doesn't slow us down during sync. As the comments state, setting the value to 256 means that we'd only have one interrupting fork-choice call if we were syncing at 20 slots/sec.
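
To make the arithmetic concrete, here is a small sketch. The constant and function names are assumptions (the description above only gives the value 256), and whether the real comparison is inclusive is also assumed.

```rust
/// Assumed name for the threshold; the value was 4 before this change.
const MAX_PER_SLOT_FORK_CHOICE_DISTANCE: u64 = 256;

/// Run per-slot fork choice only when the head is close enough to the wall-clock slot.
fn should_run_per_slot_fork_choice(head_slot: u64, wall_clock_slot: u64) -> bool {
    wall_clock_slot.saturating_sub(head_slot) <= MAX_PER_SLOT_FORK_CHOICE_DISTANCE
}

/// Milliseconds a syncing node spends inside the window at a given sync speed.
fn millis_inside_window(sync_slots_per_sec: u64) -> u64 {
    MAX_PER_SLOT_FORK_CHOICE_DISTANCE * 1000 / sync_slots_per_sec
}
```

At 20 slots/sec the last 256 slots take 12.8 seconds to sync, which is only slightly longer than one 12-second slot, so roughly one per-slot fork choice run would land inside the window.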

## Additional Info

NA
2022-08-19 04:27:24 +00:00
Paul Hauner
df358b864d Add metrics for EE PayloadStatus returns (#3486)
## Issue Addressed

NA

## Proposed Changes

Adds some metrics so we can track payload status responses from the EE. I think this will be useful for troubleshooting and alerting.

I also bumped the `BeaconChain::per_slot_task` to `debug` since it doesn't seem too noisy and would have helped us with some things we were debugging in the past.

## Additional Info

NA
2022-08-19 04:27:23 +00:00
Paul Hauner
043fa2153e Revise EE peer penalites (#3485)
## Issue Addressed

NA

## Proposed Changes

Don't penalize peers for errors that might be caused by an honest optimistic node.

## Additional Info

NA
2022-08-19 04:27:22 +00:00
Michael Sproul
25e3dc9300 Fix block verification and checkpoint sync caches (#3466)
## Issue Addressed

Closes https://github.com/sigp/lighthouse/issues/2962

## Proposed Changes

Build all caches on the checkpoint state before storing it in the database.

Additionally, fix a bug in `signature_verify_chain_segment` which prevented block verification from succeeding unless the previous epoch cache was already built. The previous epoch cache is required to verify the signatures of attestations included from previous epochs, even when all the blocks in the segment are from the same epoch.

The comments around `signature_verify_chain_segment` have also been updated to reflect the fact that it should only be used on a chain of blocks from a single epoch. I believe this restriction had already been added at some point in the past and that the current comments were just outdated (and I think because the proposer shuffling can change in the next epoch based on the blocks applied in the current epoch that this limitation is essential).
2022-08-15 01:31:00 +00:00
Paul Hauner
f03f9ba680 Increase merge-readiness lookhead (#3463)
## Issue Addressed

NA

## Proposed Changes

Start issuing merge-readiness logs 2 weeks before the Bellatrix fork epoch. Additionally, if the Bellatrix epoch is specified and the user has configured an EL, always log merge readiness. This should benefit pro-active users.

### Lookahead Reasoning

- Bellatrix fork is:
    - epoch 144896
    - slot 4636672
    - Unix timestamp: `1606824023 + (4636672 * 12) = 1662464087`
    - GMT: Tue Sep 06 2022 11:34:47 GMT+0000
- Warning start time is:
    - Unix timestamp: `1662464087 - 604800 * 2 = 1661254487`
    - GMT: Tue Aug 23 2022 11:34:47 GMT+0000
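
The timestamps above can be re-derived with a few lines of Rust. The genesis time, fork epoch and two-week lookahead come from the list above, and 32 slots per epoch and 12 seconds per slot are the mainnet values. The function names are made up for this sketch.

```rust
const GENESIS_TIME: u64 = 1_606_824_023; // mainnet genesis, Unix seconds
const SECONDS_PER_SLOT: u64 = 12;
const SLOTS_PER_EPOCH: u64 = 32;
const BELLATRIX_FORK_EPOCH: u64 = 144_896;
const SECONDS_PER_WEEK: u64 = 604_800;

/// First slot of the Bellatrix fork epoch.
fn bellatrix_fork_slot() -> u64 {
    BELLATRIX_FORK_EPOCH * SLOTS_PER_EPOCH
}

/// Unix timestamp of the first Bellatrix slot.
fn bellatrix_fork_time() -> u64 {
    GENESIS_TIME + bellatrix_fork_slot() * SECONDS_PER_SLOT
}

/// Unix timestamp at which readiness logging starts, `weeks` before the fork.
fn readiness_warning_start(weeks: u64) -> u64 {
    bellatrix_fork_time() - weeks * SECONDS_PER_WEEK
}
```

With `weeks = 2` this gives slot 4636672, fork time 1662464087 and warning start 1661254487, matching the figures listed.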

The [current expectation](https://discord.com/channels/595666850260713488/745077610685661265/1007445305198911569) is that EL and CL clients will have releases out by Aug 22nd at the latest, then an EF announcement will go out on the 23rd. If all goes well, LH will start alerting users about merge-readiness just after the announcement.

## Additional Info

NA
2022-08-15 01:30:59 +00:00
Michael Sproul
4e05f19fb5 Serve Bellatrix preset in BN API (#3425)
## Issue Addressed

Resolves #3388
Resolves #2638

## Proposed Changes

- Return the `BellatrixPreset` on `/eth/v1/config/spec` by default.
- Allow users to opt out of this by providing `--http-spec-fork=altair` (unless there's a Bellatrix fork epoch set).
- Add the Altair constants from #2638 and make serving the constants non-optional (the `http-disable-legacy-spec` flag is deprecated).
- Modify the VC to only read the `Config` and not to log extra fields. This prevents it from having to muck around parsing the `ConfigAndPreset` fields it doesn't need.

## Additional Info

This change is backwards-compatible for the VC and the BN, but is marked as a breaking change for the removal of `--http-disable-legacy-spec`.

I tried making `Config` a `superstruct` too, but getting the automatic decoding to work was a huge pain and was going to require a lot of hacks, so I gave up in favour of keeping the default-based approach we have now.
2022-08-10 07:52:59 +00:00
Pawan Dhananjay
c25934956b Remove INVALID_TERMINAL_BLOCK (#3385)
## Issue Addressed

Resolves #3379 

## Proposed Changes

Remove instances of `InvalidTerminalBlock` in lighthouse and use 
`Invalid {latest_valid_hash: "0x0000000000000000000000000000000000000000000000000000000000000000"}` 
to represent that status.
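
A minimal sketch of the representation, assuming a cut-down `PayloadStatus` enum and hypothetical helper names (the real type has more variants and fields):

```rust
/// 32-byte execution block hash.
type ExecutionBlockHash = [u8; 32];

/// Simplified payload status; only the variants needed for this sketch.
#[derive(Debug, PartialEq)]
enum PayloadStatus {
    Valid,
    Syncing,
    Invalid { latest_valid_hash: ExecutionBlockHash },
}

/// The status that replaces the removed `InvalidTerminalBlock` variant.
fn invalid_terminal_block() -> PayloadStatus {
    PayloadStatus::Invalid { latest_valid_hash: [0u8; 32] }
}

/// True if `status` is `Invalid` with an all-zero `latest_valid_hash`.
fn is_invalid_terminal_block(status: &PayloadStatus) -> bool {
    matches!(
        status,
        PayloadStatus::Invalid { latest_valid_hash } if *latest_valid_hash == [0u8; 32]
    )
}
```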
2022-08-10 07:52:58 +00:00
realbigsean
6f13727fbe Don't use the builder network if the head is optimistic (#3412)
## Issue Addressed

Resolves https://github.com/sigp/lighthouse/issues/3394

Adds a check in `is_healthy` about whether the head is optimistic when choosing whether to use the builder network. 



Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-08-09 06:05:16 +00:00
Paul Hauner
a688621919 Add support for beaconAPI in lcli functions (#3252)
## Issue Addressed

NA

## Proposed Changes

Modifies `lcli skip-slots` and `lcli transition-blocks` to allow them to source blocks/states from a beaconAPI and also gives them some more features to assist with benchmarking.

## Additional Info

Breaks the current `lcli skip-slots` and `lcli transition-blocks` APIs by changing some flag names. It should be simple enough to figure out the changes via `--help`.

Currently blocked on #3263.
2022-08-09 06:05:13 +00:00
Michael Sproul
6bc4a2cc91 Update invalid head tests (#3400)
## Proposed Changes

Update the invalid head tests so that they work with the current default fork choice configuration.

Thanks @realbigsean for fixing the persistence test and the EF tests.

Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-08-05 23:41:09 +00:00
Paul Hauner
d0beecca20 Make fork choice prune again (#3408)
## Issue Addressed

NA

## Proposed Changes

There was a regression in #3244 (released in v2.4.0) which stopped pruning fork choice (see [here](https://github.com/sigp/lighthouse/pull/3244#discussion_r935187485)).

This would form a very slow memory leak, using ~100mb per month. The release has been out for ~11 days, so users should not be seeing a dangerous increase in memory, *yet*.

Credits to @michaelsproul for noticing this 🎉 

## Additional Info

NA
2022-08-02 07:58:42 +00:00
Paul Hauner
bcfde6e7df Indicate that invalid blocks are optimistic (#3383)
## Issue Addressed

NA

## Proposed Changes

This PR will make Lighthouse return blocks with invalid payloads via the API with `execution_optimistic = true`. This seems a bit awkward; however, I think it's better than returning a 404 or some other error.

Let's consider the case where the only possible head is invalid (#3370 deals with this). In such a scenario all of the duties endpoints will start failing because the head is invalid. I think it would be better if the duties endpoints continue to work, because it's likely that even though the head is invalid the duties are still based upon valid blocks and we want the VC to have them cached. There's no risk to the VC here because we won't actually produce an attestation pointing to an invalid head.

Ultimately, I don't think it's particularly important for us to distinguish between optimistic and invalid blocks on the API. Neither should be trusted and the only *real* reason that we track this is so we can try and fork around the invalid blocks.
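
In code, the mapping amounts to something like the sketch below. The `ExecutionStatus` enum here is a simplified stand-in (the variants carry no data) and the `execution_optimistic` helper is a hypothetical name.

```rust
/// Simplified execution status of a block in fork choice.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExecutionStatus {
    Valid,
    Optimistic,
    Invalid,
    /// Pre-merge blocks, which have no execution payload to verify.
    Irrelevant,
}

/// Value of the `execution_optimistic` field in API responses for a block
/// with this status. Invalid blocks are reported the same way as optimistic ones.
fn execution_optimistic(status: ExecutionStatus) -> bool {
    matches!(status, ExecutionStatus::Optimistic | ExecutionStatus::Invalid)
}
```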


## Additional Info

- ~~Blocked on #3370~~
2022-07-30 05:08:57 +00:00
ethDreamer
034260bd99 Initial Commit of Retrospective OTB Verification (#3372)
## Issue Addressed

* #2983 

## Proposed Changes

Basically followed the [instructions laid out here](https://github.com/sigp/lighthouse/issues/2983#issuecomment-1062494947)


Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
2022-07-30 00:22:38 +00:00
realbigsean
6c2d8b2262 Builder Specs v0.2.0 (#3134)
## Issue Addressed

https://github.com/sigp/lighthouse/issues/3091

Extends https://github.com/sigp/lighthouse/pull/3062, adding pre-bellatrix block support on blinded endpoints and allowing the normal proposal flow (local payload construction) on blinded endpoints. This resulted in better fallback logic because the VC will not have to switch endpoints on failure in the BN <> Builder API; the BN can just fall back immediately and without repeating block processing that it shouldn't need to. We can also keep VC fallback from the VC<>BN API's blinded endpoint to full endpoint.

## Proposed Changes

- Pre-bellatrix blocks on blinded endpoints
- Add a new `PayloadCache` to the execution layer
- Better fallback-from-builder logic

## Todos

- [x] Remove VC transition logic
- [x] Add logic to only enable builder flow after Merge transition finalization
- [x] Tests
- [x] Fix metrics
- [x] Rustdocs


Co-authored-by: Mac L <mjladson@pm.me>
Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-07-30 00:22:37 +00:00
Paul Hauner
25f0e261cb Don't return errors when fork choice fails (#3370)
## Issue Addressed

NA

## Proposed Changes

There are scenarios where the only viable head will have an invalid execution payload. In this scenario the `get_head` function on `proto_array` will return an error. We must recover from this scenario by importing blocks from the network.

This PR stops `BeaconChain::recompute_head` from returning an error so that we can't accidentally start down-scoring peers or aborting block import just because the current head has an invalid payload.

## Reviewer Notes

The following changes are included:

1. Allow `fork_choice.get_head` to fail gracefully in `BeaconChain::process_block` when trying to update the `early_attester_cache`; simply don't add the block to the cache rather than aborting the entire process.
1. Don't return an error from `BeaconChain::recompute_head_at_current_slot` and `BeaconChain::recompute_head` to defensively prevent calling functions from aborting any process just because the fork choice function failed to run.
    - This should have practically no effect, since most callers were still continuing if recomputing the head failed.
    - The outlier is that the API will return 200 rather than a 500 when fork choice fails.
1. Add the `ProtoArrayForkChoice::set_all_blocks_to_optimistic` function to recover from the scenario where we've rebooted and the persisted fork choice has an invalid head.
2022-07-28 13:57:09 +00:00
Michael Sproul
d04fde3ba9 Remove equivocating validators from fork choice (#3371)
## Issue Addressed

Closes https://github.com/sigp/lighthouse/issues/3241
Closes https://github.com/sigp/lighthouse/issues/3242

## Proposed Changes

* [x] Implement logic to remove equivocating validators from fork choice per https://github.com/ethereum/consensus-specs/pull/2845
* [x] Update tests to v1.2.0-rc.1. The new test which exercises `equivocating_indices` is passing.
* [x] Pull in some SSZ abstractions from the `tree-states` branch that make it easier to implement Vec-compatible encoding for types like `BTreeSet` and `BTreeMap`.
* [x] Implement schema upgrades and downgrades for the database (new schema version is V11).
* [x] Apply attester slashings from blocks to fork choice
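
A rough sketch of the fork choice side of this, assuming heavily simplified types: block roots are `u64`s, and `LatestMessage`, `on_attester_slashing` and `weight_of` are names chosen for illustration. Following the linked spec change, validators that appear in both attestations of a slashing are recorded in `equivocating_indices` and their votes stop counting towards block weight.

```rust
use std::collections::BTreeSet;

/// A validator's latest vote, with block roots simplified to `u64`.
struct LatestMessage {
    validator_index: u64,
    block_root: u64,
    effective_balance: u64,
}

/// Record every validator that signed both conflicting attestations.
fn on_attester_slashing(
    attesting_indices_1: &[u64],
    attesting_indices_2: &[u64],
    equivocating_indices: &mut BTreeSet<u64>,
) {
    let first: BTreeSet<u64> = attesting_indices_1.iter().copied().collect();
    for index in attesting_indices_2 {
        if first.contains(index) {
            equivocating_indices.insert(*index);
        }
    }
}

/// Sum of balances voting for `block_root`, ignoring equivocating validators.
fn weight_of(
    block_root: u64,
    messages: &[LatestMessage],
    equivocating_indices: &BTreeSet<u64>,
) -> u64 {
    messages
        .iter()
        .filter(|m| m.block_root == block_root)
        .filter(|m| !equivocating_indices.contains(&m.validator_index))
        .map(|m| m.effective_balance)
        .sum()
}
```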

## Additional Info

* This PR doesn't need the `BTreeMap` impl, but `tree-states` does, and I don't think there's any harm in keeping it. But I could also be convinced to drop it.

Blocked on #3322.
2022-07-28 09:43:41 +00:00