Commit Graph

1873 Commits

Author SHA1 Message Date
Michael Sproul
d21f083777 Add more paths to HTTP API metrics (#3282)
## Proposed Changes

Expand the set of paths tracked by the HTTP API metrics to include all paths hit by the validator client.

These paths were only partially updated for Altair, so we were missing some of the sync committee and v2 APIs.
2022-06-23 05:19:21 +00:00
Paul Hauner
748658e32c Add some debug logs for checkpoint sync (#3281)
## Issue Addressed

NA

## Proposed Changes

I used these logs when debugging a spurious failure with Infura and thought they might be nice to have around permanently.

There's no changes to functionality in this PR, just some additional `debug!` logs.

## Additional Info

NA
2022-06-23 05:19:20 +00:00
Divma
7af5742081 Deprecate step param in BlocksByRange RPC request (#3275)
## Issue Addressed

Deprecates the step parameter in the blocks by range request

## Proposed Changes

- Modifies the BlocksByRangeRequest type to remove the step parameter and everywhere we took it into account before
- Adds a new type to still handle coding and decoding of requests that use the parameter

## Additional Info
I went with a deprecation over the type itself so that requests received outside `lighthouse_network` don't even need to deal with this parameter. After the deprecation period just removing the Old blocks by range request should be straightforward
2022-06-22 16:23:34 +00:00
Divma
2063c0fa0d Initial work to remove engines fallback from the execution_layer crate (#3257)
## Issue Addressed

Part of #3160 

## Proposed Changes
Use only the first url given in the execution engine, if more than one is provided log it.
This change only moves having multiple engines to one. The amount of code cleanup that can and should be done forward is not small and would interfere with ongoing PRs. I'm keeping the changes intentionally very very minimal.

## Additional Info

Future works:
- In [ `EngineError` ](9c429d0764/beacon_node/execution_layer/src/engines.rs (L173-L177)) the id is not needed since it now has no meaning.
- the [ `first_success_without_retry` ](9c429d0764/beacon_node/execution_layer/src/engines.rs (L348-L351)) function can return a single error.
- the [`first_success`](9c429d0764/beacon_node/execution_layer/src/engines.rs (L324)) function can return a single error.
- After the redundancy is removed for the builders, we can probably make the [ `EngineErrors` ](9c429d0764/beacon_node/execution_layer/src/lib.rs (L69)) carry a single error.
- Merge the [`Engines`](9c429d0764/beacon_node/execution_layer/src/engines.rs (L161-L165)) struct and [`Engine` ](9c429d0764/beacon_node/execution_layer/src/engines.rs (L62-L67))
- Fix the associated configurations and cli params. Not sure if both are done in https://github.com/sigp/lighthouse/pull/3214

In general I think those changes can be done incrementally and in individual pull requests.
2022-06-22 14:27:16 +00:00
Michael Sproul
efebf712dd Avoid cloning snapshots during sync (#3271)
## Issue Addressed

Closes https://github.com/sigp/lighthouse/issues/2944

## Proposed Changes

Remove snapshots from the cache during sync rather than cloning them. This reduces unnecessary cloning and memory fragmentation during sync.

## Additional Info

This PR relies on the fact that the `block_delay` cache is not populated for blocks from sync. Relying on block delay may have the side effect that a change in `block_delay` calculation could lead to: a) more clones, if block delays are added for syncing blocks or b) less clones, if blocks near the head are erroneously provided without a `block_delay`. Case (a) would be a regression to the current status quo, and (b) is low-risk given we know that the snapshot cache is current susceptible to misses (hence `tree-states`).
2022-06-20 23:20:29 +00:00
eklm
a9e158663b Fix validator_monitor_prev_epoch_ metrics (#2911)
## Issue Addressed

#2820

## Proposed Changes

The problem is that validator_monitor_prev_epoch metrics are updated only if there is EpochSummary present in summaries map for the previous epoch and it is not the case for the offline validator. Ensure that EpochSummary is inserted into summaries map also for the offline validators.
2022-06-20 04:06:30 +00:00
Pawan Dhananjay
f428719761 Do not penalize peers on execution layer offline errors (#3258)
## Issue Addressed

Partly resolves https://github.com/sigp/lighthouse/issues/3032

## Proposed Changes

Extracts some of the functionality of #3094 into a separate PR as the original PR requires a bit more work.
Do not unnecessarily penalize peers when we fail to validate received execution payloads because our execution layer is offline.
2022-06-19 23:13:40 +00:00
Paul Hauner
564d7da656 v2.3.1 (#3262)
## Issue Addressed

NA

## Proposed Changes

Bump versions

## Additional Info

NA
2022-06-14 05:25:38 +00:00
Divma
3dd50bda11 Improve substream management (#3261)
## Issue Addressed

Which issue # does this PR address?

## Proposed Changes

Please list or describe the changes introduced by this PR.

## Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.
2022-06-10 06:58:50 +00:00
Paul Hauner
11d80a6a38 Optimise per_epoch_processing low-hanging-fruit (#3254)
## Issue Addressed

NA

## Proposed Changes

- Uses a `Vec` in `SingleEpochParticipationCache` rather than `HashMap` to speed up processing times at the cost of memory usage.
- Cache the result of `integer_sqrt` rather than recomputing for each validator.
- Cache `state.previous_epoch` rather than recomputing it for each validator.

### Benchmarks

Benchmarks on a recent mainnet state using #3252 to get timing.

#### Without this PR

```
lcli skip-slots --state-path /tmp/state-0x3cdc.ssz --partial-state-advance --slots 32 --state-root 0x3cdc33cd02713d8d6cc33a6dbe2d3a5bf9af1d357de0d175a403496486ff845e --runs 10
[2022-06-09T08:21:02Z INFO  lcli::skip_slots] Using mainnet spec
[2022-06-09T08:21:02Z INFO  lcli::skip_slots] Advancing 32 slots
[2022-06-09T08:21:02Z INFO  lcli::skip_slots] Doing 10 runs
[2022-06-09T08:21:02Z INFO  lcli::skip_slots] State path: "/tmp/state-0x3cdc.ssz"
SSZ decoding /tmp/state-0x3cdc.ssz: 43ms
[2022-06-09T08:21:03Z INFO  lcli::skip_slots] Run 0: 245.718794ms
[2022-06-09T08:21:03Z INFO  lcli::skip_slots] Run 1: 245.364782ms
[2022-06-09T08:21:03Z INFO  lcli::skip_slots] Run 2: 255.866179ms
[2022-06-09T08:21:04Z INFO  lcli::skip_slots] Run 3: 243.838909ms
[2022-06-09T08:21:04Z INFO  lcli::skip_slots] Run 4: 250.431425ms
[2022-06-09T08:21:04Z INFO  lcli::skip_slots] Run 5: 248.68765ms
[2022-06-09T08:21:04Z INFO  lcli::skip_slots] Run 6: 262.051113ms
[2022-06-09T08:21:05Z INFO  lcli::skip_slots] Run 7: 264.293967ms
[2022-06-09T08:21:05Z INFO  lcli::skip_slots] Run 8: 293.202007ms
[2022-06-09T08:21:05Z INFO  lcli::skip_slots] Run 9: 264.552017ms
```

#### With this PR:

```
lcli skip-slots --state-path /tmp/state-0x3cdc.ssz --partial-state-advance --slots 32 --state-root 0x3cdc33cd02713d8d6cc33a6dbe2d3a5bf9af1d357de0d175a403496486ff845e --runs 10
[2022-06-09T08:57:59Z INFO  lcli::skip_slots] Run 0: 73.898678ms
[2022-06-09T08:57:59Z INFO  lcli::skip_slots] Run 1: 75.536978ms
[2022-06-09T08:57:59Z INFO  lcli::skip_slots] Run 2: 75.176104ms
[2022-06-09T08:57:59Z INFO  lcli::skip_slots] Run 3: 76.460828ms
[2022-06-09T08:57:59Z INFO  lcli::skip_slots] Run 4: 75.904195ms
[2022-06-09T08:58:00Z INFO  lcli::skip_slots] Run 5: 75.53077ms
[2022-06-09T08:58:00Z INFO  lcli::skip_slots] Run 6: 74.745572ms
[2022-06-09T08:58:00Z INFO  lcli::skip_slots] Run 7: 75.823489ms
[2022-06-09T08:58:00Z INFO  lcli::skip_slots] Run 8: 74.892055ms
[2022-06-09T08:58:00Z INFO  lcli::skip_slots] Run 9: 76.333569ms
```

## Additional Info

NA
2022-06-10 04:29:28 +00:00
Divma
56b4cd88ca minor libp2p upgrade (#3259)
## Issue Addressed

Upgrades libp2p
2022-06-09 23:48:51 +00:00
Divma
cfd26d25e0 do not count sync batch attempts when peer is not at fault (#3245)
## Issue Addressed
currently we count a failed attempt for a syncing chain even if the peer is not at fault. This makes us do more work if the chain fails, and heavily penalize peers, when we can simply retry. Inspired by a proposal I made to #3094 

## Proposed Changes
If a batch fails but the peer is not at fault, do not count the attempt
Also removes some annoying logs

## Additional Info
We still get a counter on ignored attempts.. just in case
2022-06-07 02:35:56 +00:00
Divma
58e223e429 update libp2p (#3233)
## Issue Addressed
na

## Proposed Changes
Updates libp2p to https://github.com/libp2p/rust-libp2p/pull/2662

## Additional Info
From comments on the relevant PRs listed, we should pay attention at peer management consistency, but I don't think anything weird will happen.
This is running in prater tok and sin
2022-06-07 02:35:55 +00:00
Michael Sproul
54cf94ea59 Fix per-slot timer in presence of clock changes (#3243)
## Issue Addressed

Fixes a timing issue that results in spurious fork choice notifier failures:

```
WARN Error signalling fork choice waiter     slot: 3962270, error: ForkChoiceSignalOutOfOrder { current: Slot(3962271), latest: Slot(3962270) }, service: beacon
```

There’s a fork choice run that is scheduled to run at the start of every slot by the `timer`, which creates a 12s interval timer when the beacon node starts up. The problem is that if there’s a bit of clock drift that gets corrected via NTP (or a leap second for that matter) then these 12s intervals will cease to line up with the start of the slot. This then creates the mismatch in slot number that we see above.

Lighthouse also runs fork choice 500ms before the slot begins, and these runs are what is conflicting with the start-of-slot runs. This means that the warning in current versions of Lighthouse is mostly cosmetic because fork choice is up to date with all but the most recent 500ms of attestations (which usually isn’t many).

## Proposed Changes

Fix the per-slot timer so that it continually re-calculates the duration to the start of the next slot and waits for that.

A side-effect of this change is that we may skip slots if the per-slot task takes >12s to run, but I think this is an unlikely scenario and an acceptable compromise.
2022-06-06 23:52:32 +00:00
Divma
493c2c037c reduce reprocess queue/channel sizes (#3239)
## Issue Addressed

Reduces the effect of late blocks on overall node buildup

## Proposed Changes

change the capacity of the channels used to send work for reprocessing in the beacon processor, and to send back to the main processor task, to be 75% of the capacity of the channel for receiving new events

## Additional Info

The issues we've seen suggest we should still evaluate node performance under stress, with late blocks being a big factor. 
Other changes that could help: 
1. right now we have a cap for queued attestations for reprocessing that applies to the sum of aggregated and unaggregated attestations. We could consider adding a separate cap that favors aggregated ones.
2. solving #2848
2022-06-06 23:52:31 +00:00
Akihito Nakano
a6d2ed6119 Fix: PeerManager doesn't remove "outbound only" peers which should be pruned (#3236)
## Issue Addressed

This is one step to address https://github.com/sigp/lighthouse/issues/3092 before introducing `quickcheck`.

I noticed an issue while I was reading the pruning implementation `PeerManager::prune_excess_peers()`. If a peer with the following condition, **`outbound_peers_pruned` counter increases but the peer is not pushed to `peers_to_prune`**.

- [outbound only](1e4ac8a4b9/beacon_node/lighthouse_network/src/peer_manager/mod.rs (L1018))
- [min_subnet_count <= MIN_SYNC_COMMITTEE_PEERS](1e4ac8a4b9/beacon_node/lighthouse_network/src/peer_manager/mod.rs (L1047))

As a result, PeerManager doesn't remove "outbound" peers which should be pruned.

Note: [`subnet_to_peer`](e0d673ea86/beacon_node/lighthouse_network/src/peer_manager/mod.rs (L999)) (HashMap) doesn't guarantee a particular order of iteration. So whether the test fails depend on the order of iteration.
2022-06-06 05:51:10 +00:00
Michael Sproul
47d57a290b Improve eth1 block cache sync (for Ropsten) (#3234)
## Issue Addressed

Fix for the eth1 cache sync issue observed on Ropsten.

## Proposed Changes

Ropsten blocks are so infrequent that they broke our algorithm for downloading eth1 blocks. We currently try to download forwards from the last block in our cache to the block with block number [`remote_highest_block - FOLLOW_DISTANCE + FOLLOW_DISTANCE / ETH1_BLOCK_TIME_TOLERANCE_FACTOR`](6f732986f1/beacon_node/eth1/src/service.rs (L489-L492)). With the tolerance set to 4 this is insufficient because we lag by 1536 blocks, which is more like ~14 hours on Ropsten. This results in us having an incomplete eth1 cache, because we should cache all blocks between -16h and -8h. Even if we were to set the tolerance to 2 for the largest allowance, we would only look back 1024 blocks which is still more than 8 hours.

For example consider this block https://ropsten.etherscan.io/block/12321390. The block from 1536 blocks earlier is 14 hours and 20 minutes before it: https://ropsten.etherscan.io/block/12319854. The block from 1024 blocks earlier is https://ropsten.etherscan.io/block/12320366, 8 hours and 48 minutes before.

- This PR introduces a new CLI flag called `--eth1-cache-follow-distance` which can be used to set the distance manually.
- A new dynamic catchup mechanism is added which detects when the cache is lagging the true eth1 chain and tries to download more blocks within the follow distance in order to catch up.
2022-06-03 06:05:03 +00:00
Mac L
55ac423872 Emit log when fee recipient values are inconsistent (#3202)
## Issue Addressed

#3156

## Proposed Changes

Emit a `WARN` log whenever the value of `fee_recipient` as returned from the EE is different from the value of `suggested_fee_recipient` as set on the BN, for example by the `--suggested-fee-recipient` CLI flag.

## Additional Info

I have set the log level to `WARN` since it is legal behaviour (meaning it isn't really an error but is important to know when it is occurring).

If we feel like this behaviour is almost always undesired (caused by a misconfiguration or malicious EE) then an `ERRO` log would be more appropriate. Happy to change it in that case.
2022-06-03 03:22:54 +00:00
Paul Hauner
16e49af8e1 Use genesis slot for node/syncing (#3226)
## Issue Addressed

NA

## Proposed Changes

Resolves this error log emitted from the VC prior to genesis:

```
WARN Unable connect to beacon node           error: ServerMessage(ErrorMessage { code: 500, message: "UNHANDLED_ERROR: UnableToReadSlot", stacktraces: [] })
```

## Additional Info

NA
2022-05-31 06:09:11 +00:00
Michael Sproul
98c8ac1a87 Fix typo in peer state transition log (#3224)
## Issue Addressed

We were logging `out_finalized_epoch` instead of `our_finalized_epoch`. I noticed this ages ago but only just got around to fixing it.

## Additional Info

I also reformatted the log line to respect the line length limit (`rustfmt` won't do it because it gets confused by the `;` in slog's log macros).
2022-05-31 06:09:10 +00:00
Paul Hauner
6f732986f1 v2.3.0 (#3222)
## Issue Addressed

NA

## Proposed Changes

Please list or describe the changes introduced by this PR.

## Additional Info

- Pending testing on our infra. **Please do not merge**
2022-05-30 01:35:10 +00:00
Divma
a7896a58cc move backfill sync jobs from highest priority to lowest (#3215)
## Issue Addressed
#3212 

## Proposed Changes
Move chain segments coming from back-fill syncing from highest priority to lowest

## Additional Info
If this does not solve the issue, next steps would be lowering the batch size for back-fill sync, and as last resort throttling the processing of these chain segments
2022-05-26 02:05:17 +00:00
Paul Hauner
f4aa17ef85 v2.3.0-rc.0 (#3218)
## Issue Addressed

NA

## Proposed Changes

Bump versions

## Additional Info

NA
2022-05-25 05:29:26 +00:00
Michael Sproul
229f883968 Avoid parallel fork choice runs during sync (#3217)
## Issue Addressed

Fixes an issue that @paulhauner found with the v2.3.0 release candidate whereby the fork choice runs introduced by #3168 tripped over each other during sync:

```
May 24 23:06:40.542 WARN Error signalling fork choice waiter     slot: 3884129, error: ForkChoiceSignalOutOfOrder { current: Slot(3884131), latest: Slot(3884129) }, service: beacon
```

This can occur because fork choice is called from the state advance _and_ the per-slot task. When one of these runs takes a long time it can end up finishing after a run from a later slot, tripping the error above. The problem is resolved by not running either of these fork choice calls during sync.

Additionally, these parallel fork choice runs were causing issues in the database:

```
May 24 07:49:05.098 WARN Found a chain that should already have been pruned, head_slot: 92925, head_block_root: 0xa76c7bf1b98e54ed4b0d8686efcfdf853484e6c2a4c67e91cbf19e5ad1f96b17, service: beacon
May 24 07:49:05.101 WARN Database migration failed               error: HotColdDBError(FreezeSlotError { current_split_slot: Slot(92608), proposed_split_slot: Slot(92576) }), service: beacon
```

In this case, two fork choice calls triggering the finalization processing were being processed out of order due to differences in their processing time, causing the background migrator to try to advance finalization _backwards_ 😳. Removing the parallel fork choice runs from sync effectively addresses the issue, because these runs are most likely to have different finalized checkpoints (because of the speed at which fork choice advances during sync). In theory it's still possible to process updates out of order if any other fork choice runs end up completing out of order, but this should be much less common. Fixing out of order fork choice runs in general is difficult as it requires architectural changes like serialising fork choice updates through a single thread, or locking fork choice along with the head when it is mutated (https://github.com/sigp/lighthouse/pull/3175).

## Proposed Changes

* Don't run per-slot fork choice during sync (if head is older than 4 slots)
* Don't run state-advance fork choice during sync (if head is older than 4 slots)
* Check for monotonic finalization updates in the background migrator. This is a good defensive check to have, and I'm not sure why we didn't have it before (we may have had it and wrongly removed it).
2022-05-25 03:27:30 +00:00
Paul Hauner
7a64994283 Call per_slot_task from a blocking thread (v2) (#3199)
*This PR was adapted from @pawanjay176's work in #3197.*

## Issue Addressed

Fixes a regression in https://github.com/sigp/lighthouse/pull/3168

## Proposed Changes

https://github.com/sigp/lighthouse/pull/3168 added calls to `fork_choice` in  `BeaconChain::per_slot_task` function. This leads to a panic as `per_slot_task` is called from an async context which calls fork choice, which then calls `block_on`.

This PR changes the timer to call the `per_slot_task` function in a blocking thread.

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
2022-05-20 23:05:07 +00:00
Michael Sproul
6eaeaa542f Fix Rust 1.61 clippy lints (#3192)
## Issue Addressed

This fixes the low-hanging Clippy lints introduced in Rust 1.61 (due any hour now). It _ignores_ one lint, because fixing it requires a structural refactor of the validator client that needs to be done delicately. I've started on that refactor and will create another PR that can be reviewed in more depth in the coming days. I think we should merge this PR in the meantime to unblock CI.
2022-05-20 05:02:13 +00:00
Michael Sproul
8fa032c8ae Run fork choice before block proposal (#3168)
## Issue Addressed

Upcoming spec change https://github.com/ethereum/consensus-specs/pull/2878

## Proposed Changes

1. Run fork choice at the start of every slot, and wait for this run to complete before proposing a block.
2. As an optimisation, also run fork choice 3/4 of the way through the slot (at 9s), _dequeueing attestations for the next slot_.
3. Remove the fork choice run from the state advance timer that occurred before advancing the state.

## Additional Info

### Block Proposal Accuracy

This change makes us more likely to propose on top of the correct head in the presence of re-orgs with proposer boost in play. The main scenario that this change is designed to address is described in the linked spec issue.

### Attestation Accuracy

This change _also_ makes us more likely to attest to the correct head. Currently in the case of a skipped slot at `slot` we only run fork choice 9s into `slot - 1`. This means the attestations from `slot - 1` aren't taken into consideration, and any boost applied to the block from `slot - 1` is not removed (it should be). In the language of the linked spec issue, this means we are liable to attest to C, even when the majority voting weight has already caused a re-org to B.

### Why remove the call before the state advance?

If we've run fork choice at the start of the slot then it has already dequeued all the attestations from the previous slot, which are the only ones eligible to influence the head in the current slot. Running fork choice again is unnecessary (unless we run it for the next slot and try to pre-empt a re-org, but I don't currently think this is a great idea).

### Performance

Based on Prater testing this adds about 5-25ms of runtime to block proposal times, which are 500-1000ms on average (and spike to 5s+ sometimes due to state handling issues 😢 ). I believe this is a small enough penalty to enable it by default, with the option to disable it via the new flag `--fork-choice-before-proposal-timeout 0`. Upcoming work on block packing and state representation will also reduce block production times in general, while removing the spikes.

### Implementation

Fork choice gets invoked at the start of the slot via the `per_slot_task` function called from the slot timer. It then uses a condition variable to signal to block production that fork choice has been updated. This is a bit funky, but it seems to work. One downside of the timer-based approach is that it doesn't happen automatically in most of the tests. The test added by this PR has to trigger the run manually.
2022-05-20 05:02:11 +00:00
realbigsean
54b58fdc01 Log out response status when we hit PayloadIdUnavailable (#3190)
## Issue Addressed

@z3n-chada is currently getting a `PayloadIdUnavailable` error when connecting lighthouse to Erigon and it's difficult to discern why so this just logs out the response status from the EE when we hit an `PayloadIdUnavailable` error

Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-05-19 06:00:48 +00:00
Akihito Nakano
695f415590 Tiny improvement: PeerManager and maximum discovery query (#3182)
## Issue Addressed

As [`Discovery` bounds the maximum discovery query](e88b18be09/beacon_node/lighthouse_network/src/discovery/mod.rs (L328)), `PeerManager` no need to handle it.

e88b18be09/beacon_node/lighthouse_network/src/discovery/mod.rs (L328)
2022-05-19 06:00:46 +00:00
Mac L
def9bc660e Remove DB migrations for legacy database schemas (#3181)
## Proposed Changes

Remove support for DB migrations that support upgrading from schema's below version 5. This is mostly for cosmetic/code quality reasons as in most circumstances upgrading from versions of Lighthouse this old will almost always require a re-sync.

## Additional Info

The minimum supported database schema is now version 5.
2022-05-17 04:54:39 +00:00
Paul Hauner
db8a6f81ea Prevent attestation to future blocks from early attester cache (#3183)
## Issue Addressed

N/A

## Proposed Changes

Prevents the early attester cache from producing attestations to future blocks. This bug could result in a missed head vote if the BN was requested to produce an attestation for an earlier slot than the head block during the (usually) short window of time between verifying a block and setting it as the head.

This bug was noticed in an [Antithesis](https://andreagrieser.com/) test and diagnosed by @realbigsean. 

## Additional Info

NA
2022-05-17 01:51:25 +00:00
Paul Hauner
38050fa460 Allow TaskExecutor to be used in async tests (#3178)
# Description

Since the `TaskExecutor` currently requires a `Weak<Runtime>`, it's impossible to use it in an async test where the `Runtime` is created outside our scope. Whilst we *could* create a new `Runtime` instance inside the async test, dropping that `Runtime` would cause a panic (you can't drop a `Runtime` in an async context).

To address this issue, this PR creates the `enum Handle`, which supports either:

- A `Weak<Runtime>` (for use in our production code)
- A `Handle` to a runtime (for use in testing)

In theory, there should be no change to the behaviour of our production code (beyond some slightly different descriptions in HTTP 500 errors), or even our tests. If there is no change, you might ask *"why bother?"*. There are two PRs (#3070 and #3175) that are waiting on these fixes to introduce some new tests. Since we've added the EL to the `BeaconChain` (for the merge), we are now doing more async stuff in tests.

I've also added a `RuntimeExecutor` to the `BeaconChainTestHarness`. Whilst that's not immediately useful, it will become useful in the near future with all the new async testing.
2022-05-16 08:35:59 +00:00
François Garillot
3f9e83e840 [refactor] Refactor Option/Result combinators (#3180)
Code simplifications using `Option`/`Result` combinators to make pattern-matches a tad simpler. 
Opinions on these loosely held, happy to adjust in review.

Tool-aided by [comby-rust](https://github.com/huitseeker/comby-rust).
2022-05-16 01:59:47 +00:00
Michael Sproul
bcdd960ab1 Separate execution payloads in the DB (#3157)
## Proposed Changes

Reduce post-merge disk usage by not storing finalized execution payloads in Lighthouse's database.

⚠️ **This is achieved in a backwards-incompatible way for networks that have already merged** ⚠️. Kiln users and shadow fork enjoyers will be unable to downgrade after running the code from this PR. The upgrade migration may take several minutes to run, and can't be aborted after it begins.

The main changes are:

- New column in the database called `ExecPayload`, keyed by beacon block root.
- The `BeaconBlock` column now stores blinded blocks only.
- Lots of places that previously used full blocks now use blinded blocks, e.g. analytics APIs, block replay in the DB, etc.
- On finalization:
    - `prune_abanonded_forks` deletes non-canonical payloads whilst deleting non-canonical blocks.
    - `migrate_db` deletes finalized canonical payloads whilst deleting finalized states.
- Conversions between blinded and full blocks are implemented in a compositional way, duplicating some work from Sean's PR #3134.
- The execution layer has a new `get_payload_by_block_hash` method that reconstructs a payload using the EE's `eth_getBlockByHash` call.
   - I've tested manually that it works on Kiln, using Geth and Nethermind.
   - This isn't necessarily the most efficient method, and new engine APIs are being discussed to improve this: https://github.com/ethereum/execution-apis/pull/146.
   - We're depending on the `ethers` master branch, due to lots of recent changes. We're also using a workaround for https://github.com/gakonst/ethers-rs/issues/1134.
- Payload reconstruction is used in the HTTP API via `BeaconChain::get_block`, which is now `async`. Due to the `async` fn, the `blocking_json` wrapper has been removed.
- Payload reconstruction is used in network RPC to serve blocks-by-{root,range} responses. Here the `async` adjustment is messier, although I think I've managed to come up with a reasonable compromise: the handlers take the `SendOnDrop` by value so that they can drop it on _task completion_ (after the `fn` returns). Still, this is introducing disk reads onto core executor threads, which may have a negative performance impact (thoughts appreciated).

## Additional Info

- [x] For performance it would be great to remove the cloning of full blocks when converting them to blinded blocks to write to disk. I'm going to experiment with a `put_block` API that takes the block by value, breaks it into a blinded block and a payload, stores the blinded block, and then re-assembles the full block for the caller.
- [x] We should measure the latency of blocks-by-root and blocks-by-range responses.
- [x] We should add integration tests that stress the payload reconstruction (basic tests done, issue for more extensive tests: https://github.com/sigp/lighthouse/issues/3159)
- [x] We should (manually) test the schema v9 migration from several prior versions, particularly as blocks have changed on disk and some migrations rely on being able to load blocks.

Co-authored-by: Paul Hauner <paul@paulhauner.com>
2022-05-12 00:42:17 +00:00
Michael Sproul
ae47a93c42 Don't panic in forkchoiceUpdated handler (#3165)
## Issue Addressed

Fix a panic due to misuse of the Tokio executor when processing a forkchoiceUpdated response. We were previously calling `process_invalid_execution_payload` from the async function `update_execution_engine_forkchoice_async`, which resulted in a panic because `process_invalid_execution_payload` contains a call to fork choice, which ultimately calls `block_on`.

An example backtrace can be found here: https://gist.github.com/michaelsproul/ac5da03e203d6ffac672423eaf52fb20

## Proposed Changes

Wrap the call to `process_invalid_execution_payload` in a `spawn_blocking` so that `block_on` is no longer called from an async context.

## Additional Info

- I've been thinking about how to catch bugs like this with static analysis (a new Clippy lint).
- The payload validation tests have been re-worked to support distinct responses from the mock EE for newPayload and forkchoiceUpdated. Three new tests have been added covering the `Invalid`, `InvalidBlockHash` and `InvalidTerminalBlock` cases.
- I think we need a bunch more tests of different legal and illegal variations
2022-05-04 23:30:34 +00:00
Pawan Dhananjay
db0beb5178 Poll shutdown timeout in rpc handler (#3153)
## Issue Addressed

N/A

## Proposed Changes

Previously, we were using `Sleep::is_elapsed()` to check if the shutdown timeout had triggered without polling the sleep. This PR polls the sleep timer.
2022-04-13 03:54:44 +00:00
Divma
580d2f7873 log upgrades + prevent dialing of disconnecting peers (#3148)
## Issue Addressed
We still ping peers that are considered in a disconnecting state

## Proposed Changes

Do not ping peers once we decide they are disconnecting
Upgrade logs about ignored rpc messages

## Additional Info
--
2022-04-13 03:54:43 +00:00
Paul Hauner
b49b4291a3 Disallow attesting to optimistic head (#3140)
## Issue Addressed

NA

## Proposed Changes

Disallow the production of attestations and retrieval of unaggregated attestations when they reference an optimistic head. Add tests to this end.

I also moved `BeaconChain::produce_unaggregated_attestation_for_block` to the `BeaconChainHarness`. It was only being used during tests, so it's nice to stop pretending it's production code. I also needed something that could produce attestations to optimistic blocks in order to simulate scenarios where the justified checkpoint is determined invalid (if no one would attest to an optimistic block, we could never justify it and then flip it to invalid).

## Additional Info

- ~~Blocked on #3126~~
2022-04-13 03:54:42 +00:00
Divma
7366266bd1 keep failed finalized chains to avoid retries (#3142)
## Issue Addressed

In very rare occasions we've seen most if not all our peers in a chain with which we don't agree. Purging these peers can take a very long time: number of retries of the chain. Meanwhile sync is caught in a loop trying the chain again and again. This makes it so that we fast track purging peers via registering the failed chain to prevent retrying for some time (30 seconds). Longer times could be dangerous since a chain can fail if a batch fails to download for example. In this case, I think it's still acceptable to fast track purging peers since they are nor providing the required info anyway 

Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
2022-04-13 01:10:55 +00:00
Michael Sproul
aa72088f8f v2.2.1 (#3149)
## Issue Addressed

Addresses sync stalls on v2.2.0 (i.e. https://github.com/sigp/lighthouse/issues/3147).

## Additional Info

I've avoided doing a full `cargo update` because I noticed there's a new patch version of libp2p and thought it could do with some more testing.



Co-authored-by: Paul Hauner <paul@paulhauner.com>
2022-04-12 02:52:12 +00:00
Paul Hauner
c8edeaff29 Don't log crits for missing EE before Bellatrix (#3150)
## Issue Addressed

NA

## Proposed Changes

Fixes an issue introduced in #3088 which was causing unnecessary `crit` logs on networks without Bellatrix enabled.

## Additional Info

NA
2022-04-11 23:14:47 +00:00
Pawan Dhananjay
fff4dd6311 Fix rpc limits version 2 (#3146)
## Issue Addressed

N/A

## Proposed Changes

https://github.com/sigp/lighthouse/pull/3133 changed the rpc type limits to be fork aware i.e. if our current fork based on wall clock slot is Altair, then we apply only altair rpc type limits. This is a bug because phase0 blocks can still be sent over rpc and phase 0 block minimum size is smaller than altair block minimum size. So a phase0 block with `size < SIGNED_BEACON_BLOCK_ALTAIR_MIN` will return an `InvalidData` error as it doesn't pass the rpc types bound check.

This error can be seen when we try syncing pre-altair blocks with size smaller than `SIGNED_BEACON_BLOCK_ALTAIR_MIN`.

This PR fixes the issue by also accounting for forks earlier than current_fork in the rpc limits calculation in the  `rpc_block_limits_by_fork` function. I decided to hardcode the limits in the function because that seemed simpler than calculating previous forks based on current fork and doing a min across forks. Adding a new fork variant is simple and can the limits can be easily checked in a review. 

Adds unit tests and modifies the syncing simulator to check the syncing from across fork boundaries. 
The syncing simulator's block 1 would always be of phase 0 minimum size (404 bytes) which is smaller than altair min block size (since block 1 contains no attestations).
2022-04-07 23:45:38 +00:00
ethDreamer
22002a4e68 Transition Block Proposer Preparation (#3088)
## Issue Addressed

- #3058 

Co-authored-by: Paul Hauner <paul@paulhauner.com>
2022-04-07 14:03:34 +00:00
Aren49
5ff4013263 Fix SPRP default value in cli (#3145)
Changed SPRP to the correct default value of 8192.
2022-04-07 04:04:11 +00:00
Paul Hauner
8a40763183 Ensure VALID response from fcU updates protoarray (#3126)
## Issue Addressed

NA

## Proposed Changes

Ensures that a `VALID` response from a `forkchoiceUpdate` call will update that block in `ProtoArray`.

I also had to modify the mock execution engine so it wouldn't return valid when all payloads were supposed to be some other static value.

## Additional Info

NA
2022-04-05 20:58:17 +00:00
Paul Hauner
42cdaf5840 Add tests for importing blocks on invalid parents (#3123)
## Issue Addressed

NA

## Proposed Changes

- Adds more checks to prevent importing blocks atop parent with invalid execution payloads.
- Adds a test for these conditions.

## Additional Info

NA
2022-04-05 20:58:16 +00:00
Michael Sproul
bac7c3fa54 v2.2.0 (#3139)
## Proposed Changes

Cut release v2.2.0 including proposer boost.

## Additional Info

I also updated the clippy lints for the imminent release of Rust 1.60, although LH v2.2.0 will continue to compile using Rust 1.58 (our MSRV).
2022-04-05 02:53:09 +00:00
Michael Sproul
4d0122444b Update and consolidate dependencies (#3136)
## Proposed Changes

I did some gardening 🌳 in our dependency tree:

- Remove duplicate versions of `warp` (git vs patch)
- Remove duplicate versions of lots of small deps: `cpufeatures`, `ethabi`, `ethereum-types`, `bitvec`, `nix`, `libsecp256k1`.
- Update MDBX (should resolve #3028). I tested and Lighthouse compiles on Windows 11 now.
- Restore `psutil` back to upstream
- Make some progress updating everything to rand 0.8. There are a few crates stuck on 0.7.

Hopefully this puts us on a better footing for future `cargo audit` issues, and improves compile times slightly.

## Additional Info

Some crates are held back by issues with `zeroize`. libp2p-noise depends on [`chacha20poly1305`](https://crates.io/crates/chacha20poly1305) which depends on zeroize < v1.5, and we can only have one version of zeroize because it's post 1.0 (see https://github.com/rust-lang/cargo/issues/6584). The latest version of `zeroize` is v1.5.4, which is used by the new versions of many other crates (e.g. `num-bigint-dig`). Once a new version of chacha20poly1305 is released we can update libp2p-noise and upgrade everything to the latest `zeroize` version.

I've also opened a PR to `blst` related to zeroize: https://github.com/supranational/blst/pull/111
2022-04-04 00:26:16 +00:00
Pawan Dhananjay
ab434bc075 Fix merge rpc length limits (#3133)
## Issue Addressed

N/A

## Proposed Changes

Fix the upper bound for blocks by root responses to be equal to the max merge block size instead of altair.
Further make the rpc response limits fork aware.
2022-04-04 00:26:15 +00:00
Michael Sproul
375e2b49b3 Conserve disk space by raising default SPRP (#3137)
## Proposed Changes

Increase the default `--slots-per-restore-point` to 8192 for a 4x reduction in freezer DB disk usage.

Existing nodes that use the previous default of 2048 will be left unchanged. Newly synced nodes (with or without checkpoint sync) will use the new 8192 default. 

Long-term we could do away with the freezer DB entirely for validator-only nodes, but this change is much simpler and grants us some extra space in the short term. We can also roll it out gradually across our nodes by purging databases one by one, while keeping the Ansible config the same.

## Additional Info

We ignore a change from 2048 to 8192 if the user hasn't set the 8192 explicitly. We fire a debug log in the case where we do ignore:

```
DEBG Ignoring slots-per-restore-point config in favour of on-disk value, on_disk: 2048, config: 8192
```
2022-04-01 07:16:25 +00:00