Commit Graph

4645 Commits

Author SHA1 Message Date
Paul Hauner
dadbd69eec Fix concurrency issue with oneshot_broadcast (#3596)
## Issue Addressed

NA

## Proposed Changes

Fixes an issue found during testing with #3595.

## Additional Info

NA
2022-09-21 10:52:14 +00:00
Paul Hauner
96692b8e43 Impl oneshot_broadcast for committee promises (#3595)
## Issue Addressed

NA

## Proposed Changes

Fixes an issue introduced in #3574 where I erroneously assumed that a `crossbeam_channel` multiple receiver queue was a *broadcast* queue. This is incorrect, each message will be received by *only one* receiver. The effect of this mistake is these logs:

```
Sep 20 06:56:17.001 INFO Synced                                  slot: 4736079, block: 0xaa8a…180d, epoch: 148002, finalized_epoch: 148000, finalized_root: 0x2775…47f2, exec_hash: 0x2ca5…ffde (verified), peers: 6, service: slot_notifier
Sep 20 06:56:23.237 ERRO Unable to validate attestation          error: CommitteeCacheWait(RecvError), peer_id: 16Uiu2HAm2Jnnj8868tb7hCta1rmkXUf5YjqUH1YPj35DCwNyeEzs, type: "aggregated", slot: Slot(4736047), beacon_block_root: 0x88d318534b1010e0ebd79aed60b6b6da1d70357d72b271c01adf55c2b46206c1
```

## Additional Info

NA
2022-09-21 01:01:50 +00:00
Paul Hauner
a95bcba2ab Avoid holding write-lock whilst waiting on shuffling cache promise (#3589)
## Issue Addressed

NA

## Proposed Changes

Fixes a bug which hogged the write-lock for the `shuffling_cache`.

## Additional Info

NA
2022-09-19 07:58:50 +00:00
Michael Sproul
507bb9dad4 Refined payload pruning (#3587)
## Proposed Changes

Improve the payload pruning feature in several ways:

- Payload pruning is now entirely optional. It is enabled by default but can be disabled with `--prune-payloads false`. The previous `--prune-payloads-on-startup` flag from #3565 is removed.
- Initial payload pruning on startup now runs in a background thread. This thread will always load the split state, which is a small fraction of its total work (up to ~300ms) and then backtrack from that state. This pruning process ran in 2m5s on one Prater node with good I/O and 16m on a node with slower I/O.
- To work with the optional payload pruning the database function `try_load_full_block` will now attempt to load execution payloads for finalized slots _if_ pruning is currently disabled. This gives users an opt-out for the extensive traffic between the CL and EL for reconstructing payloads.

## Additional Info

If the `prune-payloads` flag is toggled on and off then the on-startup check may not see any payloads to delete and fail to clean them up. In this case the `lighthouse db prune_payloads` command should be used to force a manual sweep of the database.
2022-09-19 07:58:49 +00:00
Michael Sproul
f2ac0738d8 Implement skip_randao_verification and blinded block rewards API (#3540)
## Issue Addressed

https://github.com/ethereum/beacon-APIs/pull/222

## Proposed Changes

Update Lighthouse's randao verification API to match the `beacon-APIs` spec. We implemented the API before spec stabilisation, and it changed slightly in the course of review.

Rather than a flag `verify_randao` taking a boolean value, the new API uses a `skip_randao_verification` flag which takes no argument. The new spec also requires the randao reveal to be present and equal to the point-at-infinity when `skip_randao_verification` is set.

I've also updated the `POST /lighthouse/analysis/block_rewards` API to take blinded blocks as input, as the execution payload is irrelevant and we may want to assess blocks produced by builders.

## Additional Info

This is technically a breaking change, but seeing as I suspect I'm the only one using these parameters/APIs, I think we're OK to include this in a patch release.
2022-09-19 07:58:48 +00:00
Michael Sproul
ca42ef2e5a Prune finalized execution payloads (#3565)
## Issue Addressed

Closes https://github.com/sigp/lighthouse/issues/3556

## Proposed Changes

Delete finalized execution payloads from the database in two places:

1. When running the finalization migration in `migrate_database`. We delete the finalized payloads between the last split point and the new updated split point. _If_ payloads are already pruned prior to this then this is sufficient to prune _all_ payloads as non-canonical payloads are already deleted by the head pruner, and all canonical payloads prior to the previous split will already have been pruned.
2. To address the fact that users will update to this code _after_ the merge on mainnet (and testnets), we need a one-off scan to delete the finalized payloads from the canonical chain. This is implemented in `try_prune_execution_payloads` which runs on startup and scans the chain back to the Bellatrix fork or the anchor slot (if checkpoint synced after Bellatrix). In the case where payloads are already pruned this check only imposes a single state load for the split state, which shouldn't be _too slow_. Even so, a flag `--prepare-payloads-on-startup=false` is provided to turn this off after it has run the first time, which provides faster start-up times.

There is also a new `lighthouse db prune_payloads` subcommand for users who prefer to run the pruning manually.

## Additional Info

The tests have been updated to not rely on finalized payloads in the database, instead using the `MockExecutionLayer` to reconstruct them. Additionally a check was added to `check_chain_dump` which asserts the non-existence or existence of payloads on disk depending on their slot.
2022-09-17 02:27:01 +00:00
Michael Sproul
5b2843c2cd Pre-allocate vectors in SSZ decoding (#3417)
## Issue Addressed

Fixes a potential regression in memory fragmentation identified by @paulhauner here: https://github.com/sigp/lighthouse/pull/3371#discussion_r931770045.

## Proposed Changes

Immediately allocate a vector with sufficient size to hold all decoded elements in SSZ decoding. The `size_hint` is derived from the range iterator here:

2983235650/consensus/ssz/src/decode/impls.rs (L489)

## Additional Info

I'd like to test this out on some infra for a substantial duration to see if it affects total fragmentation.
2022-09-16 11:54:17 +00:00
Paul Hauner
b0b606dabe Use SmallVec for TreeHash packed encoding (#3581)
## Issue Addressed

NA

## Proposed Changes

I've noticed that our block hashing times increase significantly after the merge. I did some flamegraph-ing and noticed that we're allocating a `Vec` for each byte of each execution payload transaction. This seems like unnecessary work and a bit of a fragmentation risk.

This PR switches to `SmallVec<[u8; 32]>` for the packed encoding of `TreeHash`. I believe this is a nice simple optimisation with no downside.

### Benchmarking

These numbers were computed using #3580 on my desktop (i7 hex-core). You can see a bit of noise in the numbers, that's probably just my computer doing other things. Generally I found this change takes the time from 10-11ms to 8-9ms. I can also see all the allocations disappear from flamegraph.

This is the block being benchmarked: https://beaconcha.in/slot/4704236

#### Before

```
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 980: 10.553003ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 981: 10.563737ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 982: 10.646352ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 983: 10.628532ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 984: 10.552112ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 985: 10.587778ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 986: 10.640526ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 987: 10.587243ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 988: 10.554748ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 989: 10.551111ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 990: 11.559031ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 991: 11.944827ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 992: 10.554308ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 993: 11.043397ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 994: 11.043315ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 995: 11.207711ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 996: 11.056246ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 997: 11.049706ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 998: 11.432449ms
[2022-09-15T21:44:19Z INFO  lcli::block_root] Run 999: 11.149617ms
```

#### After

```
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 980: 14.011653ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 981: 8.925314ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 982: 8.849563ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 983: 8.893689ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 984: 8.902964ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 985: 8.942067ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 986: 8.907088ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 987: 9.346101ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 988: 8.96142ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 989: 9.366437ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 990: 9.809334ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 991: 9.541561ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 992: 11.143518ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 993: 10.821181ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 994: 9.855973ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 995: 10.941006ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 996: 9.596155ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 997: 9.121739ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 998: 9.090019ms
[2022-09-15T21:41:49Z INFO  lcli::block_root] Run 999: 9.071885ms
```

## Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.
2022-09-16 08:54:06 +00:00
Paul Hauner
bde3c168e2 Add lcli block-root tool (#3580)
## Issue Addressed

NA

## Proposed Changes

Adds a simple tool for computing the block root of some block from a beacon-API or a file. This is useful for benchmarking.

## Additional Info

NA
2022-09-16 08:54:04 +00:00
Paul Hauner
2cd3e3a768 Avoid duplicate committee cache loads (#3574)
## Issue Addressed

NA

## Proposed Changes

I have observed scenarios on Goerli where Lighthouse was receiving attestations which reference the same, un-cached shuffling on multiple threads at the same time. Lighthouse was then loading the same state from database and determining the shuffling on multiple threads at the same time. This is unnecessary load on the disk and RAM.

This PR modifies the shuffling cache so that each entry can be either:

- A committee
- A promise for a committee (i.e., a `crossbeam_channel::Receiver`)

Now, in the scenario where we have thread A and thread B simultaneously requesting the same un-cached shuffling, we will have the following:

1. Thread A will take the write-lock on the shuffling cache, find that there's no cached committee and then create a "promise" (a `crossbeam_channel::Sender`) for a committee before dropping the write-lock.
1. Thread B will then be allowed to take the write-lock for the shuffling cache and find the promise created by thread A. It will block the current thread waiting for thread A to fulfill that promise.
1. Thread A will load the state from disk, obtain the shuffling, send it down the channel, insert the entry into the cache and then continue to verify the attestation.
1. Thread B will then receive the shuffling from the receiver, be un-blocked and then continue to verify the attestation.

In the case where thread A fails to generate the shuffling and drops the sender, the next time that specific shuffling is requested we will detect that the channel is disconnected and return a `None` entry for that shuffling. This will cause the shuffling to be re-calculated.

## Additional Info

NA
2022-09-16 08:54:03 +00:00
Paul Hauner
7d3948c8fe Add metric for re-org distance (#3566)
## Issue Addressed

NA

## Proposed Changes

Add a metric to track the re-org distance.

## Additional Info

NA
2022-09-13 17:19:27 +00:00
Michael Sproul
cd31e54b99 Bump axum deps (#3570)
## Issue Addressed

Fix a `cargo-audit` failure. We don't use `axum` for anything besides tests, but `cargo-audit` is failing due to this vulnerability in `axum-core`: https://rustsec.org/advisories/RUSTSEC-2022-0055
2022-09-13 01:57:47 +00:00
realbigsean
614d74a6d4 Fix builder gas limit docs (#3569)
## Issue Addressed

Make sure gas limit examples in our docs represent sane values.

Thanks @dankrad for raising this in discord.

## Additional Info

We could also consider logging warnings about whether the gas limits configured are sane. Prysm has an open issue for this: https://github.com/prysmaticlabs/prysm/issues/10810


Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-09-13 01:57:46 +00:00
Alessandro Tagliapietra
88a7e5a2ca Fix ganache test endpoint for ipv6 machines (#3563)
## Issue Addressed

#3562

## Proposed Changes

Change the fork endpoint from `localhost` to `127.0.0.1` to match the ganache default listening host.
This way it doesn't try (and fail) to connect to `::1` on IPV6 machines.

## Additional Info

First PR
2022-09-13 01:57:45 +00:00
tim gretler
98815516a1 Support histogram buckets (#3391)
## Issue Addressed

#3285

## Proposed Changes

Adds support for specifying histogram with buckets and adds new metric buckets for metrics mentioned in issue.

## Additional Info

Need some help for the buckets.


Co-authored-by: Michael Sproul <micsproul@gmail.com>
2022-09-13 01:57:44 +00:00
Rémy Roy
cfa518ab41 Use generic domain for community checkpoint sync example (#3560)
## Proposed Changes

Use a generic domain for community checkpoint sync example to meet the concern raised in https://github.com/sigp/lighthouse/pull/3558#discussion_r966720171
2022-09-10 01:35:11 +00:00
Nils Effinghausen
f682df51a1 fix description for BALANCES_CACHE_MISSES metric (#3545)
## Issue Addressed

fixes metric description


Co-authored-by: Nils Effinghausen <nils.effinghausen@t-systems.com>
2022-09-10 01:35:10 +00:00
Rémy Roy
60e9777db8 Add community checkpoint sync endpoints to book (#3558)
## Proposed Changes

Add a section on the new community checkpoint sync endpoints in the book. This should help stakers sync faster even without having to create an account.
2022-09-09 02:52:35 +00:00
realbigsean
d1a8d6cf91 Pin mev rs deps (#3557)
## Issue Addressed

We were unable to update lighthouse by running `cargo update` because some of the `mev-build-rs` deps weren't pinned. But `mev-build-rs` is now pinned here and includes it's own pinned commits for `ssz-rs` and `etheruem-consensus`



Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-09-08 23:46:03 +00:00
realbigsean
a9f075c3c0 Remove strict fee recipient (#3552)
## Issue Addressed

Resolves: #3550

Remove the `--strict-fee-recipient` flag. It will cause missed proposals prior to the bellatrix transition.

Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-09-08 23:46:02 +00:00
realbigsean
81d078bfc7 remove strict fee recipient docs (#3551)
## Issue Addressed

Related: #3550

Remove references to the `--strict-fee-recipient` flag in docs. The flag will cause missed proposals prior to the merge transition.



Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-09-08 00:06:25 +00:00
Alexander Cyon
419c53bf24 Add flag 'log-color' preserving color of log redirected to file. (#3538)
Add flag 'log-color' which preserves colors of log when stdout is redirected to a file.

This is my first lighthouse PR, please let me know if I'm not following contribution guidelines, I welcome meta-feeback (feedback on git commit messages, git branch naming, and the contents of the description of this PR.)

## Issue Addressed

Solves https://github.com/sigp/lighthouse/issues/3527

## Proposed Changes

Adding a flag which enables log color preserving when stdout is redirected to a file.

### Usage
Below I demonstrate current behaviour (without using the new flag) and the new behaviur (when using new flag).

In the screenshot below, I have to panes, one on top running `lighthouse` which redirects to file `~/Desktop/test.log` and one pane in the bottom which runs `tail -f ~/Desktop/test.log`.

#### Current behaviour
```sh
lighthouse --network prater vc |& tee -a ~/Desktop/test.log
```

**Result is no colors**

<img width="1624" alt="current" src="https://user-images.githubusercontent.com/864410/188258226-bfcf8271-4c9e-474c-848e-ac92a60df25c.png">


#### New behaviour
```sh
lighthouse --network prater vc --log-color |& tee -a ~/Desktop/test.log
```

**Result is colors** 🔴🟢🔵🟡

<img width="1624" alt="new" src="https://user-images.githubusercontent.com/864410/188258223-7d9ecf09-92c8-4cba-8f24-bd4d88fc0353.png">

## Additional Info

I chose American spelling of "color" instead of Brittish "colour' since that was aligned with `slog`'s API - method`force_color()`, let me know if you prefer spelling "colour" instead. I also chose to let it be an arg not taking any argument, just like `logfile-compress` flag, rather than having to write `--log-color true`.
2022-09-06 05:58:27 +00:00
ZZ
528e150e53 Update graffiti.md (#3537)
fix typo
2022-09-05 08:29:02 +00:00
Michael Sproul
9a7f7f1c1e Configurable monitoring endpoint frequency (#3530)
## Issue Addressed

Closes #3514

## Proposed Changes

- Change default monitoring endpoint frequency to 120 seconds to fit with 30k requests/month limit.
- Allow configuration of the monitoring endpoint frequency using `--monitoring-endpoint-frequency N` where `N` is a value in seconds.
2022-09-05 08:29:00 +00:00
realbigsean
177aef8f1e Builder profit threshold flag (#3534)
## Issue Addressed

Resolves https://github.com/sigp/lighthouse/issues/3517

## Proposed Changes

Adds a `--builder-profit-threshold <wei value>` flag to the BN. If an external payload's value field is less than this value, the local payload will be used. The value of the local payload will not be checked (it can't really be checked until the engine API is updated to support this).


Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-09-05 04:50:49 +00:00
omahs
95c56630a6 Fixing a few typos / documentation (#3531)
Fixing a few typos in the documentation
2022-09-05 04:50:48 +00:00
realbigsean
cae40731a2 Strict count unrealized (#3522)
## Issue Addressed

Add a flag that can increase count unrealized strictness, defaults to false

## Proposed Changes

Please list or describe the changes introduced by this PR.

## Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: sean <seananderson33@gmail.com>
2022-09-05 04:50:47 +00:00
MaboroshiChan
f13dd04f42 Add timeout for --checkpoint-sync-url (#3521)
## Issue Addressed

[Have --checkpoint-sync-url timeout](https://github.com/sigp/lighthouse/issues/3478)

## Proposed Changes

I added a parameter for `get_bytes_opt_accept_header<U: IntoUrl>` which accept a timeout duration, and modified the body of `get_beacon_blocks_ssz` and `get_debug_beacon_states_ssz` to pass corresponding timeout durations.
2022-09-05 04:50:46 +00:00
Mac L
80359d8ddb Fix attestation performance API InvalidValidatorIndex error (#3503)
## Issue Addressed

When requesting an index which is not active during `start_epoch`, Lighthouse returns: 
```
curl "http://localhost:5052/lighthouse/analysis/attestation_performance/999999999?start_epoch=100000&end_epoch=100000"
```
```json
{
  "code": 500,
  "message": "INTERNAL_SERVER_ERROR: ParticipationCache(InvalidValidatorIndex(999999999))",
  "stacktraces": []
}
```

This error occurs even when the index in question becomes active before `end_epoch` which is undesirable as it can prevent larger queries from completing.

## Proposed Changes

In the event the index is out-of-bounds (has not yet been activated), simply return all fields as `false`:

```
-> curl "http://localhost:5052/lighthouse/analysis/attestation_performance/999999999?start_epoch=100000&end_epoch=100000"
```
```json
[
  {
    "index": 999999999,
    "epochs": {
      "100000": {
        "active": false,
        "head": false,
        "target": false,
        "source": false
      }
    }
  }
]
```

By doing this, we cover the case where a validator becomes active sometime between `start_epoch` and `end_epoch`.

## Additional Info

Note that this error only occurs for epochs after the Altair hard fork.
2022-09-05 04:50:45 +00:00
Divma
473abc14ca Subscribe to subnets only when needed (#3419)
## Issue Addressed

We currently subscribe to attestation subnets as soon as the subscription arrives (one epoch in advance), this makes it so that subscriptions for future slots are scheduled instead of done immediately. 

## Proposed Changes

- Schedule subscriptions to subnets for future slots.
- Finish removing hashmap_delay, in favor of [delay_map](https://github.com/AgeManning/delay_map). This was the only remaining service to do this.
- Subscriptions for past slots are rejected, before we would subscribe for one slot.
- Add a new test for subscriptions that are not consecutive.

## Additional Info

This is also an effort in making the code easier to understand
2022-09-05 00:22:48 +00:00
Paul Hauner
aa022f4685 v3.1.0 (#3525)
## Issue Addressed

NA

## Proposed Changes

- Bump versions

## Additional Info

- ~~Blocked on #3508~~
- ~~Blocked on #3526~~
- ~~Requires additional testing.~~
- Expected release date is 2022-09-01
2022-08-31 22:21:55 +00:00
Pawan Dhananjay
c5785887a9 Log fee recipients in VC (#3526)
## Issue Addressed

Resolves #3524 

## Proposed Changes

Log fee recipient in the `Validator exists in beacon chain` log. Logging in the BN already happens here 18c61a5e8b/beacon_node/beacon_chain/src/beacon_chain.rs (L3858-L3865)

I also think it's good practice to encourage users to set the fee recipient in the VC rather than the BN because of issues mentioned here https://github.com/sigp/lighthouse/issues/3432

Some example logs from prater:
```
Aug 30 03:47:09.922 INFO Validator exists in beacon chain        fee_recipient: 0xab97_ad88, validator_index: 213615, pubkey: 0xb542b69ba14ddbaf717ca1762ece63a4804c08d38a1aadf156ae718d1545942e86763a1604f5065d4faa550b7259d651, service: duties


Aug 30 03:48:05.505 INFO Validator exists in beacon chain        fee_recipient: Fee recipient for validator not set in validator_definitions.yml or provided with the `--suggested-fee-recipient flag`, validator_index: 210710, pubkey: 0xad5d67cc7f990590c7b3fa41d593c4cf12d9ead894be2311fbb3e5c733d8c1b909e9d47af60ea3480fb6b37946c35390, service: duties
```


Co-authored-by: Paul Hauner <paul@paulhauner.com>
2022-08-30 05:47:32 +00:00
Paul Hauner
661307dce1 Separate committee subscriptions queue (#3508)
## Issue Addressed

NA

## Proposed Changes

As we've seen on Prater, there seems to be a correlation between these messages

```
WARN Not enough time for a discovery search  subnet_id: ExactSubnet { subnet_id: SubnetId(19), slot: Slot(3742336) }, service: attestation_service
```

... and nodes falling 20-30 slots behind the head for short periods. These nodes are running ~20k Prater validators.

After running some metrics, I can see that the `network_recv` channel is processing ~250k `AttestationSubscribe` messages per minute. It occurred to me that perhaps the `AttestationSubscribe` messages are "washing out" the `SendRequest` and `SendResponse` messages. In this PR I separate the `AttestationSubscribe` and `SyncCommitteeSubscribe` messages into their own queue so the `tokio::select!` in the `NetworkService` can still process the other messages in the `network_recv` channel without necessarily having to clear all the subscription messages first.

~~I've also added filter to the HTTP API to prevent duplicate subscriptions going to the network service.~~

## Additional Info

- Currently being tested on Prater
2022-08-30 05:47:31 +00:00
realbigsean
ebd0e0e2d9 Docker builds in GitHub actions (#3523)
## Issue Addressed

I think the antithesis is failing due to an OOM which may be resolved by updating the ubuntu image it runs on. The lcli build looks like it's failing because the image lacks the `libclang` dependency



Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-08-29 18:31:27 +00:00
Michael Sproul
7a50684741 Harden slot notifier against clock drift (#3519)
## Issue Addressed

Partly resolves #3518

## Proposed Changes

Change the slot notifier to use `duration_to_next_slot` rather than an interval timer. This makes it robust against underlying clock changes.
2022-08-29 14:34:43 +00:00
Paul Hauner
1a833ecc17 Add more logging for invalid payloads (#3515)
## Issue Addressed

NA

## Proposed Changes

Adds more `debug` logging to help troubleshoot invalid execution payload blocks. I was doing some of this recently and found it to be challenging.

With this PR we should be able to grep `Invalid execution payload` and get one-liners that will show the block, slot and details about the proposer.

I also changed the log in `process_invalid_execution_payload` since it was a little misleading; the `block_root` wasn't necessary the block which had an invalid payload.

## Additional Info

NA
2022-08-29 14:34:42 +00:00
Paul Hauner
8609cced0e Reset payload statuses when resuming fork choice (#3498)
## Issue Addressed

NA

## Proposed Changes

This PR is motivated by a recent consensus failure in Geth where it returned `INVALID` for an `VALID` block. Without this PR, the only way to recover is by re-syncing Lighthouse. Whilst ELs "shouldn't have consensus failures", in reality it's something that we can expect from time to time due to the complex nature of Ethereum. Being able to recover easily will help the network recover and EL devs to troubleshoot.

The risk introduced with this PR is that genuinely INVALID payloads get a "second chance" at being imported. I believe the DoS risk here is negligible since LH needs to be restarted in order to re-process the payload. Furthermore, there's no reason to think that a well-performing EL will accept a truly invalid payload the second-time-around.

## Additional Info

This implementation has the following intricacies:

1. Instead of just resetting *invalid* payloads to optimistic, we'll also reset *valid* payloads. This is an artifact of our existing implementation.
1. We will only reset payload statuses when we detect an invalid payload present in `proto_array`
    - This helps save us from forgetting that all our blocks are valid in the "best case scenario" where there are no invalid blocks.
1. If we fail to revert the payload statuses we'll log a `CRIT` and just continue with a `proto_array` that *does not* have reverted payload statuses.
    - The code to revert statuses needs to deal with balances and proposer-boost, so it's a failure point. This is a defensive measure to avoid introducing new show-stopping bugs to LH.
2022-08-29 14:34:41 +00:00
realbigsean
2ce86a0830 Validator registration request failures do not cause us to mark BNs offline (#3488)
## Issue Addressed

Relates to https://github.com/sigp/lighthouse/issues/3416

## Proposed Changes

- Add an `OfflineOnFailure` enum to the `first_success` method for querying beacon nodes so that a val registration request failure from the BN -> builder does not result in the BN being marked offline. This seems important because these failures could be coming directly from a connected relay and actually have no bearing on BN health.  Other messages that are sent to a relay have a local fallback so shouldn't result in errors 

- Downgrade the following log to a `WARN`

```
ERRO Unable to publish validator registrations to the builder network, error: All endpoints failed https://BN_B => RequestFailed(ServerMessage(ErrorMessage { code: 500, message: "UNHANDLED_ERROR: BuilderMissing", stacktraces: [] })), https://XXXX/ => Unavailable(Offline), [omitted]
```

## Additional Info

I think this change at least improves the UX of having a VC connected to some builder and some non-builder beacon nodes. I think we need to balance potentially alerting users that there is a BN <> VC misconfiguration and also allowing this type of fallback to work. 

If we want to fully support this type of configuration we may want to consider adding a flag `--builder-beacon-nodes` and track whether a VC should be making builder queries on a per-beacon node basis.  But I think the changes in this PR are independent of that type of extension.

PS: Sorry for the big diff here, it's mostly formatting changes after I added a new arg to a bunch of methods calls.




Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-08-29 11:35:59 +00:00
Michael Sproul
66eca1a882 Refactor op pool for speed and correctness (#3312)
## Proposed Changes

This PR has two aims: to speed up attestation packing in the op pool, and to fix bugs in the verification of attester slashings, proposer slashings and voluntary exits. The changes are bundled into a single database schema upgrade (v12).

Attestation packing is sped up by removing several inefficiencies: 

- No more recalculation of `attesting_indices` during packing.
- No (unnecessary) examination of the `ParticipationFlags`: a bitfield suffices. See `RewardCache`.
- No re-checking of attestation validity during packing: the `AttestationMap` provides attestations which are "correct by construction" (I have checked this using Hydra).
- No SSZ re-serialization for the clunky `AttestationId` type (it can be removed in a future release).

So far the speed-up seems to be roughly 2-10x, from 500ms down to 50-100ms.

Verification of attester slashings, proposer slashings and voluntary exits is fixed by:

- Tracking the `ForkVersion`s that were used to verify each message inside the `SigVerifiedOp`. This allows us to quickly re-verify that they match the head state's opinion of what the `ForkVersion` should be at the epoch(s) relevant to the message.
- Storing the `SigVerifiedOp` on disk rather than the raw operation. This allows us to continue track the fork versions after a reboot.

This is mostly contained in this commit 52bb1840ae5c4356a8fc3a51e5df23ed65ed2c7f.

## Additional Info

The schema upgrade uses the justified state to re-verify attestations and compute `attesting_indices` for them. It will drop any attestations that fail to verify, by the logic that attestations are most valuable in the few slots after they're observed, and are probably stale and useless by the time a node restarts. Exits and proposer slashings and similarly re-verified to obtain `SigVerifiedOp`s.

This PR contains a runtime killswitch `--paranoid-block-proposal` which opts out of all the optimisations in favour of closely verifying every included message. Although I'm quite sure that the optimisations are correct this flag could be useful in the event of an unforeseen emergency.

Finally, you might notice that the `RewardCache` appears quite useless in its current form because it is only updated on the hot-path immediately before proposal. My hope is that in future we can shift calls to `RewardCache::update` into the background, e.g. while performing the state advance. It is also forward-looking to `tree-states` compatibility, where iterating and indexing `state.{previous,current}_epoch_participation` is expensive and needs to be minimised.
2022-08-29 09:10:26 +00:00
Michael Sproul
1c9ec42dcb More merge doc updates (#3509)
## Proposed Changes

Address a few shortcomings of the book noticed by users:

- Remove description of redundant execution nodes
- Use an Infura eth1 node rather than an eth2 node in the merge migration example
- Add an example of the fee recipient address format (we support addresses without the 0x prefix, but 0x prefixed feels more canonical).
- Clarify that Windows support is no longer beta
- Add a link to the MSRV to the build-from-source instructions
2022-08-26 21:47:50 +00:00
Paul Hauner
c64e17bb81 Return readonly: false for local keystores (#3490)
## Issue Addressed

NA

## Proposed Changes

Indicate that local keystores are `readonly: Some(false)` rather than `None` via the `/eth/v1/keystores` method on the VC API.

I'll mark this as backwards-incompat so we remember to mention it in the release notes. There aren't any type-level incompatibilities here, just a change in how Lighthouse responds to responses.

## Additional Info

- Blocked on #3464
2022-08-24 23:35:00 +00:00
Paul Hauner
ebd661783e Enable block_lookup_failed EF test (#3489)
## Issue Addressed

Resolves #3448

## Proposed Changes

Removes a known failure that wasn't actually a known failure. The tests declare this block invalid and we refuse to import it due to `ExecutionPayloadError(UnverifiedNonOptimisticCandidate)`.

This is correct since there is only one "eth1" block included in this test and two are required to trigger the merge (pre- and post-TTD blocks). It is slot 1 (tick = 12s) when this block is imported so the import must be prevented by `SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY`.

I'm not sure where I got the idea in #3448 that this test needed retrospective checking, that seems like a false assumption in hindsight.

## Additional Info

- Blocked on #3464
2022-08-24 23:34:59 +00:00
realbigsean
cb132c622d don't register exited or slashed validators with the builder api (#3473)
## Issue Addressed

#3465

## Proposed Changes

Filter out any validator registrations for validators that are not `active` or `pending`.  I'm adding this filtering the beacon node because all the information is readily available there. In other parts of the VC we are usually sending per-validator requests based on duties from the BN. And duties will only be provided for active validators so we don't have this type of filtering elsewhere in the VC.



Co-authored-by: realbigsean <sean@sigmaprime.io>
2022-08-24 23:34:58 +00:00
Divma
8c69d57c2c Pause sync when EE is offline (#3428)
## Issue Addressed

#3032

## Proposed Changes

Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html) which ensures only the last value is in the receiver channel. This way we don't need to worry about message order etc.
- Sync waits for state changes concurrently with normal messages.

#### Pause Sync
Sync has four components, pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving it access only if the ee is responsive. I find this the simplest and most powerful approach since we don't need to deal with new sync states and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among those. Another advantage of passive pause vs active pause is that we can still keep track of active advertised chain segments so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pause range does is not sending batches for processing, resume makes all chains that are holding read-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.

## Additional Info

**QUESTION**: Originally I made this to notify and change on synced state, but @pawanjay176 on talks with @paulhauner concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up to date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review if the extra check can be removed

Next gen of #3094

Will work best with #3439 

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
2022-08-24 23:34:56 +00:00
Michael Sproul
aab4a8d2f2 Update docs for mainnet merge release (#3494)
## Proposed Changes

Update the merge migration docs to encourage updating mainnet configs _now_!

The docs are also updated to recommend _against_ `--suggested-fee-recipient` on the beacon node (https://github.com/sigp/lighthouse/issues/3432).

Additionally the `--help` for the CLI is updated to match with a few small semantic changes:

- `--execution-jwt` is no longer allowed without `--execution-endpoint`. We've ended up without a default for `--execution-endpoint`, so I think that's fine.
- The flags related to the JWT are only allowed if `--execution-jwt` is provided.
2022-08-23 03:50:58 +00:00
Paul Hauner
18c61a5e8b v3.0.0 (#3464)
## Issue Addressed

NA

## Proposed Changes

Bump versions to v3.0.0

## Additional Info

- ~~Blocked on #3439~~
- ~~Blocked on #3459~~
- ~~Blocked on #3463~~
- ~~Blocked on #3462~~
- ~~Requires further testing~~


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2022-08-22 03:43:08 +00:00
Paul Hauner
931153885c Run per-slot fork choice at a further distance from the head (#3487)
## Issue Addressed

NA

## Proposed Changes

Run fork choice when the head is 256 slots from the wall-clock slot, rather than 4.

The reason we don't *always* run FC is so that it doesn't slow us down during sync. As the comments state, setting the value to 256 means that we'd only have one interrupting fork-choice call if we were syncing at 20 slots/sec.

## Additional Info

NA
2022-08-19 04:27:24 +00:00
Paul Hauner
df358b864d Add metrics for EE PayloadStatus returns (#3486)
## Issue Addressed

NA

## Proposed Changes

Adds some metrics so we can track payload status responses from the EE. I think this will be useful for troubleshooting and alerting.

I also bumped the `BecaonChain::per_slot_task` to `debug` since it doesn't seem too noisy and would have helped us with some things we were debugging in the past.

## Additional Info

NA
2022-08-19 04:27:23 +00:00
Paul Hauner
043fa2153e Revise EE peer penalites (#3485)
## Issue Addressed

NA

## Proposed Changes

Don't penalize peers for errors that might be caused by an honest optimistic node.

## Additional Info

NA
2022-08-19 04:27:22 +00:00
Paul Hauner
a0605c4ee6 Bump EF tests to v1.2.0 rc.3 (#3483)
## Issue Addressed

NA

## Proposed Changes

Bumps test vectors and ignores another weird MacOS file.

## Additional Info

NA
2022-08-19 04:27:21 +00:00