It is a well-known fact that IP addresses for beacon nodes used by specific validators can be de-anonymized. There is an assumed risk that a malicious user may attempt to DOS validators when producing blocks to prevent chain growth/liveness.
Although there are a number of ideas put forward to address this, there a few simple approaches we can take to mitigate this risk.
Currently, a Lighthouse user is able to set a number of beacon-nodes that their validator client can connect to. If one beacon node is taken offline, it can fallback to another. Different beacon nodes can use VPNs or rotate IPs in order to mask their IPs.
This PR provides an additional setup option which further mitigates attacks of this kind.
This PR introduces a CLI flag --proposer-only to the beacon node. Setting this flag will configure the beacon node to run with minimal peers and crucially will not subscribe to subnets or sync committees. Therefore nodes of this kind should not be identified as nodes connected to validators of any kind.
It also introduces a CLI flag --proposer-nodes to the validator client. Users can then provide a number of beacon nodes (which may or may not run the --proposer-only flag) that the Validator client will use for block production and propagation only. If these nodes fail, the validator client will fallback to the default list of beacon nodes.
Users are then able to set up a number of beacon nodes dedicated to block proposals (which are unlikely to be identified as validator nodes) and point their validator clients to produce blocks on these nodes and attest on other beacon nodes. An attack attempting to prevent liveness on the eth2 network would then need to preemptively find and attack the proposer nodes which is significantly more difficult than the default setup.
This is a follow on from: #3328
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
There was a [`VecDeque` bug](https://github.com/rust-lang/rust/issues/108453) in some recent versions of the Rust standard library (1.67.0 & 1.67.1) that could cause Lighthouse to panic (reported by `@Sea Monkey` on discord). See full logs below.
The issue was likely introduced in Rust 1.67.0 and [fixed](https://github.com/rust-lang/rust/pull/108475) in 1.68, and we were able to reproduce the panic ourselves using [@michaelsproul's fuzz tests](https://github.com/michaelsproul/lighthouse/blob/fuzz-lru-time-cache/beacon_node/lighthouse_network/src/peer_manager/fuzz.rs#L111) on both Rust 1.67.0 and 1.67.1.
Users that uses our Docker images or binaries are unlikely affected, as our Docker images were built with `1.66`, and latest binaries were built with latest stable (`1.68.2`). It likely impacts user that builds from source using Rust versions 1.67.x.
## Proposed Changes
Bump Rust version (MSRV) to latest stable `1.68.2`.
## Additional Info
From `@Sea Monkey` on Lighthouse Discord:
> Crash on goerli using `unstable` `dd124b2d6804d02e4e221f29387a56775acccd08`
```
thread 'tokio-runtime-worker' panicked at 'Key must exist', /mnt/goerli/goerli/lighthouse/common/lru_cache/src/time.rs:68:28
stack backtrace:
Apr 15 09:37:36.993 WARN Peer sent invalid block in single block lookup, peer_id: 16Uiu2HAm6ZuyJpVpR6y51X4Enbp8EhRBqGycQsDMPX7e5XfPYznG, error: WouldRevertFinalizedSlot { block_slot: Slot(5420212), finalized_slot: Slot(5420224) }, root: 0x10f6…3165, service: sync
0: rust_begin_unwind
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
2: core::panicking::panic_display
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:135:5
3: core::panicking::panic_str
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:119:5
4: core::option::expect_failed
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/option.rs:1879:5
5: lru_cache::time::LRUTimeCache<Key>::raw_remove
6: lighthouse_network::peer_manager::PeerManager<TSpec>::handle_ban_operation
7: lighthouse_network::peer_manager::PeerManager<TSpec>::handle_score_action
8: lighthouse_network::peer_manager::PeerManager<TSpec>::report_peer
9: network::service::NetworkService<T>::spawn_service::{{closure}}
10: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
11: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
12: <futures_util::future::future::flatten::Flatten<Fut,<Fut as core::future::future::Future>::Output> as core::future::future::Future>::poll
13: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
14: tokio::runtime::task::core::Core<T,S>::poll
15: tokio::runtime::task::harness::Harness<T,S>::poll
16: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
17: tokio::runtime::scheduler::multi_thread::worker::Context::run
18: tokio::macros::scoped_tls::ScopedKey<T>::set
19: tokio::runtime::scheduler::multi_thread::worker::run
20: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
21: tokio::runtime::task::core::Core<T,S>::poll
22: tokio::runtime::task::harness::Harness<T,S>::poll
23: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Apr 15 09:37:37.069 INFO Saved DHT state service: network
Apr 15 09:37:37.070 INFO Network service shutdown service: network
Apr 15 09:37:37.132 CRIT Task panic. This is a bug! advice: Please check above for a backtrace and notify the developers, message: <none>, task_name: network
Apr 15 09:37:37.132 INFO Internal shutdown received reason: Panic (fatal error)
Apr 15 09:37:37.133 INFO Shutting down.. reason: Failure("Panic (fatal error)")
Apr 15 09:37:37.135 WARN Unable to free worker error: channel closed, msg: did not free worker, shutdown may be underway
Apr 15 09:37:39.350 INFO Saved beacon chain to disk service: beacon
Panic (fatal error)
```
## Proposed Changes
Builds on #4028 to use the new payload bodies methods in the HTTP API as well.
## Caveats
The payloads by range method only works for the finalized chain, so it can't be used in the execution engine integration tests because we try to reconstruct unfinalized payloads there.
## Issue Addressed
Attempt to fix#3812
## Proposed Changes
Move web3signer binary download script out of build script to avoid downloading unless necessary. If this works, it should also reduce the build time for all jobs that runs compilation.
## Issue Addressed
#3212
## Proposed Changes
- Introduce a new `rate_limiting_backfill_queue` - any new inbound backfill work events gets immediately sent to this FIFO queue **without any processing**
- Spawn a `backfill_scheduler` routine that pops a backfill event from the FIFO queue at specified intervals (currently halfway through a slot, or at 6s after slot start for 12s slots) and sends the event to `BeaconProcessor` via a `scheduled_backfill_work_tx` channel
- This channel gets polled last in the `InboundEvents`, and work event received is wrapped in a `InboundEvent::ScheduledBackfillWork` enum variant, which gets processed immediately or queued by the `BeaconProcessor` (existing logic applies from here)
Diagram comparing backfill processing with / without rate-limiting:
https://github.com/sigp/lighthouse/issues/3212#issuecomment-1386249922
See this comment for @paulhauner's explanation and solution: https://github.com/sigp/lighthouse/issues/3212#issuecomment-1384674956
## Additional Info
I've compared this branch (with backfill processing rate limited to to 1 and 3 batches per slot) against the latest stable version. The CPU usage during backfill sync is reduced by ~5% - 20%, more details on this page:
https://hackmd.io/@jimmygchen/SJuVpJL3j
The above testing is done on Goerli (as I don't currently have hardware for Mainnet), I'm guessing the differences are likely to be bigger on mainnet due to block size.
### TODOs
- [x] Experiment with processing multiple batches per slot. (need to think about how to do this for different slot durations)
- [x] Add option to disable rate-limiting, enabed by default.
- [x] (No longer required now we're reusing the reprocessing queue) Complete the `backfill_scheduler` task when backfill sync is completed or not required
## Issue Addressed
NA
## Proposed Changes
- Implements https://github.com/ethereum/consensus-specs/pull/3290/
- Bumps `ef-tests` to [v1.3.0-rc.4](https://github.com/ethereum/consensus-spec-tests/releases/tag/v1.3.0-rc.4).
The `CountRealizedFull` concept has been removed and the `--count-unrealized-full` and `--count-unrealized` BN flags now do nothing but log a `WARN` when used.
## Database Migration Debt
This PR removes the `best_justified_checkpoint` from fork choice. This field is persisted on-disk and the correct way to go about this would be to make a DB migration to remove the field. However, in this PR I've simply stubbed out the value with a junk value. I've taken this approach because if we're going to do a DB migration I'd love to remove the `Option`s around the justified and finalized checkpoints on `ProtoNode` whilst we're at it. Those options were added in #2822 which was included in Lighthouse v2.1.0. The options were only put there to handle the migration and they've been set to `Some` ever since v2.1.0. There's no reason to keep them as options anymore.
I started adding the DB migration to this branch but I started to feel like I was bloating this rather critical PR with nice-to-haves. I've kept the partially-complete migration [over in my repo](https://github.com/paulhauner/lighthouse/tree/fc-pr-18-migration) so we can pick it up after this PR is merged.
## Issue Addressed
Add support for ipv6 and dual stack in lighthouse.
## Proposed Changes
From an user perspective, now setting an ipv6 address, optionally configuring the ports should feel exactly the same as using an ipv4 address. If listening over both ipv4 and ipv6 then the user needs to:
- use the `--listen-address` two times (ipv4 and ipv6 addresses)
- `--port6` becomes then required
- `--discovery-port6` can now be used to additionally configure the ipv6 udp port
### Rough list of code changes
- Discovery:
- Table filter and ip mode set to match the listening config.
- Ipv6 address, tcp port and udp port set in the ENR builder
- Reported addresses now check which tcp port to give to libp2p
- LH Network Service:
- Can listen over Ipv6, Ipv4, or both. This uses two sockets. Using mapped addresses is disabled from libp2p and it's the most compatible option.
- NetworkGlobals:
- No longer stores udp port since was not used at all. Instead, stores the Ipv4 and Ipv6 TCP ports.
- NetworkConfig:
- Update names to make it clear that previous udp and tcp ports in ENR were Ipv4
- Add fields to configure Ipv6 udp and tcp ports in the ENR
- Include advertised enr Ipv6 address.
- Add type to model Listening address that's either Ipv4, Ipv6 or both. A listening address includes the ip, udp port and tcp port.
- UPnP:
- Kept only for ipv4
- Cli flags:
- `--listen-addresses` now can take up to two values
- `--port` will apply to ipv4 or ipv6 if only one listening address is given. If two listening addresses are given it will apply only to Ipv4.
- `--port6` New flag required when listening over ipv4 and ipv6 that applies exclusively to Ipv6.
- `--discovery-port` will now apply to ipv4 and ipv6 if only one listening address is given.
- `--discovery-port6` New flag to configure the individual udp port of ipv6 if listening over both ipv4 and ipv6.
- `--enr-udp-port` Updated docs to specify that it only applies to ipv4. This is an old behaviour.
- `--enr-udp6-port` Added to configure the enr udp6 field.
- `--enr-tcp-port` Updated docs to specify that it only applies to ipv4. This is an old behaviour.
- `--enr-tcp6-port` Added to configure the enr tcp6 field.
- `--enr-addresses` now can take two values.
- `--enr-match` updated behaviour.
- Common:
- rename `unused_port` functions to specify that they are over ipv4.
- add functions to get unused ports over ipv6.
- Testing binaries
- Updated code to reflect network config changes and unused_port changes.
## Additional Info
TODOs:
- use two sockets in discovery. I'll get back to this and it's on https://github.com/sigp/discv5/pull/160
- lcli allow listening over two sockets in generate_bootnodes_enr
- add at least one smoke flag for ipv6 (I have tested this and works for me)
- update the book
## Proposed Changes
Two tiny updates to satisfy Clippy 1.68
Plus refactoring of the `http_api` into less complex types so the compiler can chew and digest them more easily.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
As discovered in #4034, Lighthouse is not accepting `latest_valid_hash == None` in an `INVALID` response to `newPayload`. The `null`/`None` response *was* illegal at one point, however it was added in https://github.com/ethereum/execution-apis/pull/254.
This PR brings Lighthouse in line with the standard and should fix the root cause of what #4034 patched around.
## Additional Info
NA
## Proposed Changes
Remove built-in support for Ropsten and Kiln via the `--network` flag. Both testnets are long dead and deprecated.
This shaves about 30MiB off the binary size, from 135MiB to 103MiB (maxperf), or 165MiB to 135MiB (release).
## Issue Addressed
Cleans up all the remnants of 4844 in capella. This makes sure when 4844 is reviewed there is nothing we are missing because it got included here
## Proposed Changes
drop a bomb on every 4844 thing
## Additional Info
Merge process I did (locally) is as follows:
- squash merge to produce one commit
- in new branch off unstable with the squashed commit create a `git revert HEAD` commit
- merge that new branch onto 4844 with `--strategy ours`
- compare local 4844 to remote 4844 and make sure the diff is empty
- enjoy
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
NA
## Proposed Changes
Updates our `ef_tests` to use: https://github.com/ethereum/consensus-specs/releases/tag/v1.3.0-rc.3
This required:
- Skipping a `merkle_proof_validity` test (see #4022)
- Account for the `eip4844` tests changing name to `deneb`
- My IDE did some Python linting during this change. It seemed simple and nice so I left it there.
## Additional Info
NA
## Proposed Changes
* Bump Go from 1.17 to 1.20. The latest Geth release v1.11.0 requires 1.18 minimum.
* Prevent a cache miss during payload building by using the right fee recipient. This prevents Geth v1.11.0 from building a block with 0 transactions. The payload building mechanism is overhauled in the new Geth to improve the payload every 2s, and the tests were failing because we were falling back on a `getPayload` call with no lookahead due to `get_payload_id` cache miss caused by the mismatched fee recipient. Alternatively we could hack the tests to send `proposer_preparation_data`, but I think the static fee recipient is simpler for now.
* Add support for optionally enabling Lighthouse logs in the integration tests. Enable using `cargo run --release --features logging/test_logger`. This was very useful for debugging.
## Proposed Changes
Decouple the stdout and logfile formats by adding the `--logfile-format` CLI flag.
This behaves identically to the existing `--log-format` flag, but instead will only affect the logs written to the logfile.
The `--log-format` flag will no longer have any effect on the contents of the logfile.
## Additional Info
This avoids being a breaking change by causing `logfile-format` to default to the value of `--log-format` if it is not provided.
This means that users who were previously relying on being able to use a JSON formatted logfile will be able to continue to use `--log-format JSON`.
Users who want to use JSON on stdout and default logs in the logfile, will need to pass the following flags: `--log-format JSON --logfile-format DEFAULT`
* add historical summaries
* fix tree hash caching, disable the sanity slots test with fake crypto
* add ssz static HistoricalSummary
* only store historical summaries after capella
* Teach `UpdatePattern` about Capella
* Tidy EF tests
* Clippy
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
Recent discussions with other client devs about optimistic sync have revealed a conceptual issue with the optimisation implemented in #3738. In designing that feature I failed to consider that the execution node checks the `blockHash` of the execution payload before responding with `SYNCING`, and that omitting this check entirely results in a degradation of the full node's validation. A node omitting the `blockHash` checks could be tricked by a supermajority of validators into following an invalid chain, something which is ordinarily impossible.
## Proposed Changes
I've added verification of the `payload.block_hash` in Lighthouse. In case of failure we log a warning and fall back to verifying the payload with the execution client.
I've used our existing dependency on `ethers_core` for RLP support, and a new dependency on Parity's `triehash` crate for the Merkle patricia trie. Although the `triehash` crate is currently unmaintained it seems like our best option at the moment (it is also used by Reth, and requires vastly less boilerplate than Parity's generic `trie-root` library).
Block hash verification is pretty quick, about 500us per block on my machine (mainnet).
The optimistic finalized sync feature can be disabled using `--disable-optimistic-finalized-sync` which forces full verification with the EL.
## Additional Info
This PR also introduces a new dependency on our [`metastruct`](https://github.com/sigp/metastruct) library, which was perfectly suited to the RLP serialization method. There will likely be changes as `metastruct` grows, but I think this is a good way to start dogfooding it.
I took inspiration from some Parity and Reth code while writing this, and have preserved the relevant license headers on the files containing code that was copied and modified.
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/2327
## Proposed Changes
This is an extension of some ideas I implemented while working on `tree-states`:
- Cache the indexed attestations from blocks in the `ConsensusContext`. Previously we were re-computing them 3-4 times over.
- Clean up `import_block` by splitting each part into `import_block_XXX`.
- Move some stuff off hot paths, specifically:
- Relocate non-essential tasks that were running between receiving the payload verification status and priming the early attester cache. These tasks are moved after the cache priming:
- Attestation observation
- Validator monitor updates
- Slasher updates
- Updating the shuffling cache
- Fork choice attestation observation now happens at the end of block verification in parallel with payload verification (this seems to save 5-10ms).
- Payload verification now happens _before_ advancing the pre-state and writing it to disk! States were previously being written eagerly and adding ~20-30ms in front of verifying the execution payload. State catchup also sometimes takes ~500ms if we get a cache miss and need to rebuild the tree hash cache.
The remaining task that's taking substantial time (~20ms) is importing the block to fork choice. I _think_ this is because of pull-tips, and we should be able to optimise it out with a clever total active balance cache in the state (which would be computed in parallel with payload verification). I've decided to leave that for future work though. For now it can be observed via the new `beacon_block_processing_post_exec_pre_attestable_seconds` metric.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
#3704
## Proposed Changes
Adds is_syncing_finalized: bool parameter for block verification functions. Sets the payload_verification_status to Optimistic if is_syncing_finalized is true. Uses SyncState in NetworkGlobals in BeaconProcessor to retrieve the syncing status.
## Additional Info
I could implement FinalizedSignatureVerifiedBlock if you think it would be nicer.
## Issue Addressed
#3732
## Proposed Changes
Add a CLI flag to allow users to opt out of the restrictive permissions of the log files.
## Additional Info
This is not recommended for most users. The log files can contain sensitive information such as validator indices, public keys and API tokens (see #2438). However some users using a multi-user setup may find this helpful if they understand the risks involved.
## Issue Addressed
NA
## Proposed Changes
- Bump versions
- Pin the `nethermind` version since our method of getting the latest tags on `master` is giving us an old version (`1.14.1`).
- Increase timeout for execution engine startup.
## Additional Info
- [x] ~Awaiting further testing~
This PR adds some health endpoints for the beacon node and the validator client.
Specifically it adds the endpoint:
`/lighthouse/ui/health`
These are not entirely stable yet. But provide a base for modification for our UI.
These also may have issues with various platforms and may need modification.
## Issue Addressed
Part of https://github.com/sigp/lighthouse/issues/3651.
## Proposed Changes
Add a flag for enabling the light client server, which should be checked before gossip/RPC traffic is processed (e.g. https://github.com/sigp/lighthouse/pull/3693, https://github.com/sigp/lighthouse/pull/3711). The flag is available at runtime from `beacon_chain.config.enable_light_client_server`.
Additionally, a new method `BeaconChain::with_mutable_state_for_block` is added which I envisage being used for computing light client updates. Unfortunately its performance will be quite poor on average because it will only run quickly with access to the tree hash cache. Each slot the tree hash cache is only available for a brief window of time between the head block being processed and the state advance at 9s in the slot. When the state advance happens the cache is moved and mutated to get ready for the next slot, which makes it no longer useful for merkle proofs related to the head block. Rather than spend more time trying to optimise this I think we should continue prototyping with this code, and I'll make sure `tree-states` is ready to ship before we enable the light client server in prod (cf. https://github.com/sigp/lighthouse/pull/3206).
## Additional Info
I also fixed a bug in the implementation of `BeaconState::compute_merkle_proof` whereby the tree hash cache was moved with `.take()` but never put back with `.restore()`.
## Issue Addressed
This PR addresses partially #3651
## Proposed Changes
This PR adds the following methods:
* a new method to trait `TreeHash`, `hash_tree_leaves` which returns all the Merkle leaves of the ssz object.
* a new method to `BeaconState`: `compute_merkle_proof` which generates a specific merkle proof for given depth and index by using the `hash_tree_leaves` as leaves function.
## Additional Info
Now here is some rationale on why I decided to go down this route: adding a new function to commonly used trait is a pain but was necessary to make sure we have all merkle leaves for every object, that is why I just added `hash_tree_leaves` in the trait and not `compute_merkle_proof` as well. although it would make sense it gives us code duplication/harder review time and we just need it from one specific object in one specific usecase so not worth the effort YET. In my humble opinion.
Co-authored-by: Michael Sproul <micsproul@gmail.com>