In previous network updates we made our libp2p connections leaner by limiting the maximum number of connections a Lighthouse node will accept before libp2p rejects new connections.
However, we still maintain the logic that, at maximum connections, we try to dial extra peers if they are needed by a validator client to publish messages on a specific subnet. These dials typically fail due to the libp2p connection limits.
This PR adds an extra factor, `PRIORITY_PEER_EXCESS`, which sets aside a new allocation of peers we are able to dial in case we need them for the validator client. This allocation sits alongside the excess peer allocation (which allows extra incoming peers on top of our target peer limit).
The drawback is that libp2p now allows extra peers to connect to us (beyond the standard peer limit), which the peer manager should subsequently reject.
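Roughly, the layered limits look like the following sketch (the factor values here are illustrative, not the exact constants used):

```rust
// Illustrative sketch of the layered peer limits; values are assumptions.
const PEER_EXCESS_FACTOR: f32 = 0.1; // extra inbound peers beyond the target
const PRIORITY_PEER_EXCESS: f32 = 0.05; // reserved headroom for priority dials

/// The libp2p connection limit, allowing for inbound excess.
fn max_peers(target_peers: usize) -> usize {
    (target_peers as f32 * (1.0 + PEER_EXCESS_FACTOR)).ceil() as usize
}

/// The higher limit used when dialing peers needed by the validator client.
fn max_priority_peers(target_peers: usize) -> usize {
    (target_peers as f32 * (1.0 + PEER_EXCESS_FACTOR + PRIORITY_PEER_EXCESS)).ceil() as usize
}
```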
This PR improves the general handling of peer banning.
Specifically, there were issues when multiple peers under a single IP connected to us after we had banned the IP for poor behaviour.
This PR now handles these peers gracefully, and also improves how we previously disconnected and banned peers.
The logic now goes as follows:
- Once a peer gets banned, it gets registered with its known IP addresses
- Once enough banned peers exist under a single IP, that IP is banned
- We retain connections with existing peers under this IP
- Any new connections under this IP are rejected
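A minimal sketch of the counting rule above (the threshold and names are assumptions, not the exact implementation):

```rust
use std::collections::{HashMap, HashSet};
use std::net::IpAddr;

// Assumed threshold; the real value may differ.
const BANNED_PEERS_PER_IP_THRESHOLD: usize = 5;

#[derive(Default)]
struct BannedPeersCount {
    banned_peers_per_ip: HashMap<IpAddr, usize>,
}

impl BannedPeersCount {
    /// Register a newly banned peer under all of its known IP addresses.
    fn add_banned_peer(&mut self, known_ips: &HashSet<IpAddr>) {
        for ip in known_ips {
            *self.banned_peers_per_ip.entry(*ip).or_default() += 1;
        }
    }

    /// New connections from a banned IP are rejected; existing connections
    /// under that IP are retained.
    fn ip_is_banned(&self, ip: &IpAddr) -> bool {
        self.banned_peers_per_ip
            .get(ip)
            .map_or(false, |count| *count >= BANNED_PEERS_PER_IP_THRESHOLD)
    }
}
```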
## Issue Addressed
N/A
## Proposed Changes
Add a fork_digest to `ForkContext` only if it is set in the config.
Reject gossip messages on post fork topics before the fork happens.
Edit: Instead of rejecting gossip messages on post fork topics, we now subscribe to post fork topics 2 slots before the fork.
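A minimal sketch of the early-subscription check (the constant name and exact condition are assumptions):

```rust
/// Subscribe to the new fork's topics two slots before the fork so that
/// messages published right at the fork boundary are not missed.
const EARLY_SUBSCRIBE_SLOTS: u64 = 2;

fn should_subscribe_to_fork_topics(current_slot: u64, fork_slot: u64) -> bool {
    current_slot + EARLY_SUBSCRIBE_SLOTS >= fork_slot
}
```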
Co-authored-by: Age Manning <Age@AgeManning.com>
# Description
A few changes have been made to discovery. In particular, the LRU cache that previously had O(N) reads/writes across all of our sessions (~5k) has been rewritten into a more reasonable hashmap-style O(1) structure.
Further, there have been reported issues in the current discv5, so error handling has been added to help identify the problem.
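For illustration, here is a minimal sketch of a hashmap-backed LRU with amortised O(1) operations via lazy deletion; the actual discv5 cache is a custom implementation with session timeouts and differs in detail:

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Amortised-O(1) LRU via lazy deletion: every access appends a fresh
/// (counter, key) pair to the log; stale pairs are skipped at eviction.
struct LruCache<K: Hash + Eq + Clone, V> {
    capacity: usize,
    counter: u64,
    map: HashMap<K, (V, u64)>, // value + last-access counter
    order: VecDeque<(u64, K)>, // access log, oldest first
}

impl<K: Hash + Eq + Clone, V> LruCache<K, V> {
    fn new(capacity: usize) -> Self {
        Self { capacity, counter: 0, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        self.counter += 1;
        let counter = self.counter;
        if let Some((_, stamp)) = self.map.get_mut(key) {
            *stamp = counter;
            self.order.push_back((counter, key.clone()));
        }
        self.map.get(key).map(|(value, _)| value)
    }

    fn put(&mut self, key: K, value: V) {
        self.counter += 1;
        self.map.insert(key.clone(), (value, self.counter));
        self.order.push_back((self.counter, key));
        while self.map.len() > self.capacity {
            // Pop log entries until one matches a key's latest access.
            if let Some((stamp, old_key)) = self.order.pop_front() {
                if self.map.get(&old_key).map(|(_, s)| *s) == Some(stamp) {
                    self.map.remove(&old_key);
                }
            }
        }
    }
}
```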
The identify network debug logs can get quite noisy, and they don't need to be printed on every request/response.
This PR reduces debug noise by only printing messages for identify messages that offer some new information.
## Issue Addressed
Related to: #2259
Made an attempt at all the necessary updates here to publish the crates to crates.io. I incremented the minor versions on all the crates that have been previously published. We might still run into some issues as we try to publish because I'm not able to test this out, but I think it's a good starting point.
## Proposed Changes
- Add description and license to `ssz_types` and `serde_util`
- Rename `serde_util` to `eth2_serde_util`
- Increment minor versions
- Remove path dependencies
- Remove patch dependencies
## Additional Info
Crates published:
- [x] `tree_hash` -- need to publish `tree_hash_derive` and `eth2_hashing` first
- [x] `eth2_ssz_types` -- need to publish `eth2_serde_util` first
- [x] `tree_hash_derive`
- [x] `eth2_ssz`
- [x] `eth2_ssz_derive`
- [x] `eth2_serde_util`
- [x] `eth2_hashing`
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
Resolves #2033
## Proposed Changes
Adds a flag to enable shutting down the beacon node right after sync is completed.
## Additional Info
Will need modification after weak subjectivity sync is enabled, to change the definition of a fully synced node.
## Issue Addressed
NA
## Proposed Changes
This PR expands the time that entries exist in the gossip-sub duplicate cache. Recent investigations found that this cache is one slot (12s) shorter than the period for which an attestation is permitted to propagate on the gossip network.
Before #2540, this was causing peers to be unnecessarily down-scored for sending old attestations. Although that issue has been fixed, the duplicate cache time is increased here to prevent such messages from getting any further up the networking stack than required.
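For reference, gossipsub's duplicate cache duration is configurable via the libp2p builder; a sketch with an illustrative sizing (the exact duration chosen in this PR may differ, and the builder API varies slightly between libp2p versions):

```rust
use libp2p::gossipsub::GossipsubConfigBuilder;
use std::time::Duration;

fn main() {
    let slot_duration = Duration::from_secs(12);
    // An attestation may propagate for an epoch of slots plus one; keep
    // duplicate entries in the cache for that whole window.
    let config = GossipsubConfigBuilder::default()
        .duplicate_cache_time(slot_duration * 33)
        .build()
        .expect("valid gossipsub config");
    println!("duplicate cache time: {:?}", config.duplicate_cache_time());
}
```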
## Additional Info
NA
Due to the Altair fork, we can now in principle subscribe to up to 148 topics. This exceeds our original limit, and we can end up rejecting subscriptions.
This PR increases the limit to account for the fork.
## Issue Addressed
N/A
## Proposed Changes
Refactor discovery queries such that only `QueryType::Subnet` queries are queued for grouping. `QueryType::FindPeers` is always made regardless of the number of active `Subnet` queries (max = 2). This prevents `FindPeers` queries from getting starved if `Subnet` queries start queuing up.
Also removes the `GroupedQueryType` struct and uses `QueryType` for all queuing and processing of discovery requests.
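To illustrate the dispatch rule, a simplified sketch (names and the queue structure here are assumptions; the limit matches the "max = 2" above):

```rust
use std::collections::VecDeque;

const MAX_CONCURRENT_SUBNET_QUERIES: usize = 2;

#[derive(Debug)]
enum QueryType {
    FindPeers,
    Subnet(u64), // subnet id
}

struct Discovery {
    active_subnet_queries: usize,
    queued_subnet_queries: VecDeque<QueryType>,
}

impl Discovery {
    fn dispatch(&mut self, query: QueryType) {
        match query {
            // FindPeers is started immediately and never queued, so a
            // backlog of subnet queries cannot starve it.
            QueryType::FindPeers => self.start(query),
            QueryType::Subnet(_) => {
                if self.active_subnet_queries < MAX_CONCURRENT_SUBNET_QUERIES {
                    self.active_subnet_queries += 1;
                    self.start(query);
                } else {
                    self.queued_subnet_queries.push_back(query);
                }
            }
        }
    }

    fn start(&mut self, query: QueryType) {
        println!("starting {:?}", query);
    }
}
```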
## Additional Info
Currently, no distinction is made between subnet discovery queries for attestation and sync committee subnets when grouping queries. Potentially we could prioritise attestation queries over sync committee queries in the future.
## Proposed Changes
* Consolidate Tokio versions: everything now uses the latest v1.10.0, no more `tokio-compat`.
* Many semver-compatible changes via `cargo update`. Notably, this upgrades from the yanked v0.8.0 version of `crossbeam-deque`, which is present in v1.5.0-rc.0.
* Many semver-incompatible upgrades via `cargo upgrades` and `cargo upgrade --workspace pkg_name`. Notable omissions:
- Prometheus, to be handled separately: https://github.com/sigp/lighthouse/issues/2485
- `rand`, `rand_xorshift`: the libsecp256k1 package requires 0.7.x, so we'll stick with that for now
- `ethereum-types` is pinned at 0.11.0 because that's what `web3` is using and it seems nice to have just a single version
## Additional Info
We still have two versions of `libp2p-core` due to `discv5` depending on the v0.29.0 release rather than `master`. AFAIK it should be OK to release in this state (cc @AgeManning )
## Issue Addressed
- Resolves #2457
- Resolves #2443
## Proposed Changes
Target the (presently unreleased) head of `libp2p/rust-libp2p:master` in order to obtain the fix from https://github.com/libp2p/rust-libp2p/pull/2175.
Additionally:
- `libsecp256k1` needed to be upgraded to satisfy the new version of `libp2p`.
- There were also a handful of minor changes to `eth2_libp2p` to suit some interface changes.
- Two `cargo audit --ignore` flags were removed due to libp2p upgrades.
## Additional Info
NA
## Issue Addressed
N/A
## Proposed Changes
- Removed a bunch of unnecessary references
- Renamed `Error::VariantError` to `Error::Variant`
- There were additional enum variant lints that I ignored, because I thought our variant names were fine
- Removed `MonitoredValidator`'s `pubkey` field, because I couldn't find it used anywhere. It looks like we just use the string version of the pubkey (the `id` field) if there is no index
## Additional Info
Co-authored-by: realbigsean <seananderson33@gmail.com>
* Adjust beacon node timeouts for validator client HTTP requests (#2352)
Resolves #2313
Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct.
This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can be either a constant value or a ratio of the slot duration.
Improve timeout performance by using these adjusted timeout durations only when a fallback endpoint is available.
Add a CLI flag called `use-long-timeouts` to revert to the old behavior.
Additionally set the default `BeaconNodeHttpClient` timeouts to be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications.
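A sketch of the shape such a struct might take (field names and ratios are assumptions, not the exact set used):

```rust
use std::time::Duration;

/// Hypothetical shape of the `Timeouts` struct; each category of BN call
/// gets its own duration.
pub struct Timeouts {
    pub attestation: Duration,
    pub attester_duties: Duration,
    pub proposal: Duration,
    pub proposer_duties: Duration,
}

impl Timeouts {
    /// The old behaviour: one long timeout for everything.
    pub fn set_all(timeout: Duration) -> Self {
        Self {
            attestation: timeout,
            attester_duties: timeout,
            proposal: timeout,
            proposer_duties: timeout,
        }
    }

    /// Per-call timeouts derived as ratios of the slot duration
    /// (the ratios shown are illustrative).
    pub fn from_slot_duration(slot_duration: Duration) -> Self {
        Self {
            attestation: slot_duration / 4,
            attester_duties: slot_duration / 4,
            proposal: slot_duration / 2,
            proposer_duties: slot_duration / 4,
        }
    }
}
```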
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Use read_recursive locks in database (#2417)
Closes #2245
Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.
* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
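For a small self-contained illustration of why `read_recursive` avoids the deadlock (using `parking_lot` directly; see the linked repo for fuller demos):

```rust
use parking_lot::RwLock;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u64));
    let outer = lock.read(); // first read lock, held by this thread

    // A writer queues up while the read guard is still held.
    let writer_lock = Arc::clone(&lock);
    let writer = thread::spawn(move || {
        *writer_lock.write() += 1;
    });
    thread::sleep(Duration::from_millis(100)); // let the writer block

    // `lock.read()` here would queue behind the pending writer (parking_lot
    // locks are fair) and deadlock against our own `outer` guard.
    // `read_recursive()` ignores queued writers and succeeds.
    let inner = lock.read_recursive();
    println!("value: {}", *inner);

    drop(inner);
    drop(outer);
    writer.join().unwrap();
}
```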
* Return more detail when invalid data is found in the DB during startup (#2445)
- Resolves #2444
Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.
NA
* Use hardware acceleration for SHA256 (#2426)
Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater).
The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in `ring`, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). `ring` is faster on CPUs that lack the extensions, so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a minuscule penalty (see the benchmarks linked below).
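A minimal sketch of the runtime switch, with a hypothetical function shape (the real `eth2_hashing` API and its caching of the CPU check differ):

```rust
use ring::digest::{digest, SHA256};
use sha2::{Digest, Sha256};

/// Dispatch to `sha2` when the SHA extensions are present, otherwise
/// fall back to `ring`, which is faster on CPUs without them.
pub fn hash(input: &[u8]) -> Vec<u8> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sha") {
            // `sha2` uses the SHA-NI intrinsics when the CPU supports them.
            return Sha256::digest(input).to_vec();
        }
    }
    digest(&SHA256, input).as_ref().to_vec()
}
```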
* Start a release checklist (#2270)
NA
Add a checklist to the release draft created by CI. I know @michaelsproul was also working on this, and I suspect @realbigsean might have useful input.
NA
* Serious banning
* fmt
Co-authored-by: Mac L <mjladson@pm.me>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Proposed Changes
Implement the consensus changes necessary for the upcoming Altair hard fork.
## Additional Info
This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.
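A heavily simplified sketch of the struct-to-enum shape (the real types are generic over the spec and carry many more fields and variants):

```rust
type Slot = u64;

pub struct BeaconStateBase {
    pub slot: Slot,
    // ... many more fields
}

pub struct BeaconStateAltair {
    pub slot: Slot,
    // ... many more fields, including the Altair additions
}

pub enum BeaconState {
    Base(BeaconStateBase),
    Altair(BeaconStateAltair),
}

impl BeaconState {
    /// Field access becomes a method that matches on the fork variant.
    pub fn slot(&self) -> Slot {
        match self {
            BeaconState::Base(state) => state.slot,
            BeaconState::Altair(state) => state.slot,
        }
    }
}
```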
Co-authored-by: realbigsean <seananderson33@gmail.com>
This updates some older dependencies to address a few cargo audit warnings.
The majority of warnings come from network dependencies which will be addressed in #2389.
This PR contains some minor dep updates that are not network related.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
`make lint` failing on rust 1.53.0.
## Proposed Changes
1.53.0 updates
## Additional Info
I haven't figured out why yet, but we are now hitting the recursion limit in a few crates, so I had to add `#![recursion_limit = "256"]` in a few places.
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
Reverts #2345 in the interests of getting v1.4.0 out this week. Once we have released that, we can go back to testing this again.
## Additional Info
NA
## Issue Addressed
#2325
## Proposed Changes
This pull request changes the behavior of the Peer Manager by including a minimum outbound-only peers requirement. The peer manager will continue querying for peers if this outbound-only target number hasn't been met. Additionally, when peers are being removed, an outbound-only peer will not be disconnected if doing so brings us below the minimum.
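A sketch of the two rules, with an assumed constant name and value:

```rust
// Assumed constant; the real minimum may differ.
const MIN_OUTBOUND_ONLY_PEERS: usize = 3;

/// Keep querying for peers while the outbound-only target is unmet.
fn needs_more_peers(peer_count: usize, target_peers: usize, outbound_only: usize) -> bool {
    peer_count < target_peers || outbound_only < MIN_OUTBOUND_ONLY_PEERS
}

/// When pruning excess peers, never remove an outbound-only peer if doing
/// so would drop us below the minimum.
fn may_disconnect(peer_is_outbound_only: bool, outbound_only: usize) -> bool {
    !peer_is_outbound_only || outbound_only > MIN_OUTBOUND_ONLY_PEERS
}
```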
## Additional Info
A unit test for the heartbeat function checks that the disconnection behavior is correct. The continual querying for peers when the outbound-only target hasn't been met is not tested directly, but indirectly via unit tests of the helper function that counts the number of outbound-only peers.
EDIT: I am concerned about the behavior of `update_peer_scores`. If we have connected to a peer with a score below the disconnection threshold (-20), then its connection status will remain connected while its score state changes to disconnected.
```rust
let previous_state = info.score_state();
// Update scores
info.score_update();
Self::handle_score_transitions(
previous_state,
peer_id,
info,
&mut to_ban_peers,
&mut to_unban_peers,
&mut self.events,
&self.log,
);
```
`previous_state` will be set to `Disconnected`, and then, because `handle_score_transitions` only changes a peer's connection status if the state changed, the peer remains connected. Then in the heartbeat code, because we only disconnect healthy peers if we have too many peers, these peers don't get disconnected. I'm not sure how often this scenario would realistically occur, but it might be better to adjust the logic to account for scenarios where the score state implies a connection status different from the current one.
Co-authored-by: Kevin Lu <kevlu93@gmail.com>
A previous update did not consider the ordering between adding new peers to the peer-db and deciding when to dial them.
This adds the condition that if a peer is not in the peer-db then it is an acceptable peer to dial.
This makes #2374 obsolete.
## Issue Addressed
The latest version of Rust has new clippy rules & the codebase isn't up to date with them.
## Proposed Changes
Small formatting changes that clippy tells me are functionally equivalent
## Issue Addressed
N/A
## Proposed Changes
Add unit tests for the various CLI flags associated with the beacon node and validator client. These changes require the addition of two new flags: `dump-config` and `immediate-shutdown`.
## Additional Info
Both `dump-config` and `immediate-shutdown` are marked as hidden since they should only be used in testing and other advanced use cases.
**Note:** This requires changing `main.rs` so that the flags can adjust the program behavior as necessary.
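A rough sketch of how such hidden flags can be declared with clap v2 (the exact definitions and help text in Lighthouse may differ):

```rust
use clap::{App, Arg};

fn main() {
    let matches = App::new("beacon_node")
        .arg(
            Arg::with_name("dump-config")
                .long("dump-config")
                .hidden(true)
                .takes_value(true)
                .help("Dump the effective config to the given path (testing only)"),
        )
        .arg(
            Arg::with_name("immediate-shutdown")
                .long("immediate-shutdown")
                .hidden(true)
                .help("Shut down immediately after initialisation (testing only)"),
        )
        .get_matches();

    if matches.is_present("immediate-shutdown") {
        return;
    }
}
```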
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
#2107
## Proposed Changes
The peer manager will mark peers as disconnected in the discv5 DHT when they disconnect or when a dial fails
## Additional Info
Rationale for this particular change is explained in my comment on #2107
## Issue Addressed
Resolves #2094
## Proposed Changes
Fixes scripts for creating local testnets. Adds an option in `lighthouse boot_node` to run with a previously generated enr.
## Issue Addressed
Closes #2274
## Proposed Changes
* Modify the `YamlConfig` to collect unknown fields into an `extra_fields` map, instead of failing hard.
* Log a debug message if there are extra fields returned to the VC from one of its BNs.
This restores Lighthouse's compatibility with Teku beacon nodes (and therefore Infura).
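A minimal sketch of the serde pattern used for this (field names illustrative):

```rust
use serde::Deserialize;
use serde_yaml::Value;
use std::collections::HashMap;

/// Unknown YAML keys land in `extra_fields` instead of causing a hard
/// deserialization failure.
#[derive(Deserialize)]
struct YamlConfig {
    config_name: String,
    #[serde(flatten)]
    extra_fields: HashMap<String, Value>,
}

fn main() {
    let yaml = "config_name: mainnet\nsome_new_teku_field: 42\n";
    let config: YamlConfig = serde_yaml::from_str(yaml).unwrap();
    if !config.extra_fields.is_empty() {
        // The VC logs a debug message when a BN returns unknown fields.
        eprintln!("unknown config fields: {:?}", config.extra_fields.keys());
    }
    println!("loaded config: {}", config.config_name);
}
```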
## Issue Addressed
- Resolves #2215
## Proposed Changes
Addresses a potential loop when the majority of peers indicate that we are contactable via an IPv6 address.
See https://github.com/sigp/discv5/pull/62 for further rationale.
## Additional Info
The alternative to this PR is to use `--disable-enr-auto-update` and then manually supply an `--enr-address` and `--enr-udp-port`. However, that requires the user to know their IP addresses in order for discovery to work properly. This might not be practical/achievable for some users, hence this hotfix.
## Issue Addressed
NA
## Proposed Changes
Rust 1.50 has landed 🎉
The shiny new `clippy` peers down upon us mere mortals with disgust. Brutish peasants wrapping our `usize`s in superfluous `Option`s... tsk tsk.
I've performed the goat sacrifice and corrected our evil ways in this PR. Tonight we shall pray that Github Actions bestows the almighty green tick upon us.
## Additional Info
NA
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
resolves #2129, resolves #2099
addresses some of #1712
unblocks #2076
unblocks #2153
## Proposed Changes
- Updates all the dependencies mentioned in #2129, except for web3. They haven't merged their tokio 1.0 update because they are waiting on some dependencies of their own. Since we only use web3 in tests, I think updating it in a separate issue is fine. If they are able to merge soon though, I can update in this PR.
- Updates `tokio_util` to 0.6.2 and `bytes` to 1.0.1.
- We haven't made a discv5 release since merging the tokio 1.0 updates, so I'm using a commit rather than a release atm. **Edit:** I think we should merge an update of `tokio_util` to 0.6.2 into discv5 before this release because it has panic fixes in `DelayQueue` --> PR in discv5: https://github.com/sigp/discv5/pull/58
## Additional Info
tokio 1.0 changes that required some changes in lighthouse:
- `interval.next().await.is_some()` -> `interval.tick().await` (see the sketch after this list)
- `sleep` future is now `!Unpin` -> https://github.com/tokio-rs/tokio/issues/3028
- `try_recv` has been temporarily removed from `mpsc` -> https://github.com/tokio-rs/tokio/issues/3350
- stream features have moved to `tokio-stream` and `broadcast::Receiver::into_stream()` has been temporarily removed -> https://github.com/tokio-rs/tokio/issues/2870
- I've copied over the `BroadcastStream` wrapper from this PR, but can update to use `tokio-stream` once it's merged https://github.com/tokio-rs/tokio/pull/3384
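As a concrete example of the interval change in the first item above (a minimal, runnable sketch):

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // tokio 1.0: `Interval` is no longer a `Stream`; use `tick()` instead
    // of `interval.next().await.is_some()`.
    let mut interval = tokio::time::interval(Duration::from_secs(1));
    for _ in 0..3 {
        interval.tick().await;
        println!("tick");
    }
}
```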
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
NA
## Proposed Changes
Fixes a bug that I missed during a review in #2163. I found this bug by observing that nodes were receiving far fewer attestations (~1/2 of previous).
I'm not certain *exactly* how this mistake manifested in a reduction in attestations, but the mistake touches so much code that I think it's reasonable to declare that it is the cause of the observed issue (drop in attestations).
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Copied from #2083, this changes the config `milliseconds_per_slot` to `seconds_per_slot` to avoid errors when the slot duration is not a multiple of a second. To avoid deserializing old serialized data (with milliseconds instead of seconds), the `Serialize` and `Deserialize` derives were removed from the `Spec` struct (they aren't currently used anyway).
This PR replaces #2083 for the purpose of fixing a merge conflict without requiring the input of @blacktemplar.
## Additional Info
NA
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
NA
## Proposed Changes
As per #2100, uses derives from the strum library to implement `AsRef<str>` and `AsStaticRef` to easily get `str` values from enums without creating new `String`s. Furthermore, unifies all attestation error counters into one `IntCounterVec`.
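For illustration, the strum derive pattern looks roughly like this (variant and label names are made up):

```rust
use strum_macros::AsRefStr;

#[derive(AsRefStr)]
enum AttestationError {
    #[strum(serialize = "invalid_signature")]
    InvalidSignature,
    #[strum(serialize = "past_slot")]
    PastSlot,
}

fn main() {
    let error = AttestationError::PastSlot;
    // `as_ref()` yields a `&'static str` without allocating a `String`,
    // suitable as a label for an `IntCounterVec`.
    assert_eq!(error.as_ref(), "past_slot");
}
```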
This work is originally by @blacktemplar; I've just created this PR so I can resolve some merge conflicts.
## Additional Info
NA
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
`test_dht_persistence` failing
## Proposed Changes
Bind the result of `NetworkService::start` to an underscore-prefixed variable rather than `_`, since `_` was causing it to be dropped immediately.
The test was failing 5/100 times before this update, but I haven't been able to get it to fail since updating it.
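A tiny self-contained demonstration of why `_` drops immediately while an underscore-prefixed binding does not:

```rust
struct Service;

impl Drop for Service {
    fn drop(&mut self) {
        println!("service dropped");
    }
}

fn main() {
    let _ = Service;        // `_` is not a binding: dropped immediately
    println!("after `let _`");
    let _service = Service; // underscore-prefixed binding: lives to end of scope
    println!("after `let _service`");
}
```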
Co-authored-by: realbigsean <seananderson33@gmail.com>