* Adjust beacon node timeouts for validator client HTTP requests (#2352)
Resolves#2313
Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct.
This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can either be a constant value, or as a ratio of the slot duration.
Improve timeout performance by using these adjusted timeout duration's only whenever a fallback endpoint is available.
Add a CLI flag called `use-long-timeouts` to revert to the old behavior.
Additionally set the default `BeaconNodeHttpClient` timeouts to the be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Use read_recursive locks in database (#2417)
Closes#2245
Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.
* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
* Return more detail when invalid data is found in the DB during startup (#2445)
- Resolves#2444
Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.
NA
* Use hardware acceleration for SHA256 (#2426)
Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater).
The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in Ring, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). Ring is faster on CPUs that lack the extensions so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a miniscule penalty (see the benchmarks linked below).
* Start a release checklist (#2270)
NA
Add a checklist to the release draft created by CI. I know @michaelsproul was also working on this and I suspect @realbigsean also might have useful input.
NA
* Serious banning
* fmt
Co-authored-by: Mac L <mjladson@pm.me>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Proposed Changes
Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater).
## Additional Info
The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in Ring, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). Ring is faster on CPUs that lack the extensions so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a miniscule penalty (see the benchmarks linked below).
## Proposed Changes
Implement the consensus changes necessary for the upcoming Altair hard fork.
## Additional Info
This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.
Co-authored-by: realbigsean <seananderson33@gmail.com>
This updates some older dependencies to address a few cargo audit warnings.
The majority of warnings come from network dependencies which will be addressed in #2389.
This PR contains some minor dep updates that are not network related.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
Closes#1661
## Proposed Changes
Add a dummy package called `target_check` which gets compiled early in the build and fails if the target is 32-bit
## Additional Info
You can test the efficacy of this check with:
```
cross build --release --manifest-path lighthouse/Cargo.toml --target i686-unknown-linux-gnu
```
In which case this compilation error is shown:
```
error: Lighthouse requires a 64-bit CPU and operating system
--> common/target_check/src/lib.rs:8:1
|
8 | / assert_cfg!(
9 | | target_pointer_width = "64",
10 | | "Lighthouse requires a 64-bit CPU and operating system",
11 | | );
| |__^
```
## Issue Addressed
Closes#2354
## Proposed Changes
Add a `minify` method to `slashing_protection::Interchange` that keeps only the maximum-epoch attestation and maximum-slot block for each validator. Specifically, `minify` constructs "synthetic" attestations (with no `signing_root`) containing the maximum source epoch _and_ the maximum target epoch from the input. This is equivalent to the `minify_synth` algorithm that I've formally verified in this repository:
https://github.com/michaelsproul/slashing-proofs
## Additional Info
Includes the JSON loading optimisation from #2347
## Issue Addressed
NA
## Proposed Changes
I've noticed some of the SigP Prater nodes struggling on v1.4.0-rc.0. I suspect this is due to the changes in #2296. Specifically, the trade-off which lowered the memory footprint whilst increasing runtime on some functions.
Presently, this PR is documenting my testing on Prater.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Bump versions.
## Additional Info
This is not exactly the v1.4.0 release described in [Lighthouse Update #36](https://lighthouse.sigmaprime.io/update-36.html).
Whilst it contains:
- Beta Windows support
- A reduction in Eth1 queries
- A reduction in memory footprint
It does not contain:
- Altair
- Doppelganger Protection
- The remote signer
We have decided to release some features early. This is primarily due to the desire to allow users to benefit from the memory saving improvements as soon as possible.
## TODO
- [x] Wait for #2340, #2356 and #2376 to merge and then rebase on `unstable`.
- [x] Ensure discovery issues are fixed (see #2388)
- [x] Ensure https://github.com/sigp/lighthouse/pull/2382 is merged/removed.
- [x] Ensure https://github.com/sigp/lighthouse/pull/2383 is merged/removed.
- [x] Ensure https://github.com/sigp/lighthouse/pull/2384 is merged/removed.
- [ ] Double-check eth1 cache is carried between boots
## Issue Addressed
NA
## Proposed Changes
Reverts #2345 in the interests of getting v1.4.0 out this week. Once we have released that, we can go back to testing this again.
## Additional Info
NA
The ordering of adding new peers to the peerdb and deciding when to dial them was not considered in a previous update.
This adds the condition that if a peer is not in the peer-db then it is an acceptable peer to dial.
This makes #2374 obsolete.
## Issue Addressed
NA
## Proposed Changes
Modify the configuration of [GNU malloc](https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html) to reduce memory footprint.
- Set `M_ARENA_MAX` to 4.
- This reduces memory fragmentation at the cost of contention between threads.
- Set `M_MMAP_THRESHOLD` to 2mb
- This means that any allocation >= 2mb is allocated via an anonymous mmap, instead of on the heap/arena. This reduces memory fragmentation since we don't need to keep growing the heap to find big contiguous slabs of free memory.
- ~~Run `malloc_trim` every 60 seconds.~~
- ~~This shaves unused memory from the top of the heap, preventing the heap from constantly growing.~~
- Removed, see: https://github.com/sigp/lighthouse/pull/2299#issuecomment-825322646
*Note: this only provides memory savings on the Linux (glibc) platform.*
## Additional Info
I'm going to close#2288 in favor of this for the following reasons:
- I've managed to get the memory footprint *smaller* here than with jemalloc.
- This PR seems to be less of a dramatic change than bringing in the jemalloc dep.
- The changes in this PR are strictly runtime changes, so we can create CLI flags which disable them completely. Since this change is wide-reaching and complex, it's nice to have an easy "escape hatch" if there are undesired consequences.
## TODO
- [x] Allow configuration via CLI flags
- [x] Test on Mac
- [x] Test on RasPi.
- [x] Determine if GNU malloc is present?
- I'm not quite sure how to detect for glibc.. This issue suggests we can't really: https://github.com/rust-lang/rust/issues/33244
- [x] Make a clear argument regarding the affect of this on CPU utilization.
- [x] Test with higher `M_ARENA_MAX` values.
- [x] Test with longer trim intervals
- [x] Add some stats about memory savings
- [x] Remove `malloc_trim` calls & code
## Issue Addressed
Windows incompatibility.
## Proposed Changes
On windows, lighthouse needs to default to STDIN as tty doesn't exist. Also Windows uses ACLs for file permissions. So to mirror chmod 600, we will remove every entry in a file's ACL and add only a single SID that is an alias for the file owner.
Beyond that, there were several changes made to different unit tests because windows has slightly different error messages as well as frustrating nuances around killing a process :/
## Additional Info
Tested on my Windows VM and it appears to work, also compiled & tested on Linux with these changes. Permissions look correct on both platforms now. Just waiting for my validator to activate on Prater so I can test running full validator client on windows.
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
N/A
## Proposed Changes
Add unit tests for the various CLI flags associated with the beacon node and validator client. These changes require the addition of two new flags: `dump-config` and `immediate-shutdown`.
## Additional Info
Both `dump-config` and `immediate-shutdown` are marked as hidden since they should only be used in testing and other advanced use cases.
**Note:** This requires changing `main.rs` so that the flags can adjust the program behavior as necessary.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
#2276
## Proposed Changes
Add the `SensitiveUrl` struct which wraps `Url` and implements custom `Display` and `Debug` traits to redact user secrets from being logged in eth1 endpoints, beacon node endpoints and metrics.
## Additional Info
This also includes a small rewrite of the eth1 crate to make requests using `Url` instead of `&str`.
Some error messages have also been changed to remove `Url` data.
## Issue Addressed
`cargo audit` failing due to a vuln in `openssl`
## Proposed Changes
Updates to the `Cargo.lock` made as a result of running `cargo audit fix`
## Additional Info
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
NA
## Proposed Changes
Bump versions.
## Additional Info
This is a minor release (not patch) due to the very slight change introduced by #2291.
## Proposed Changes
Use two instances of max cover when packing attestations into blocks: one for the previous epoch, and one for the current epoch. This reduces the amount of computation done by roughly half due to the `O(n^2)` running time of max cover (`2 * (n/2)^2 = n^2/2`). This should help alleviate some load on block proposal, particularly on Prater.
## Issue Addressed
Resolves#2094
## Proposed Changes
Fixes scripts for creating local testnets. Adds an option in `lighthouse boot_node` to run with a previously generated enr.
## Issue Addressed
Closes#2052
## Proposed Changes
- Refactor the attester/proposer duties endpoints in the BN
- Performance improvements
- Fixes some potential inconsistencies with the dependent root fields.
- Removes `http_api::beacon_proposer_cache` and just uses the one on the `BeaconChain` instead.
- Move the code for the proposer/attester duties endpoints into separate files, for readability.
- Refactor the `DutiesService` in the VC
- Required to reduce the delay on broadcasting new blocks.
- Gets rid of the `ValidatorDuty` shim struct that came about when we adopted the standard API.
- Separate block/attestation duty tasks so that they don't block each other when one is slow.
- In the VC, use `PublicKeyBytes` to represent validators instead of `PublicKey`. `PublicKey` is a legit crypto object whilst `PublicKeyBytes` is just a byte-array, it's much faster to clone/hash `PublicKeyBytes` and this change has had a significant impact on runtimes.
- Unfortunately this has created lots of dust changes.
- In the BN, store `PublicKeyBytes` in the `beacon_proposer_cache` and allow access to them. The HTTP API always sends `PublicKeyBytes` over the wire and the conversion from `PublicKey` -> `PublickeyBytes` is non-trivial, especially when queries have 100s/1000s of validators (like Pyrmont).
- Add the `state_processing::state_advance` mod which dedups a lot of the "apply `n` skip slots to the state" code.
- This also fixes a bug with some functions which were failing to include a state root as per [this comment](072695284f/consensus/state_processing/src/state_advance.rs (L69-L74)). I couldn't find any instance of this bug that resulted in anything more severe than keying a shuffling cache by the wrong block root.
- Swap the VC block service to use `mpsc` from `tokio` instead of `futures`. This is consistent with the rest of the code base.
~~This PR *reduces* the size of the codebase 🎉~~ It *used* to reduce the size of the code base before I added more comments.
## Observations on Prymont
- Proposer duties times down from peaks of 450ms to consistent <1ms.
- Current epoch attester duties times down from >1s peaks to a consistent 20-30ms.
- Block production down from +600ms to 100-200ms.
## Additional Info
- ~~Blocked on #2241~~
- ~~Blocked on #2234~~
## TODO
- [x] ~~Refactor this into some smaller PRs?~~ Leaving this as-is for now.
- [x] Address `per_slot_processing` roots.
- [x] Investigate slow next epoch times. Not getting added to cache on block processing?
- [x] Consider [this](072695284f/beacon_node/store/src/hot_cold_store.rs (L811-L812)) in the scenario of replacing the state roots
Co-authored-by: pawan <pawandhananjay@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
This is a small PR which prevents unwanted bootnodes from being added to the DHT and being dialed when the `--disable-discovery` flag is set.
The main reason one would want to disable discovery is to connect to a fix set of peers. Currently, regardless of what the user does, Lighthouse will populate its DHT with previously known peers and also fill it with the spec's bootnodes. It will then dial the bootnodes that are capable of being dialed. This prevents testing with a fixed peer list.
This PR prevents these excess nodes from being added and dialed if the user has set `--disable-discovery`.
## Issue Addressed
Resolves#1788
## Proposed Changes
As per #1788, expose the time at which the process started via the `process_start_time_seconds` Prometheus metric. This will help users track uptime.
## Additional Info
NA
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
NA
## Problem this PR addresses
There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:
1. Gossip block 0xabc arrives ~200ms early
- It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
- However, it is not imported to our database since the block is early.
2. Attestations for 0xabc arrive, but the block was not imported.
- The peer that sent the attestation is down-voted.
- Each unknown-block attestation causes a score loss of 1, the peer is banned at -100.
- When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc).
## Potential solutions
I can think of three solutions to this:
1. Wait for attestation-queuing (#635) to arrive and solve this.
- Easy
- Not immediate fix.
- Whilst this would work, I don't think it's a perfect solution for this particular issue, rather (3) is better.
1. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`.
- Easy
- ~~I have implemented this, for now.~~
1. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot.
- More difficult
- Feels like the best solution, I will try to implement this.
**This PR takes approach (3).**
## Changes included
- Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them.
- Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event.
- In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic).
- Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attns, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester.
- To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect.
- The tests are a little racey, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed *any* failures due to races on my machine or on CI yet.
- To assist with testing I allowed for directly setting the time on the `ManualSlotClock`.
- I gave the `beacon_processor::Worker` a `Toolbox` for two reasons; (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.
## Proposed Changes
Somehow since Lighthouse v1.1.3 the behaviour of `git-describe` has changed so that it includes the version tag, the number of commits since that tag, _and_ the commit. According to the docs this is how it should always have behaved?? Weird!
https://git-scm.com/docs/git-describe/2.30.1
Anyway, this lead to `lighthouse_version` producing this monstrosity of a version string when building #2194:
```
Lighthouse/v1.1.3-v1.1.3-5-gac07
```
Observe it in the wild here: https://pyrmont.beaconcha.in/block/694880
Adding `--exclude="*"` prevents `git-describe` from trying to include the tag, and on that troublesome commit from #2194 it now produces the correct version string.
## Issue Addressed
- Resolves#2215
## Proposed Changes
Addresses a potential loop when the majority of peers indicate that we are contactable via an IPv6 address.
See https://github.com/sigp/discv5/pull/62 for further rationale.
## Additional Info
The alternative to this PR is to use `--disable-enr-auto-update` and then manually supply an `--enr-address` and `--enr-upd-port`. However, that requires the user to know their IP addresses in order for discovery to work properly. This might not be practical/achievable for some users, hence this hotfix.
## Issue Addressed
resolves#2129resolves#2099
addresses some of #1712
unblocks #2076
unblocks #2153
## Proposed Changes
- Updates all the dependencies mentioned in #2129, except for web3. They haven't merged their tokio 1.0 update because they are waiting on some dependencies of their own. Since we only use web3 in tests, I think updating it in a separate issue is fine. If they are able to merge soon though, I can update in this PR.
- Updates `tokio_util` to 0.6.2 and `bytes` to 1.0.1.
- We haven't made a discv5 release since merging tokio 1.0 updates so I'm using a commit rather than release atm. **Edit:** I think we should merge an update of `tokio_util` to 0.6.2 into discv5 before this release because it has panic fixes in `DelayQueue` --> PR in discv5: https://github.com/sigp/discv5/pull/58
## Additional Info
tokio 1.0 changes that required some changes in lighthouse:
- `interval.next().await.is_some()` -> `interval.tick().await`
- `sleep` future is now `!Unpin` -> https://github.com/tokio-rs/tokio/issues/3028
- `try_recv` has been temporarily removed from `mpsc` -> https://github.com/tokio-rs/tokio/issues/3350
- stream features have moved to `tokio-stream` and `broadcast::Receiver::into_stream()` has been temporarily removed -> `https://github.com/tokio-rs/tokio/issues/2870
- I've copied over the `BroadcastStream` wrapper from this PR, but can update to use `tokio-stream` once it's merged https://github.com/tokio-rs/tokio/pull/3384
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
NA
## Proposed Changes
Copied from #2083, changes the config milliseconds_per_slot to seconds_per_slot to avoid errors when slot duration is not a multiple of a second. To avoid deserializing old serialized data (with milliseconds instead of seconds) the Serialize and Deserialize derive got removed from the Spec struct (isn't currently used anyway).
This PR replaces #2083 for the purpose of fixing a merge conflict without requiring the input of @blacktemplar.
## Additional Info
NA
Co-authored-by: blacktemplar <blacktemplar@a1.net>